h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/
https://h2o.ai
Apache License 2.0

[BUG] Exporting / downloading model larger, than VRAM available (trained with DeepSpeed) fails #670

Closed AlexanderZhk closed 1 month ago

AlexanderZhk commented 5 months ago

I trained a 33b model with DeepSpeed on 40GB cards. Based on the traceback, the model seems to be too large to fit into one GPU. Is it possible to fall back on the CPU for cases like this?

The .pth file is ~67 GB, so it obviously won't fit on a single GPU.
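A minimal illustration of the CPU fallback being asked for (the checkpoint path is hypothetical; this is not the app's export code): the exported .pth can be deserialized into host RAM instead of VRAM by mapping it to the CPU.

```python
# A minimal sketch, not the app's export code; the checkpoint path is hypothetical.
# torch.load with map_location="cpu" keeps the ~67 GB of weights in host RAM
# instead of materializing them on GPU 0.
import torch

ckpt_path = "/workspace/output/user/wrong_prompt_dividers_33b.1/checkpoint.pth"  # hypothetical
state_dict = torch.load(ckpt_path, map_location="cpu")
print(list(state_dict.keys())[:5])  # inspect a few parameter names
```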



script_sources: ['/_f/b7768783-3906-4c38-8849-ca80666c7f7b/tmpif8apv9l.min.js']
initialized: True
version: 1.5.0-dev
name: H2O LLM Studio
heap_mode: False
wave_utils_stack_trace_str: ### stacktrace
Traceback (most recent call last):

  File "/workspace/./llm_studio/app_utils/handlers.py", line 332, in handle
    await experiment_download_model(q)

  File "/workspace/./llm_studio/app_utils/sections/experiment.py", line 1627, in experiment_download_model
    cfg, model, tokenizer = load_cfg_model_tokenizer(

  File "/workspace/./llm_studio/app_utils/sections/chat.py", line 197, in load_cfg_model_tokenizer
    model = cfg.architecture.model_class(cfg)

  File "/workspace/./llm_studio/src/models/text_causal_language_modeling_model.py", line 32, in __init__
    self.backbone, self.backbone_config = create_nlp_backbone(

  File "/workspace/./llm_studio/src/utils/modeling_utils.py", line 784, in create_nlp_backbone
    backbone = model_class.from_config(config, **kwargs)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 435, in from_config
    return model_class._from_config(config, **kwargs)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1307, in _from_config
    model = cls(config, **kwargs)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1103, in __init__
    self.model = LlamaModel(config)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 699, in __init__
    self.self_attn = LLAMA_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 288, in __init__
    self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 98, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))

  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU 0 has a total capacity of 39.38 GiB of which 3.81 MiB is free. Process 268834 has 414.00 MiB memory in use. Process 274612 has 38.96 GiB memory in use. Of the allocated memory 38.27 GiB is allocated by PyTorch, and 219.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

q.user
q.client
app_db: <llm_studio.app_utils.db.Database object at 0x7fb9245d5d20>
client_initialized: True
mode_curr: error
theme_dark: True
credential_saver: .env File
default_aws_bucket_name: bucket_name
default_azure_conn_string: 
default_azure_container: 
default_kaggle_username: 
set_max_epochs: 50
set_max_batch_size: 256
set_max_gradient_clip: 10
set_max_lora_r: 256
set_max_lora_alpha: 256
gpu_used_for_chat: 1
default_number_of_workers: 8
default_logger: None
default_neptune_project: 
default_openai_azure: False
default_openai_api_base: https://example-endpoint.openai.azure.com
default_openai_api_deployment_id: deployment-name
default_openai_api_version: 2023-05-15
default_gpt_eval_max: 100
default_safe_serialization: True
delete_dialogs: True
chart_plot_max_points: 1000
init_interface: True
notification_bar: None
nav/active: experiment/list
experiment/list/mode: train
dataset/list/df_datasets:    id                          name  ... validation rows                    labels
5   7  FIXSynthetic_v1_withSeatable  ...              28                    answer
4   6     Synthetiv_v1_withSeatable  ...              28                    answer
3   5            SeatableValidation  ...              28                      ideal answer
2   4             synthetic_full_v1  ...            None                    answer
1   2                           dpo  ...            None                    chosen
0   1                         oasst  ...            None                    output

[6 rows x 10 columns]
experiment/list/df_experiments:      id  ...               info
31  109  ...  Runtime: 06:37:55
30  108  ...          OOM error
29  107  ...  Runtime: 02:44:26
28  106  ...  Runtime: 00:42:06
27  105  ...  Runtime: 01:23:17
26  104  ...  Runtime: 01:23:22
25  103  ...  Runtime: 01:01:08
24  102  ...  Runtime: 00:30:30
23   92  ...                   
22   88  ...  Runtime: 00:41:06
21   87  ...           See logs
20   85  ...  Runtime: 03:39:11
19   83  ...  Runtime: 01:42:29
18   82  ...  Runtime: 00:06:36
17   81  ...  Runtime: 01:42:20
16   66  ...  Runtime: 04:53:16
15   56  ...  Runtime: 00:35:02
14   55  ...  Runtime: 00:17:31
13   48  ...  Runtime: 00:28:52
12   43  ...  Runtime: 00:02:51
11   42  ...  Runtime: 00:25:08
10   41  ...  Runtime: 00:38:22
9    37  ...  Runtime: 00:40:52
8    21  ...  Runtime: 06:52:30
7    18  ...  Runtime: 00:40:25
6    17  ...  Runtime: 00:39:37
5    16  ...  Runtime: 02:01:39
4    13  ...  Runtime: 01:17:46
3    12  ...  Runtime: 02:37:27
2    11  ...  Runtime: 00:35:32
1    10  ...  Runtime: 02:21:06
0     8  ...  Runtime: 00:40:47

[32 rows x 16 columns]
expander: True
dataset/list: False
dataset/list/table: []
experiment/list: True
experiment/list/table: ['0']
__wave_submission_name__: report_error
experiment/list/refresh: False
experiment/list/compare: False
experiment/list/stop: False
experiment/list/delete: False
experiment/list/new: False
experiment/list/rename: False
experiment/list/stop/table: False
experiment/list/delete/table/dialog: False
experiment/display/id: 0
experiment/display/logs_path: None
experiment/display/preds_path: None
experiment/display/tab: experiment/display/charts
experiment/display/experiment_id: 109
experiment/display/experiment: <llm_studio.app_utils.db.Experiment object at 0x7fb923bcd1e0>
experiment/display/experiment_path: /workspace/output/user/wrong_prompt_dividers_33b.1/
experiment/display/charts: {'cfg': {'experiment_name': 'wrong_prompt_dividers_33b.1', 'llm_backbone': 'deepseek-ai/deepseek-coder-33b-instruct', 'personalize': False, 'chatbot_name': 'h2oGPT', 'chatbot_author': 'H2O.ai', 'train_dataframe': '/workspace/data/user/synthetic_full_v1/synthetic_full_v1.csv', 'validation_strategy': 'automatic', 'validation_dataframe': 'None', 'validation_size': 0.2, 'data_sample': 1.0, 'data_sample_choice': ['Train', 'Validation'], 'system_column': 'system', 'prompt_column': ('full user_prompt',), 'answer_column': 'answer', 'parent_id_column': 'None', 'text_system_start': '<|begin▁of▁sentence|>', 'text_prompt_start': 'Instruction:', 'text_answer_separator': 'Response:', 'limit_chained_samples': False, 'add_eos_token_to_system': False, 'add_eos_token_to_prompt': False, 'add_eos_token_to_answer': True, 'mask_prompt_labels': False, 'max_length_prompt': 16384, 'max_length_answer': 16384, 'max_length': 16384, 'add_prompt_answer_tokens': True, 'padding_quantile': 1.0, 'use_fast': False, 'backbone_dtype': 'float16', 'gradient_checkpointing': True, 'force_embedding_gradients': False, 'intermediate_dropout': 0.0, 'pretrained_weights': '', 'loss_function': 'TokenAveragedCrossEntropy', 'optimizer': 'Adam', 'learning_rate': 0.0001, 'differential_learning_rate_layers': [], 'differential_learning_rate': 1e-05, 'use_flash_attention_2': False, 'batch_size': 1, 'epochs': 5, 'schedule': 'Cosine', 'warmup_epochs': 0.0, 'weight_decay': 0.0, 'gradient_clip': 0.0, 'grad_accumulation': 1, 'lora': True, 'lora_r': 128, 'lora_alpha': 128, 'lora_dropout': 0.1, 'lora_target_modules': '', 'save_best_checkpoint': False, 'evaluation_epochs': 1.0, 'evaluate_before_training': False, 'train_validation_data': False, 'token_mask_probability': 0.0, 'skip_parent_probability': 0.0, 'random_parent_probability': 0.0, 'neftune_noise_alpha': 0.0, 'metric': 'BLEU', 'metric_gpt_model': 'gpt-3.5-turbo-0301', 'metric_gpt_template': 'general', 'min_length_inference': 2, 'max_length_inference': 256, 'max_time': 120.0, 'batch_size_inference': 0, 'do_sample': False, 'num_beams': 1, 'temperature': 0.0, 'repetition_penalty': 1.0, 'stop_tokens': '', 'top_k': 0, 'top_p': 1.0, 'gpus': ['0', '1', '2', '3'], 'mixed_precision': True, 'compile_model': False, 'use_deepspeed': True, 'deepspeed_method': 'ZeRO3', 'deepspeed_allgather_bucket_size': 1000000, 'deepspeed_reduce_bucket_size': 1000000, 'deepspeed_stage3_prefetch_bucket_size': 1000000, 'deepspeed_stage3_param_persistence_threshold': 1000000, 'find_unused_parameters': False, 'trust_remote_code': True, 'huggingface_branch': 'main', 'number_of_workers': 8, 'seed': -1, 'logger': 'None', 'neptune_project': ''}, 'validation': {'BLEU': {'steps': [436, 876, 1316, 1756, 2196], 'values': [5.185628121873809, 4.8247619005813105, 5.378364072682938, 5.021293141909062, 4.948287839356161]}}, 'df': {'train_data': '/workspace/output/user/wrong_prompt_dividers_33b.1/batch_viz.parquet', 'validation_predictions': '/workspace/output/user/wrong_prompt_dividers_33b.1/validation_viz.parquet'}, 'train': {'loss': {'steps': [], 'values': [1.0, 2.0, 3.0, 4.0, 5.0]}}}
experiment/display/refresh: False
experiment/display/download_logs: False
experiment/display/download_predictions: False
experiment/display/download_model: True
experiment/display/push_to_huggingface: False
experiment/list/current: False
home: False
report_error: True
q.events
q.args
report_error: True
__wave_submission_name__: report_error

Error
None

Git Version
fatal: not a git repository (or any of the parent directories): .git
AlexanderZhk commented 5 months ago

This could be solved fairly easily with something like the highlighted pseudocode in the attached screenshot, in llm_studio/app_utils/sections/experiment.py.

pascal-pfeiffer commented 5 months ago

Thank you for reporting. When pushing the model to the Hugging Face Hub or downloading it from the UI, the weights are automatically sharded into smaller chunks (by default safetensors shards of, I believe, 5 GB each).
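For illustration only (this is the generic transformers API, not necessarily the exact export code in H2O LLM Studio), the sharding described above corresponds to:

```python
# Illustrative sketch; the model name comes from this experiment's config,
# the output directory is hypothetical, and the call is the standard
# transformers save API.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-33b-instruct",
    torch_dtype="auto",  # keep the checkpoint dtype to limit host memory use
)
model.save_pretrained(
    "exported_model",         # hypothetical output directory
    safe_serialization=True,  # write .safetensors files instead of .bin
    max_shard_size="5GB",     # yields model-00001-of-000NN.safetensors, ...
)
```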

Is this happening only when using the weights from an old experiment to continue training with "Use previous experiment weights"?

AlexanderZhk commented 5 months ago

Is this happing only when using the weights from an old experiment to continue training with "Use previous experiment weights"?

No, this is happening with a "freshly" trained model; I haven't tried the "Use previous experiment weights" option yet. 'pretrained_weights' is '' in this experiment, which I suspect is what that option sets. It happens as soon as I click "Download model".

I'm pretty sure it is because it tries to load the whole model onto a single device, which is gpu[0].

Making this if statement evaluate to True (it sets the device to CPU), either by forking the code or by running an experiment in the background, lets me download the model: https://github.com/h2oai/h2o-llmstudio/blob/d65fffb3ffa453c519670a9f19ff20335ee7f2dc/llm_studio/app_utils/sections/experiment.py#L1621-L1630
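A minimal sketch of the kind of fallback described here (function and variable names are illustrative; the real condition lives in experiment.py around the linked lines):

```python
# Illustrative sketch of a CPU fallback; not the actual logic in
# llm_studio/app_utils/sections/experiment.py.
import torch

def pick_device(gpu_id: int, required_bytes: int) -> str:
    """Return the GPU only if it has enough free memory for the full model."""
    if torch.cuda.is_available():
        free_bytes, _total_bytes = torch.cuda.mem_get_info(gpu_id)
        if free_bytes > required_bytes:
            return f"cuda:{gpu_id}"
    # Fall back to CPU when the model cannot fit on a single GPU
    # (or when the GPUs are occupied by a running experiment).
    return "cpu"
```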

AlexanderZhk commented 5 months ago

Thank you for reporting. When pushing the model to the Hugging Face Hub or downloading it from the UI, the weights are automatically sharded into smaller chunks (by default safetensors shards of, I believe, 5 GB each).

Just double-checked: it "crashes" before reaching the sharding step.

pascal-pfeiffer commented 5 months ago

Sorry, I can't fully follow. Are you using a local model or a model from Hugging Face to start your experiment?

And what is the next step? You can load a sharded model across multiple GPUs using DeepSpeed. Training entirely on CPU is too slow, so we will not be supporting that in H2O LLM Studio.

psinger commented 5 months ago

When you are pushing a finished model to HF, you can choose the device (see the screenshot of the device selection).

AlexanderZhk commented 5 months ago

Sorry, I can't fully follow. Are you using a local model or a model from Hugging Face to start your experiment?

Starting with a model from HF.

And what is the next step? You can load a sharded model across multiple GPUs using DeepSpeed. Training entirely on CPU is too slow, so we will not be supporting that in H2O LLM Studio.

This happens after training the model.

The issue happens when downloading / exporting the model using the "Download model" button.

When you are pushing a finished model to HF, you can choose the device (see the screenshot of the device selection).

Is it possible to choose the device when downloading the model (not pushing to HF)?

psinger commented 5 months ago

We would then need to also add a device selection window there. Need to see how easily that's doable.

For now, the workaround is to hardcode the device in the code.
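A sketch of what hardcoding it can look like (the nn.Linear stands in for the model construction that actually happens in llm_studio/app_utils/sections/chat.py; the layer sizes are arbitrary):

```python
# Illustrative sketch only: creating modules inside a torch.device("cpu") context
# allocates their parameters in host RAM, which is the mechanism behind the
# torch/utils/_device.py frame in the traceback.
import torch
import torch.nn as nn

with torch.device("cpu"):
    layer = nn.Linear(4096, 4096)  # stand-in for instantiating the full backbone

print(layer.weight.device)  # cpu
```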

psinger commented 1 month ago

There is now a setting that should solve this: https://github.com/h2oai/h2o-llmstudio/pull/795