h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/

[BUG] Training DeepSeek-Coder-V2-Lite-Base MOE inordinately slow #764

Open tmostak opened 2 months ago

tmostak commented 2 months ago

🐛 Bug

I just started a training run for a full fine-tune of the DeepSeek-Coder-V2-Lite-Base MoE model (16B parameters, 2.4B active) on an 8x80GB A100 machine, and the LLM Studio UX says it's going to take nearly 3 days to finish. I have about 65K training pairs; for comparison, it takes 1.5-2 hours to train Llama 3 7B (full fine-tune) and maybe 16 hours to train Llama 3 70B. Any ideas on what might be going on? I know vLLM needed a patch to run the model, and I'm not sure whether there are optimizations needed in Torch that haven't landed yet to make it run more quickly.

Below is my nvidia-smi output during the run.

(base) ubuntu@207-211-184-180:~$ nvidia-smi
Wed Jun 26 23:23:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:08:00.0 Off |                    0 |
| N/A   51C    P0             202W / 400W |  52743MiB / 81920MiB |     94%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:09:00.0 Off |                    0 |
| N/A   48C    P0             214W / 400W |  53183MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   44C    P0             151W / 400W |  53149MiB / 81920MiB |     46%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   48C    P0             202W / 400W |  53529MiB / 81920MiB |     38%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:0C:00.0 Off |                    0 |
| N/A   49C    P0             153W / 400W |  53513MiB / 81920MiB |     79%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   44C    P0             211W / 400W |  53087MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:0E:00.0 Off |                    0 |
| N/A   47C    P0             215W / 400W |  53567MiB / 81920MiB |     73%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   50C    P0              98W / 400W |  52923MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1407823      C   ...nvs/h2o_llm_grid_search/bin/python3    52730MiB |
|    1   N/A  N/A   1407824      C   ...nvs/h2o_llm_grid_search/bin/python3    53170MiB |
|    2   N/A  N/A   1407825      C   ...nvs/h2o_llm_grid_search/bin/python3    53136MiB |
|    3   N/A  N/A   1407826      C   ...nvs/h2o_llm_grid_search/bin/python3    53516MiB |
|    4   N/A  N/A   1407827      C   ...nvs/h2o_llm_grid_search/bin/python3    53500MiB |
|    5   N/A  N/A   1407828      C   ...nvs/h2o_llm_grid_search/bin/python3    53074MiB |
|    6   N/A  N/A   1407829      C   ...nvs/h2o_llm_grid_search/bin/python3    53554MiB |
|    7   N/A  N/A   1407830      C   ...nvs/h2o_llm_grid_search/bin/python3    52910MiB |
+---------------------------------------------------------------------------------------+
psinger commented 1 month ago

I can confirm the same observation. Did you check whether a single-GPU run behaves differently?
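
(A minimal way to force a single-GPU comparison run, independent of any LLM Studio setting, is to mask the other devices before PyTorch is imported. The snippet below is a generic sketch of that, with GPU 0 chosen arbitrarily.)

import os

# Hide all but one GPU from the process; this must happen before torch is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

# Sanity check: the process should now see exactly one device.
print(torch.cuda.device_count())       # expected: 1
print(torch.cuda.get_device_name(0))   # expected: NVIDIA A100-SXM4-80GB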

But in general they use custom code for the model that is not directly integrated into HF. MoE models also frequently have their hiccups in terms of runtime.

If you have experience with their models, I'd be happy for some investigation and contributions.
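
For anyone looking into this, a small standalone timing script along the following lines can help establish whether the slowdown already shows up in the model's own Hub-hosted modeling code, outside of LLM Studio's training loop. This is only a sketch under assumptions: the Hugging Face model id, the tiny prompt, and bf16 precision are illustrative, and a full forward/backward pass of the 16B model may be tight on a single 80GB card.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"  # assumed HF model id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the model ships custom modeling code on the Hub
).cuda()

# One training-style step: forward with labels, then backward.
inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to("cuda")
labels = inputs["input_ids"].clone()

for step in range(3):
    torch.cuda.synchronize()
    t0 = time.time()
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    print(f"step {step}: loss={loss.item():.3f}, {time.time() - t0:.2f}s")

If the per-step times here are already far out of line with a comparable dense model (e.g. Llama 3 at the same sequence length), the bottleneck is likely in the MoE modeling code or its kernels rather than in LLM Studio itself.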