h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/

[BUG] Training DeepSeek-Coder-V2-Lite-Base MOE inordinately slow #764

Open tmostak opened 2 months ago

tmostak commented 2 months ago

🐛 Bug

I just started a training run for a full fine-tune of the DeepSeek-Coder-V2-Lite-Base MoE model (16B parameters, 2.4B active) on an 8x80GB A100 machine, and the LLM Studio UX says it's going to take nearly 3 days to finish. I have about 65K training pairs; for comparison, it takes 1.5-2 hours to train Llama 3 7B (full fine-tune) and maybe 16 hours to train Llama 3 70B. Any ideas on what might be going on? I know vLLM needed a patch to run the model, and I'm not sure whether there are optimizations needed in Torch that haven't landed yet to make it run more quickly.

Below is my nvidia-smi output during the run.

(base) ubuntu@207-211-184-180:~$ nvidia-smi
Wed Jun 26 23:23:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:08:00.0 Off |                    0 |
| N/A   51C    P0             202W / 400W |  52743MiB / 81920MiB |     94%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:09:00.0 Off |                    0 |
| N/A   48C    P0             214W / 400W |  53183MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   44C    P0             151W / 400W |  53149MiB / 81920MiB |     46%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   48C    P0             202W / 400W |  53529MiB / 81920MiB |     38%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:0C:00.0 Off |                    0 |
| N/A   49C    P0             153W / 400W |  53513MiB / 81920MiB |     79%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   44C    P0             211W / 400W |  53087MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:0E:00.0 Off |                    0 |
| N/A   47C    P0             215W / 400W |  53567MiB / 81920MiB |     73%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   50C    P0              98W / 400W |  52923MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1407823      C   ...nvs/h2o_llm_grid_search/bin/python3    52730MiB |
|    1   N/A  N/A   1407824      C   ...nvs/h2o_llm_grid_search/bin/python3    53170MiB |
|    2   N/A  N/A   1407825      C   ...nvs/h2o_llm_grid_search/bin/python3    53136MiB |
|    3   N/A  N/A   1407826      C   ...nvs/h2o_llm_grid_search/bin/python3    53516MiB |
|    4   N/A  N/A   1407827      C   ...nvs/h2o_llm_grid_search/bin/python3    53500MiB |
|    5   N/A  N/A   1407828      C   ...nvs/h2o_llm_grid_search/bin/python3    53074MiB |
|    6   N/A  N/A   1407829      C   ...nvs/h2o_llm_grid_search/bin/python3    53554MiB |
|    7   N/A  N/A   1407830      C   ...nvs/h2o_llm_grid_search/bin/python3    52910MiB |
+---------------------------------------------------------------------------------------+
psinger commented 1 month ago

I can confirm the same observation. Did you check whether a single-GPU run behaves differently?
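
(A minimal way to force a single-GPU comparison run, independent of any LLM Studio setting, is to mask the other devices before PyTorch is imported. The snippet below is a generic sketch of that, with GPU 0 chosen arbitrarily.)

import os

# Hide all but one GPU from the process; this must happen before torch is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

# Sanity check: the process should now see exactly one device.
print(torch.cuda.device_count())       # expected: 1
print(torch.cuda.get_device_name(0))   # expected: NVIDIA A100-SXM4-80GB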

But in general they use custom code for the model that is not directly integrated into HF. MoE models also frequently have their hiccups in terms of runtime.

If you have experience with their models, I'd be happy for some investigation and contributions.
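
For anyone looking into this, a small standalone timing script along the following lines can help establish whether the slowdown already shows up in the model's own Hub-hosted modeling code, outside of LLM Studio's training loop. This is only a sketch under assumptions: the Hugging Face model id, the tiny prompt, and bf16 precision are illustrative, and a full forward/backward pass of the 16B model may be tight on a single 80GB card.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"  # assumed HF model id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the model ships custom modeling code on the Hub
).cuda()

# One training-style step: forward with labels, then backward.
inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to("cuda")
labels = inputs["input_ids"].clone()

for step in range(3):
    torch.cuda.synchronize()
    t0 = time.time()
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    print(f"step {step}: loss={loss.item():.3f}, {time.time() - t0:.2f}s")

If the per-step times here are already far out of line with a comparable dense model (e.g. Llama 3 at the same sequence length), the bottleneck is likely in the MoE modeling code or its kernels rather than in LLM Studio itself.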