pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Very low wps with H200 GPUs #676

Open aniltrkkn opened 1 week ago

aniltrkkn commented 1 week ago

Hello, I am running multinode_trainer.slurm (llama3_70b.toml) on 4 nodes with 32 H200 GPUs in total. However, wps is only around 200. Any ideas what could be causing this slowness?

output.txt

multinode_trainer.slurm.txt

lessw2020 commented 1 week ago

Hi @aniltrkkn - hard to say without a trace, but most likely something is amiss with the between-node connections in your setup, and that is what is creating the slowness.

You mentioned you are using the multinode slurm script - are you running this on AWS? There are some settings in there intended to ensure EFA is used for cross-node comms, but they were never tested with H200s, as AWS did not have them at that point and we no longer have AWS cluster access.

AWS offers up to 3200 Gbps of cross-node bandwidth for H100/H200 instances, but the EFA settings may need to be adjusted for the new H200s.

A couple of options here: a - if you can confirm you are on AWS (please confirm exactly what you are using, i.e. which EC2 instance type, etc.), I can reach out to their SAs to review the multi-node slurm script and see what might need adjusting.

b - if you are not on AWS, you could adjust the script directly for your hardware. It assumes EFA is available, so it might need tuning to leverage your higher-speed node interconnect - assuming my guess is correct that the likely issue is the between-node network speed.

c - finally, you could rerun the same test as above but turn on profiling in the toml and grab a trace or two; that would confirm where the slowdown is. You can gzip-compress a trace and post it here - that should shrink it down to a small size - and I am happy to take a look.
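For reference, a minimal sketch of what enabling profiling in the job toml could look like. The section and field names below are assumptions based on the sample train_configs, so double-check them against your copy of llama3_70b.toml:

```toml
# Hypothetical profiling settings; verify the field names against your config.
[profiling]
enable_profiling = true
save_traces_folder = "profile_trace"  # where trace files are written
profile_freq = 10                     # capture a trace every 10 iterations
```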

lessw2020 commented 1 week ago

btw, a quick test would also be to do the same short run on llama3-8b with FSDP only on a single node and see how your wps looks there. That should be quite fast, but if it is also slow, then the issue is within-node rather than between-node, which would help ensure we are bisecting the issue properly.
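For illustration, a minimal sketch of that single-node run; the launch script and config path are assumptions based on the repo layout at the time, so adjust them to your checkout:

```bash
# Hypothetical single-node smoke test: 8 GPUs, llama3-8b config (FSDP only, no TP).
NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
```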

aniltrkkn commented 1 week ago

Hi @lessw2020, thank you very much for your response. Here are my answers to your questions:

I am getting high wps with single node 8B trainings.

We are not using AWS for training, so I need to check whether EFA is available in our training datacenter. However, our other multi-node training code works fine on the same cluster.

I am attaching the profile trace for one of the multi-node 70B trainings: profile_trace.tar.gz (it seems to contain a lot of cpu_op events; maybe that is the issue).

yifuwang commented 1 week ago

Hmm the slurm script you posted says CUDA_LAUNCH_BLOCKING=0, but the trace looked like it was run with CUDA_LAUNCH_BLOCKING=1. Could you double check this?

aniltrkkn commented 1 week ago

Hi @yifuwang, I set it to 0 and wps is still very low. It turns out we don't have EFA; we use InfiniBand. I tried our regular parameters:

```bash
export CUDA_LAUNCH_BLOCKING=0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=ibp
export NCCL_MIN_CTAS=32
export UCX_NET_DEVICES=ibp0:1,ibp1:1,ibp2:1,ibp3:1,ibp4:1,ibp5:1,ibp6:1,ibp7:1
export SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1
export NCCL_COLLNET_ENABLE=0
export NCCL_DEBUG=INFO
export NCCL_ALGO=NVLSTREE
```

but they also do not help. Is EFA absolutely necessary?

awgu commented 6 days ago

@aniltrkkn Could you share the new trace? Either way, you are heavily communication bound (e.g. the FSDP all-gather is >2x longer than the forward compute). Could you perhaps try HSDP (no TP) with 8-way sharding to keep the FSDP all-gather/reduce-scatter within the node?
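For illustration, a minimal sketch of what that layout could look like in the 70B toml for 4 nodes x 8 GPUs. The field names are assumptions based on torchtitan's config conventions, so check them against your version of the configs:

```toml
# Hypothetical HSDP layout for 32 GPUs (4 nodes x 8 GPUs); field names assumed.
[training]
data_parallel_replicate_degree = 4  # replicate across the 4 nodes
data_parallel_shard_degree = 8      # shard within each 8-GPU node
tensor_parallel_degree = 1          # no TP, per the suggestion above
```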