mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
1.09k stars · 136 forks

Suboptimal usage of 8xH100 GPUs - Streaming dataloader speed significantly fluctuates across batches #686

Open VSehwag opened 4 months ago

VSehwag commented 4 months ago

Setup

This issue is related to #643 but concerns a more subtle problem with Streaming datasets. Over the course of training, we observe that the streaming dataloader's speed randomly takes a dip. This happens predominantly between epochs but also at random steps. It holds for all model sizes we tested (from a single layer to a 1B-parameter model).

Our current setup includes:

Our overall setup is the same as the diffusion training from https://github.com/mosaicml/diffusion, i.e., we launch the script using composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-256.yaml

The epoch size is 652 batches, and at the epoch boundary the dataloader gets stuck and takes a long time. However, the drop in throughput also appears at random steps within an epoch. This test was done on a relatively small dataset with 1.3M (652x2048) samples.

[Charts: training throughput vs. training steps (left) and vs. wall-clock time (right)]

We also tested two other datasets with disk sizes of 1T and 2T (and corresponding epoch sizes of 2k and 5k batches) and observed the same drops in throughput. The plot on the right shows that the dataloader often hangs for more than a minute. Two subtle issues are happening here:

We have tried a prefetch factor of up to 8 (with eight workers) and 2 (with two workers) for both datasets and didn't observe any resolution of the drops. We also ablated the FSDP mode between full_shard and no_shard, but the dips in throughput persist. We used a 140M-parameter model, but the dips are also present with a 1B model. We created the MDS dataset by processing our raw data with 8 processes and then merging the index.json files.
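For reference, the conversion roughly follows the pattern below (a minimal sketch, not our actual script; the schema, output path, and load_partition helper are placeholders, and merge_index is assumed to be available in the installed streaming version):

```python
import os
from multiprocessing import Pool

from streaming import MDSWriter
from streaming.base.util import merge_index  # assumed available in the installed streaming version

OUT_ROOT = '/data/mds'                          # placeholder output root
COLUMNS = {'image': 'bytes', 'caption': 'str'}  # placeholder schema


def write_partition(rank: int) -> None:
    # Each process writes to its own subdirectory to avoid shard-name collisions.
    with MDSWriter(out=os.path.join(OUT_ROOT, str(rank)), columns=COLUMNS) as out:
        for sample in load_partition(rank):     # load_partition is a placeholder
            out.write(sample)


if __name__ == '__main__':
    with Pool(processes=8) as pool:
        pool.map(write_partition, range(8))
    # Merge the per-process index.json files into a single OUT_ROOT/index.json.
    merge_index(OUT_ROOT, keep_local=True)
```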

Somehow the issue is less severe in single-GPU training, where we only observe the drop in throughput at the end of an epoch (2k batches). In all previous tests we used 8-GPU DDP (with the grad_sharding FSDP config). Note that we don't observe a perfectly linear speedup (8.5 batch/s on 8 GPUs vs 1.4 batch/s on 1 GPU), which suggests we are I/O bound and could contribute to the worse dips in the multi-GPU setting.

Overall there are three puzzling questions:

snarayan21 commented 4 months ago

Thanks for raising this issue @VSehwag! So we have seen drops in throughput between epochs, but given that your data is residing locally, the time between epochs shouldn't be very long. Could you make sure that drop_last=True in your DataLoader, and could you also try setting persistent_workers=True for the DataLoader? This will keep workers alive between epochs and should help decrease the inter-epoch downtime.
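Concretely, something along these lines (a sketch; the local path, batch size, and worker counts are placeholders to adjust for your setup):

```python
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Placeholder values throughout -- adjust for your setup.
dataset = StreamingDataset(
    local='/data/mds',   # data already on local disk
    remote=None,
    shuffle=True,
    batch_size=256,      # per-device batch size
)

dataloader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,
    prefetch_factor=2,
    drop_last=True,           # keeps per-rank batch counts identical
    persistent_workers=True,  # keeps workers alive between epochs
)
```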

As for throughput drops during training, what is the memory usage of your GPUs during training? PyTorch has a known issue (a probable memory leak), so disabling automatic garbage collection and only running it once every N steps has resolved this sort of throughput issue in the past. You can also increase the StreamingDataset predownload argument to allow the dataset to download more samples in advance. Lastly, the Streaming Simulator can help you estimate what sort of training throughput you should expect, given your dataset and model iteration time.
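For the garbage-collection suggestion, the core idea is just this (a sketch; the interval and training step are placeholders):

```python
import gc

gc.disable()       # turn off automatic collection
GC_INTERVAL = 100  # placeholder; tune for your workload

for step, batch in enumerate(dataloader):
    train_step(batch)  # placeholder for your training step
    if step % GC_INTERVAL == 0:
        gc.collect()   # collect explicitly every N steps
```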

Hope this helps.

VSehwag commented 4 months ago

Thanks for looking into it. I am already using persistent workers and dropping the last batch, i.e., both drop_last=True and persistent_workers=True. I've observed that setting persistent workers to False increases the wait time between epochs. The major concern for us is the wait time at random steps during the epoch.

The memory usage of each GPU is ~35/80 GB. I am not sure any memory leak is happening, as we don't see GPU memory increasing over steps. Just to clarify, the data fully resides locally (remote is set to None), so the predownload setting and the cache limits and other bottlenecks identified by the simulator aren't applicable in this case.

snarayan21 commented 4 months ago

@VSehwag Ah right, I forgot your data was local. Since you're using Composer, you can refer to the scheduled GC callback we have in LLM Foundry, which should work for your training setup as well. Here's the documentation on how to enable callbacks. Even though GPU memory may not seem to increase, we have seen that only doing garbage collection at specified intervals resolves throughput fluctuation/degradation over the course of training.
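The core of that callback is roughly the following (a sketch, not the actual LLM Foundry code; the interval is a placeholder):

```python
import gc

from composer.core import Callback, State
from composer.loggers import Logger


class ScheduledGC(Callback):
    """Disable automatic GC and collect on a fixed batch interval instead."""

    def __init__(self, interval_batches: int = 100):
        self.interval_batches = interval_batches

    def fit_start(self, state: State, logger: Logger) -> None:
        gc.disable()

    def batch_end(self, state: State, logger: Logger) -> None:
        if state.timestamp.batch.value % self.interval_batches == 0:
            gc.collect()

    def fit_end(self, state: State, logger: Logger) -> None:
        gc.enable()


# Pass it to the Trainer (or wire it up via the callbacks section of your hydra yaml):
# trainer = Trainer(..., callbacks=[ScheduledGC(interval_batches=100)])
```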

To narrow down whether this issue is with Streaming, could you try training without Streaming and check the throughput?
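One quick way to do that is to swap in an in-memory synthetic dataset of the same shape, so any remaining throughput dips come from the model/trainer rather than the dataloader (a sketch; shapes and sizes are placeholders for whatever your model consumes):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder shapes: 2048 fake images plus pre-computed text embeddings.
images = torch.randn(2048, 3, 256, 256)
text_embeds = torch.randn(2048, 77, 1024)

baseline_loader = DataLoader(
    TensorDataset(images, text_embeds),
    batch_size=256,
    num_workers=8,
    shuffle=True,
    drop_last=True,
    persistent_workers=True,
)
```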

karan6181 commented 3 months ago

@VSehwag Wondering if you have gotten a chance to try @snarayan21's suggestion?

snarayan21 commented 1 month ago

Hey @VSehwag --

@XiaohanZhangCMU recently merged the above fix that addresses inter-epoch hangs. I'm wondering if it might also be affecting your runs. We're going to be cutting a new release soon, but in the meantime, feel free to retry from main!

vsehwag-sony commented 1 month ago

Thanks a lot for following up on the thread. Let me pull the latest changes and check if it resolves the issue.

XiaohanZhangCMU commented 3 weeks ago

Hey @VSehwag, I wanted to follow up here to see if the fix from the recent release works for your workloads.