mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
1.07k stars · 136 forks

All processes allocate memory on rank 0 during StreamingDataset initialization in a distributed setting #716

Open ohallstrom opened 2 months ago

ohallstrom commented 2 months ago

Environment

To reproduce

Steps to reproduce the behavior: using the mosaic codebase to train mosaic-bert, with modules installed from requirements.txt, I run:

composer -n 8 main.py yamls/main/hf-bert-base-uncased.yaml

In requirements.txt, streaming is pinned to version 0.4.1, but I see the same behavior with the latest version of streaming (0.7.6).

Behaviour

When the StreamingDataset is initialized in this distributed setting, every process allocates some memory on rank 0, even though it uses a different rank during training. This leads to an uneven memory distribution across the GPUs, preventing us from using the ranks other than 0 to their full potential during training. See the result of nvidia-smi right after StreamingDataset initialization below:

[screenshot: nvidia-smi output after StreamingDataset initialization]

With configs other than yamls/main/hf-bert-base-uncased.yaml, the memory allocated by each process on rank 0 during StreamingDataset init can be as large as 2 GB, which with 8 GPUs means rank 0 has 16 GB more memory allocated than the other ranks during training. See the example below of GPU memory usage during training:

[screenshot: GPU memory usage during training]

(had to mask out process names for privacy reasons)
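A common cause of this symptom in general (not confirmed to be the cause here) is that each worker's first CUDA call happens before the process is bound to its own device, so every worker creates its CUDA context on GPU 0. The usual guard is to bind each process to its local device as early as possible. A minimal sketch, assuming a launcher that exports LOCAL_RANK for each worker (as composer does); the helper name `resolve_local_device` is hypothetical:

```python
import os

def resolve_local_device(default: str = "cpu") -> str:
    """Map a launcher-provided LOCAL_RANK to a per-process device string.

    Binding each worker to cuda:{LOCAL_RANK} *before* any CUDA call
    prevents every worker from creating its CUDA context on GPU 0.
    """
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is None:
        return default
    return f"cuda:{int(local_rank)}"

# In each worker, before anything touches CUDA (requires torch; shown as a
# comment since this sketch is torch-free):
#   import torch
#   torch.cuda.set_device(resolve_local_device())
```

If the examples repo's data path constructs any CUDA tensor before this binding happens, that would produce exactly the per-process allocation on GPU 0 described above.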

XiaohanZhangCMU commented 2 months ago

The examples repo is out of date. Have you tried our latest stack, llm-foundry (which still uses composer as the engine)? For example, you can find example yamls here. Let us know if you still run into the same issue when using llm-foundry.

warner-benjamin commented 1 month ago

@XiaohanZhangCMU On a two GPU machine it appears that llm-foundry doesn't allocate extra memory when using a streaming dataset, but the examples repo does. Do you have an idea of what streaming dataset initialization settings we'd need to change to match llm-foundry?

XiaohanZhangCMU commented 1 month ago

Can you elaborate a bit on "llm-foundry doesn't allocate extra memory"? llm-foundry uses composer as the launcher under the hood, so anything you tried with composer should work similarly.

The examples repo has not been actively developed, so I would suggest not focusing on those examples. We use llm-foundry + streaming all the time, and I don't think you need to change anything specific for llm-foundry. If you do need to implement your own dataset on top of the streaming base class, take a look at the "FinetuningStreamingDataset" here as an example.

warner-benjamin commented 1 month ago

As ohallstrom mentioned, we forked from the mosaic/examples repo to train mosaic-bert since llm-foundry doesn't have it (and there's nothing in the examples readme that says not to use it).

Using the streaming dataset from the mosaic/examples code causes the extra memory usage on the rank 0 GPU that ohallstrom mentioned, but my test using the streaming dataset from the llm-foundry code doesn't allocate any additional GPU memory (at least on a two-GPU machine).

In both cases, I'm using composer + streaming, but something isn't working right with the mosaic/examples streaming dataset. We're too far along to switch to llm-foundry, so I'm hoping you can point out what might be different in the mosaic/examples streaming setup that could be causing the extra memory usage.
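To make the two-setup comparison concrete, one way to quantify the imbalance is to count compute processes per GPU from nvidia-smi output rather than eyeballing the table. A small sketch, assuming the standard `--query-compute-apps` CSV output; the function name `processes_per_gpu` is mine:

```python
from collections import Counter

def processes_per_gpu(csv_lines):
    """Tally compute processes per GPU from lines of the form
    'gpu_uuid, pid, used_memory [MiB]', as produced by:

      nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader

    Returns a {gpu_uuid: process_count} dict.
    """
    counts = Counter()
    for line in csv_lines:
        line = line.strip()
        if not line:
            continue
        gpu_uuid = line.split(",")[0].strip()
        counts[gpu_uuid] += 1
    return dict(counts)
```

If GPU 0's UUID shows one entry per rank while every other GPU shows a single entry, that confirms all workers hold memory on GPU 0, matching the screenshots in the original report.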