libffcv / ffcv-imagenet

Train ImageNet *fast* in 500 lines of code with FFCV
Apache License 2.0

Training extremely slow #11

Open netw0rkf10w opened 2 years ago

netw0rkf10w commented 2 years ago

Hello,

I followed closely the README and launched a training using the following command on a server with 8 V100 GPUs:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train_imagenet.py --config-file rn50_configs/rn50_88_epochs.yaml \
    --data.train_dataset=$HOME/data/imagenet_ffcv/train_500_0.50_90.ffcv \
    --data.val_dataset=$HOME/data/imagenet_ffcv/val_500_0.50_90.ffcv \
    --data.num_workers=3 --data.in_memory=1 \
    --logging.folder=$HOME/experiments/ffcv/rn50_88_epochs

Training took almost an hour per epoch, and the second epoch was almost as slow as the first one. The output of the log file is as follows:

cat ~/experiments/ffcv/rn50_88_epochs/d9ef0d7f-17a3-4e57-8d93-5e7c9a110d66/log 
{"timestamp": 1650641704.0822473, "relative_time": 2853.3256430625916, "current_lr": 0.8473609134615385, "top_1": 0.07225999981164932, "top_5": 0.19789999723434448, "val_time": 103.72948884963989, "train_loss": null, "epoch": 0}
{"timestamp": 1650644358.3394542, "relative_time": 5507.582849979401, "current_lr": 1.6972759134615385, "top_1": 0.16143999993801117, "top_5": 0.3677400052547455, "val_time": 92.9171462059021, "train_loss": null, "epoch": 1}

Is there anything I should check?

Thank you in advance for your response.

afzalxo commented 2 years ago

I had the same issue when I used the train_500_0.50_90.ffcv training dataset, since that file is over 300GB, larger than the 256GB of RAM I have on the server. The training speed is replicable with a smaller dataset file. I used train_400_0.50_90 and things seem to work as they should, but the accuracy is lower than reported.

This work assumes you have enough RAM on your server to store the entire dataset. Otherwise, I do not think it is helpful for distributed training, since QUASI_RANDOM sampling hasn't been implemented for the case where you can't cache the entire dataset in RAM.
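
To make the options concrete, this is roughly how the relevant arguments of the FFCV Loader fit together; a sketch only, with illustrative values, and with the image/label pipelines that train_imagenet.py builds omitted:

from ffcv.loader import Loader, OrderOption

loader = Loader('/path/to/train_500_0.50_90.ffcv',   # placeholder path
                batch_size=256,                      # illustrative value
                num_workers=3,
                os_cache=True,                # assumes the whole .ffcv file fits in RAM
                order=OrderOption.RANDOM,     # QUASI_RANDOM is the low-RAM ordering, but
                                              # (as described above) it is not available
                                              # when distributed=True
                distributed=True)             # one process per GPU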

netw0rkf10w commented 2 years ago

@afzalxo Thanks for the information. My server has 384GB of RAM while the size of train_500_0.50_90.ffcv is only 339GB, so I'm not sure if memory was really an issue.

The training speed is replicable with a smaller dataset file. I used train_400_0.50_90 and things seem to work as they should, but the accuracy is lower than reported.

This is a bit strange though. Could you share how much lower your accuracy is? The maximum resolution used by the training script is only 224x224 (more precisely, 192x192 for training and 224x224 for validation), so I wouldn't expect too much difference between re-scaling from 400x400 and re-scaling from 500x500.

This work assumes you have enough RAM on your server to store the entire dataset. Otherwise, I do not think it is helpful for distributed training, since QUASI_RANDOM sampling hasn't been implemented for the case where you can't cache the entire dataset in RAM.

So what you are saying is that if we don't use os_cache=True then there's no speedup compared to PyTorch's data loader, is that correct? @GuillaumeLeclerc, could you please confirm this?

afzalxo commented 2 years ago

My server has 384GB of RAM while the size of train_500_0.50_90.ffcv is only 339GB, so I'm not sure if memory was really an issue.

Interesting. I tried this experiment on two different servers, although both have less than 339GB of memory, and I faced the same issue as you on both. When I used write_mode=jpg rather than the mixed raw/JPEG mode (write_mode=proportion), the dataset file became much smaller, around 42GB for train_400_0.50_90.ffcv, and I got the expected speedup. Hence my conclusion was that memory size was the likely issue.
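
In case it helps, this is roughly how the smaller all-JPEG file can be written; a sketch only, with paths, the output file name, and num_workers as placeholders (if I remember correctly, the repo's write_imagenet.py exposes the same knobs through its config):

from ffcv.writer import DatasetWriter
from ffcv.fields import IntField, RGBImageField
from torchvision.datasets import ImageFolder

dataset = ImageFolder('/path/to/imagenet/train')      # placeholder path
writer = DatasetWriter('/path/to/train_400_jpg_90.ffcv', {
    'image': RGBImageField(write_mode='jpg',          # every image stored as JPEG
                           max_resolution=400,        # resize the longest side to 400px
                           jpeg_quality=90),
    'label': IntField(),
}, num_workers=16)
writer.from_indexed_dataset(dataset)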

Could you share how much lower your accuracy is?

I don't have exact numbers right now since the logs are in a messy state but the difference was around 2% IIRC. However I ran my experiments with 160px input throughout rather than with progressive resizing to 192px. So the accuracy gap is likely due to 1) Different input resolution and 2) Different input dataset configuration of write_mode=jpg with 400_0.50_90.

I wouldn't expect too much difference between re-scaling from 400x400 and re-scaling from 500x500

I'm not sure about this point. The authors do mention that

Generally larger side length will aid in accuracy but decrease throughput

So what you are saying is that if we don't use os_cache=True then there's no speedup compared to PyTorch's data loader

No, not at all. I am saying that os_cache=False is needed when the dataset is too large to fit in RAM, but since QUASI_RANDOM sampling has not yet been implemented for the dataloader when distributed=True, multi-GPU training cannot be used with os_cache=False. Hence, os_cache=True is needed for multi-GPU training. So if you have multiple GPUs but not enough RAM, you need to switch to a smaller dataset that fits.
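
In other words, the decision logic is roughly the following (a sketch; the helper name, the 0.9 headroom factor, and the Linux-only RAM query are my own assumptions, not anything from FFCV):

import os
from ffcv.loader import OrderOption

def pick_loader_options(dataset_path, distributed):
    # os_cache=True needs the whole .ffcv file to fit in RAM; QUASI_RANDOM (the
    # low-RAM ordering) is not implemented for distributed=True, so multi-GPU
    # training effectively requires os_cache=True and a dataset that fits in memory.
    file_size = os.path.getsize(dataset_path)
    total_ram = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')  # Linux only
    if file_size < 0.9 * total_ram:  # leave some headroom for everything else
        return {'os_cache': True, 'order': OrderOption.RANDOM}
    if distributed:
        raise RuntimeError('Dataset does not fit in RAM and QUASI_RANDOM is not '
                           'available with distributed=True; write a smaller .ffcv file.')
    return {'os_cache': False, 'order': OrderOption.QUASI_RANDOM}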

Why are you using num_workers=3? Isn't that a bit too low? Have you tried 8 or 12?

netw0rkf10w commented 2 years ago

@afzalxo Thanks for your reply.

Interesting. I tried this experiment on two different servers, although both have less than 339GB of memory, and I faced the same issue as you on both. When I used write_mode=jpg rather than the mixed raw/JPEG mode (write_mode=proportion), the dataset file became much smaller, around 42GB for train_400_0.50_90.ffcv, and I got the expected speedup. Hence my conclusion was that memory size was the likely issue.

I think you're probably right. 339GB is not that far from the limit of my server. I'll try again with write_mode=jpg.

I don't have exact numbers right now since the logs are in a messy state but the difference was around 2% IIRC. However I ran my experiments with 160px input throughout rather than with progressive resizing to 192px. So the accuracy gap is likely due to 1) Different input resolution and 2) Different input dataset configuration of write_mode=jpg with 400_0.50_90.

Makes sense! (As a side note, the 0.50 in the file name is a bit misleading, because with write_mode=jpg all the images are converted to JPEG, not just 50% of them.)

No, not at all. I am saying that os_cache=False is needed when the dataset is too large to fit in RAM, but since QUASI_RANDOM sampling has not yet been implemented for the dataloader when distributed=True, multi-GPU training cannot be used with os_cache=False. Hence, os_cache=True is needed for multi-GPU training. So if you have multiple GPUs but not enough RAM, you need to switch to a smaller dataset that fits.

I see. I didn't know that os_cache=False doesn't work for distributed training, thanks for the info.

Why are you using num_workers=3? Isn't that a bit too low? Have you tried 8 or 12?

That server only has 3 CPUs per GPU. There's another server I can use that has 10 CPUs per GPU but only about half the RAM of the former (~180GB). I guess I'll have to play a bit with the DatasetWriter parameters to find a suitable combination for my servers.
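
For what it's worth, here is the quick sanity check I'll run on each server before picking a configuration (just a sketch; the GPU count is hard-coded and the RAM query is Linux-only):

import os

n_gpus = 8
cores = len(os.sched_getaffinity(0))  # CPU cores actually available to this job
ram_gb = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / 1e9  # Linux only
print(f'{cores} cores -> roughly {cores // n_gpus} loader workers per GPU process')
print(f'{ram_gb:.0f} GB RAM -> the .ffcv file should stay comfortably below this')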