netw0rkf10w opened this issue 2 years ago
I had the same issue when I used the `train_500_0.50_90.ffcv` train dataset configuration, since the dataset file size was over 300GB, larger than the 256GB of memory that I have on the server. The training speed is replicable for a smaller dataset file size: I used `train_400_0.50_90` and things seem to be working as they should, but the accuracy is lower than reported.
This work assumes you have enough RAM on your server to store the entire dataset. Otherwise, I do not think this work is helpful for distributed training, since `QUASI_RANDOM` sampling hasn't been implemented for the case where you can't cache your entire dataset in RAM.
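For reference, my understanding of the file naming is that the three numbers are the maximum side length, the fraction of images stored as JPEG, and the JPEG quality, i.e. the dataset is written with something like the sketch below (keyword names such as `jpeg_quality` are my assumption here and may differ between FFCV versions):

```python
# Sketch only: how a file like train_500_0.50_90.ffcv would be written.
# 500 = max_resolution, 0.50 = compress_probability, 90 = JPEG quality.
from torchvision.datasets import ImageFolder
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField

dataset = ImageFolder('/path/to/imagenet/train')  # placeholder path

writer = DatasetWriter('train_500_0.50_90.ffcv', {
    # 'proportion' (the "mix" mode): each image is JPEG-encoded with
    # probability 0.50, otherwise stored as raw pixels.
    'image': RGBImageField(write_mode='proportion',
                           max_resolution=500,
                           compress_probability=0.50,
                           jpeg_quality=90),
    'label': IntField(),
}, num_workers=16)

writer.from_indexed_dataset(dataset)
```

Using `train_400_0.50_90` just means the same thing with `max_resolution=400`, which is why that file is smaller.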
@afzalxo Thanks for the information. My server has 384GB of RAM while the size of `train_500_0.50_90.ffcv` is only 339GB, so I'm not sure if memory was really an issue.
> The training speed is replicable for a smaller dataset file size. I utilized `train_400_0.50_90` and things seem to be working as they should but the accuracy is lower than reported.
This is a bit strange though. Could you share how much lower your accuracy is? The maximum resolution used by the training script is only 224x224 (more precisely, 192x192 for training and 224x224 for validation), so I wouldn't expect too much difference between re-scaling from 400x400 and re-scaling from 500x500.
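For context, those resolutions come from the image decoders in the data pipelines, roughly as in the sketch below (normalization and device transforms are omitted, and class/argument names should be checked against your FFCV version):

```python
# Sketch of the decoder settings behind the 192/224 resolutions mentioned above
# (normalization, device placement, etc. are omitted for brevity).
from ffcv.fields.decoders import (RandomResizedCropRGBImageDecoder,
                                  CenterCropRGBImageDecoder, IntDecoder)
from ffcv.transforms import ToTensor, ToTorchImage, RandomHorizontalFlip

train_image_pipeline = [
    RandomResizedCropRGBImageDecoder((192, 192)),   # final training resolution
    RandomHorizontalFlip(),
    ToTensor(),
    ToTorchImage(),
]

val_image_pipeline = [
    CenterCropRGBImageDecoder((224, 224), ratio=224 / 256),  # validation crop
    ToTensor(),
    ToTorchImage(),
]

label_pipeline = [IntDecoder(), ToTensor()]
```

Since both decoders only ever produce crops of at most 224x224, that is why I wouldn't expect re-scaling from 400px versus 500px originals to change the result by much.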
> This work assumes you have enough RAM on your server to store the entire dataset. Otherwise, I do not think this work is helpful for distributed training since `QUASI_RANDOM` sampling hasn't been implemented for the case when you can't cache your entire dataset in RAM.
So what you are saying is that if we don't use `os_cache=True`, then there's no speedup compared to PyTorch's data loader, is that correct? @GuillaumeLeclerc, could you please confirm this?
> My server has 384GB of RAM while the size of `train_500_0.50_90.ffcv` is only 339GB, so I'm not sure if memory was really an issue.
Interesting. I tried this experiment on two different servers, although both have less memory than 339GB, and I faced the same issue as you on both. When I used `write_mode=jpg` rather than the mix/`proportion` mode, the dataset file became much smaller, at around 42GB for `train_400_0.50_90.ffcv`, and I got the expected speedup. Hence my conclusion was that memory size was the likely issue.
> Could you share how much lower your accuracy is?
I don't have exact numbers right now since the logs are in a messy state, but the difference was around 2% IIRC. However, I ran my experiments with 160px input throughout rather than with progressive resizing to 192px. So the accuracy gap is likely due to 1) the different input resolution and 2) the different input dataset configuration of `write_mode=jpg` with `400_0.50_90`.
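(For reference, by progressive resizing I mean a resolution ramp along the lines of the sketch below; the ramp epochs are illustrative only, and I trained at a fixed 160px instead.)

```python
import numpy as np

# Sketch of a progressive-resizing schedule: linearly ramp the training
# resolution from min_res to max_res between two epochs, rounded to a
# multiple of 32. The ramp boundaries here are illustrative, not the
# repository's defaults.
def get_resolution(epoch, min_res=160, max_res=192, start_ramp=10, end_ramp=13):
    if epoch <= start_ramp:
        return min_res
    if epoch >= end_ramp:
        return max_res
    interp = np.interp([epoch], [start_ramp, end_ramp], [min_res, max_res])
    return int(np.round(interp[0] / 32)) * 32

print([get_resolution(e) for e in range(16)])
# -> 160 until the ramp, then 192 once the ramp finishes
```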
> I wouldn't expect too much difference between re-scaling from 400x400 and re-scaling from 500x500

I'm not sure about this point. The authors do mention that:

> Generally larger side length will aid in accuracy but decrease throughput
> So what you are saying is that if we don't use `os_cache=True`, then there's no speedup compared to PyTorch's data loader
No, not at all. I am saying that `os_cache=False` is needed when the dataset size is too large, but since `QUASI_RANDOM` sampling has not yet been implemented for the dataloader when `distributed=True`, multi-GPU training cannot be used with `os_cache=False`. Hence, `os_cache=True` is needed for multi-GPU training. So if you have multiple GPUs but not enough RAM, you need to switch to a smaller dataset that can fit.
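To make the two configurations concrete, here is a sketch (batch size and worker count are arbitrary, and the argument names should be double-checked against your FFCV version):

```python
from ffcv.loader import Loader, OrderOption

# Option A: dataset fits in RAM -> cache it and use fully random sampling.
# This is the combination that works with distributed (multi-GPU) training.
loader_cached = Loader('train_500_0.50_90.ffcv',
                       batch_size=256, num_workers=8,
                       os_cache=True,
                       order=OrderOption.RANDOM,
                       distributed=True)

# Option B: dataset does not fit in RAM -> read from disk with quasi-random
# sampling. As explained above, this currently rules out distributed=True.
loader_uncached = Loader('train_500_0.50_90.ffcv',
                         batch_size=256, num_workers=8,
                         os_cache=False,
                         order=OrderOption.QUASI_RANDOM,
                         distributed=False)
```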
Why are you using `num_workers=3`? Isn't that a bit too low? Have you tried 8 or 12?
@afzalxo Thanks for your reply.
> Interesting. I tried this experiment on two different servers, although both have less memory than 339GB, and I faced the same issue as you on both. When I used `write_mode=jpg` rather than the mix/`proportion` mode, the dataset file became much smaller, at around 42GB for `train_400_0.50_90.ffcv`, and I got the expected speedup. Hence my conclusion was that memory size was the likely issue.
I think you're probably right. 339GB is not that far from the limit of my server. I'll try again with `write_mode=jpg`.
> I don't have exact numbers right now since the logs are in a messy state, but the difference was around 2% IIRC. However, I ran my experiments with 160px input throughout rather than with progressive resizing to 192px. So the accuracy gap is likely due to 1) the different input resolution and 2) the different input dataset configuration of `write_mode=jpg` with `400_0.50_90`.
Makes sense!
(As a side note, the `0.50` in the file name is a bit misleading, because when `write_mode=jpg`, all the images are converted to JPEG and not just 50% of them.)
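To spell that out (a simplified illustration, not FFCV's actual implementation): the `0.50` only has an effect in the mix/`proportion` mode, where it is the probability that a given image gets JPEG-encoded rather than stored raw.

```python
import random

# Simplified illustration of the write modes discussed above
# (not FFCV's actual implementation).
def storage_for_image(write_mode: str, compress_probability: float) -> str:
    if write_mode == 'raw':
        return 'raw pixels'
    if write_mode == 'jpg':
        # every image is JPEG-encoded; compress_probability is irrelevant
        return 'JPEG'
    if write_mode == 'proportion':
        # "mix" mode: roughly compress_probability of the images become JPEG
        return 'JPEG' if random.random() < compress_probability else 'raw pixels'
    raise ValueError(f'unknown write_mode: {write_mode}')

print(storage_for_image('jpg', 0.50))         # always JPEG
print(storage_for_image('proportion', 0.50))  # JPEG about half the time
```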
> No, not at all. I am saying that `os_cache=False` is needed when the dataset size is too large, but since `QUASI_RANDOM` sampling has not yet been implemented for the dataloader when `distributed=True`, multi-GPU training cannot be used with `os_cache=False`. Hence, `os_cache=True` is needed for multi-GPU training. So if you have multiple GPUs but not enough RAM, you need to switch to a smaller dataset that can fit.
I see. I didn't know that `os_cache=False` doesn't work for distributed training, thanks for the info.
> Why are you using `num_workers=3`? Isn't that a bit too low? Have you tried 8 or 12?
That server only has 3 CPUs per GPU. There's another server I can use that has 10 CPUs per GPU but only half the amount of RAM of the former (~180GB). I guess I'll have to play a bit with the parameters of `DatasetWriter` to find a suitable combination for my servers.
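In case it's useful, this is the rough rule of thumb I'd use for picking `num_workers` per training process (just a sketch of the arithmetic):

```python
import os

# Sketch: choose a per-process num_workers value from the CPU budget, assuming
# one training process per GPU and leaving one core per process for the
# training loop itself (e.g. 24 cores / 8 GPUs -> 2 loader workers each).
def loader_workers(num_gpus: int) -> int:
    cpus = os.cpu_count() or 1
    return max(1, cpus // num_gpus - 1)

print(loader_workers(8))
```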
Hello,
I followed closely the README and launched a training using the following command on a server with 8 V100 GPUs:
Training took almost an hour per epoch, and the second epoch is almost as slow as the first one. The output of the log file is as follows:
Is there anything I should check?
Thank you in advance for your response.