microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License

Training times - bottlenecked by dataloader? #117

Closed giuqoob closed 1 year ago

giuqoob commented 1 year ago

Has anyone else had a tough time with training times using the scripts provided? I recently upgraded my GPU to an RTX 4090 and noticed barely any improvement over a GTX 980 Ti (from 2015), which made me suspicious. One epoch takes 9 hours with the new card, which seems like a long time. I played around with the following parameters with no success, so I'm led to believe the issue is in how the data is loaded.

For reference, all the training data is on an NVMe drive, the CPU is an i7-8700K, and I have enough RAM to hold the full dataset in memory. I'm fairly new to deep learning, so I don't want to change too much and risk training a model that doesn't work; switching from float32 to bfloat16 already feels potentially risky. I'm on CUDA 12.2 and PyTorch 2.0.
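For what it's worth, a minimal sketch of what that bfloat16 switch could look like using PyTorch's autocast (the `model`, `criterion`, `optimizer`, and `batch` names here are placeholders, not names from the table-transformer training scripts):

```python
import torch

# Hedged sketch of a bfloat16 training step via torch.autocast.
def train_step(model, criterion, optimizer, batch, device="cuda"):
    inputs, targets = batch
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 keeps float32's exponent range, so unlike float16
    # it usually needs no GradScaler.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = criterion(model(inputs), targets)
    loss.backward()   # params and grads stay float32 outside autocast
    optimizer.step()
    return loss.item()
```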

Any suggestions or similar experiences?
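To check whether data loading really is the bottleneck, a small stdlib-only harness (names are mine, not from the repo) can split each iteration into load time vs. compute time:

```python
import time

def profile_loop(batches, step, warmup=1):
    """Split loop time into data-loading vs. compute seconds.

    batches: any iterable yielding batches (e.g. a torch DataLoader)
    step:    callable running forward/backward on one batch
    warmup:  initial batches to ignore (worker startup, CUDA init)
    """
    load_s = compute_s = 0.0
    it = iter(batches)
    i = 0
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)   # time spent waiting on the loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)            # time spent in the training step itself
        t2 = time.perf_counter()
        if i >= warmup:
            load_s += t1 - t0
            compute_s += t2 - t1
        i += 1
    return load_s, compute_s
```

If `load_s` dominates, DataLoader tuning should help; if `compute_s` dominates, the GPU side is the limit. (On CUDA, call `torch.cuda.synchronize()` at the end of `step`, or GPU time will be under-counted due to async kernel launches.)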

giuqoob commented 1 year ago

I've tried to debug this for a while now, playing around with DataLoader settings without much success. It does not look like data loading is the bottleneck after all.

What training speeds have others achieved, and with what configuration? I'm trying to train the model on both PubTables-1M and FinTabNet data, and it is painfully slow even on high-end hardware, at around 10 hours per epoch. If I want to hit 20 epochs, it is going to take a while at this rate. To make sure my hardware is supported, I've updated torch to 2.1.0.dev20230608+cu121 and also installed cuDNN v8.9.2; CUDA is v12.1.
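For anyone ruling out the loader side, these are the usual DataLoader knobs I experimented with; a hedged sketch (the dataset here is a stand-in, not the repo's):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for PubTables-1M / FinTabNet.
dataset = TensorDataset(torch.zeros(64, 3, 224, 224))

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=8,            # parallel CPU workers; try ~#physical cores
    pin_memory=True,          # page-locked buffers speed up host->GPU copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=4,        # batches each worker preloads ahead of time
)
```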

giuqoob commented 1 year ago

Would love to get confirmation from @bsmock - did you monitor how long training took? I wonder if there is something I could be doing differently. After several days I am now at epoch 9 on a PubTables-1M + FinTabNet(.a6) dataset.

bsmock commented 1 year ago

Yes, we did have long training times like the ones you are observing. When we wrote the training code, we wanted to assume minimal hardware resources so that anyone could reproduce our results.

giuqoob commented 1 year ago

Thanks for getting back to me so quickly, and good to hear this is the case - I'll close this issue since the question was answered. For reference, I ran some tests using the setup from the original paper (so no cap on images per epoch): https://unpoco.notion.site/GPU-considerations-e1d93c6916634f2981aa462bed23129d?pvs=4 I did end up increasing batch_size to 8, since it gave a significant speedup, hoping it won't hurt accuracy much as these are still relatively small values.
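On the batch_size increase: a common heuristic (the linear scaling rule, not something this repo prescribes) is to scale the learning rate with the batch size; a one-line sketch:

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    # Linear scaling rule: keep lr / batch_size roughly constant.
    return base_lr * new_batch_size / base_batch_size
```

For example, going from batch size 2 at lr 5e-5 to batch size 8 would suggest lr 2e-4; whether that actually helps accuracy here is untested.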