training seems to be too slow?

macabdul9 / CASA-Dialogue-Act-Classifier

PyTorch implementation of the paper "Dialogue Act Classification with Context-Aware Self-Attention" for dialogue act classification with a generic dataset class and PyTorch-Lightning trainer

MIT License

training seems to be too slow? #7

Open PolKul opened 3 years ago

PolKul commented 3 years ago

Hi, I have a system with a Ryzen Threadripper 24-core processor and a Titan RTX card, but when I run the training script it takes more than 6 seconds per iteration. Everything is set up with your default parameters. If I train for 100 epochs, it could take more than 600 hours :)

Epoch 0: 0%| | 13/3337 [01:29<6:20:41, 6.87s/it, loss=3.736, v_num=48]

Is this normal, or is there something we can tune to improve performance? Thanks
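For reference, the estimate above can be checked from the progress bar figures (3337 iterations per epoch at 6.87 s per iteration). This is only a back-of-the-envelope sketch; the helper name `estimate_hours` is made up for illustration and is not part of the repository.

```python
def estimate_hours(iters_per_epoch: int, sec_per_iter: float, epochs: int = 1) -> float:
    """Rough wall-clock estimate in hours, ignoring validation and I/O overhead."""
    return iters_per_epoch * sec_per_iter * epochs / 3600.0


# Figures taken from the progress bar in this comment.
per_epoch = estimate_hours(3337, 6.87)
full_run = estimate_hours(3337, 6.87, epochs=100)

print(f"{per_epoch:.1f} h per epoch, {full_run:.0f} h for 100 epochs")
# prints "6.4 h per epoch, 637 h for 100 epochs"
```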

macabdul9 commented 3 years ago

Hi @PolKul, yes, this is normal: each utterance needs its dialogue history, so we can't parallelize the training. That said, here is a Kaggle Kernel to train it on Kaggle Compute, which takes around 1 hr/epoch. Another thing you can do, instead of running evaluation after every epoch, is to evaluate on a smaller portion of the data and only after every k epochs (where k > 1); this can be configured in `pl.Trainer` (line 43 of `main.py`). You need not worry about the `100 epochs`: it will converge well before that, and there's `EarlyStopping` in place.
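A minimal sketch of what such a configuration might look like, assuming a PyTorch Lightning version where `pl.Trainer` accepts `check_val_every_n_epoch` and `limit_val_batches`. The helper names, the default values, and the monitored metric name `val_accuracy` (guessed from the checkpoint file names in this thread) are assumptions for illustration, not the repository's actual `main.py`.

```python
def validation_schedule(k_epochs: int = 5, val_fraction: float = 0.25,
                        max_epochs: int = 100) -> dict:
    """Build pl.Trainer keyword arguments that validate less often, on less data."""
    if k_epochs < 1:
        raise ValueError("k_epochs must be >= 1")
    if not 0.0 < val_fraction <= 1.0:
        raise ValueError("val_fraction must be in (0, 1]")
    return {
        "max_epochs": max_epochs,
        # Run the validation loop only every k-th epoch.
        "check_val_every_n_epoch": k_epochs,
        # A float here means "use this fraction of the validation batches".
        "limit_val_batches": val_fraction,
    }


def build_trainer(**schedule_overrides):
    """Create a Trainer with early stopping (hypothetical helper).

    The import is local so the schedule helper above works without Lightning.
    """
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import EarlyStopping

    # Assumed metric name; it must match what the LightningModule logs.
    early_stop = EarlyStopping(monitor="val_accuracy", mode="max", patience=3)
    return pl.Trainer(callbacks=[early_stop], **validation_schedule(**schedule_overrides))


print(validation_schedule(k_epochs=2, val_fraction=0.5))
```

With this in place, something like `build_trainer(k_epochs=2).fit(model, datamodule)` would validate every second epoch. Note that `EarlyStopping` counts its patience in validation checks rather than training epochs, so validating every k epochs stretches the effective patience by a factor of k.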

It also takes that long because the dataset is quite large. @glicerico has trained checkpoints, so if he can share them, that would be very helpful. You can run the evaluation directly, and/or tweak the training configuration manually (or via hyperparameter search) and re-train from the checkpoint; it should take only a few epochs to converge again.

Hope this helps.

PolKul commented 3 years ago

Hi @macabdul9,

Thank you for the comments. Yes, it would really help if you could share the trained checkpoint. May I ask you to share it with me, please? Thanks.

glicerico commented 3 years ago

I am uploading it to Dropbox to share, but it's about half a GB in size, and GitHub has a 100 MB file size limit. Do you have some place to host the checkpoint, @macabdul9?

glicerico commented 3 years ago

@PolKul here's the checkpoint for my trained model:

https://www.dropbox.com/s/y42bw6qmoa9b8k2/epoch%3D29-val_accuracy%3D0.748834.ckpt?dl=0

However, I just realized that a new commit changes the class order, as suggested in issue https://github.com/macabdul9/CASA-Dialogue-Act-Classifier/issues/5 , so I am not sure whether that makes this checkpoint unusable. Let me know how it works for you.

macabdul9 commented 3 years ago

Yes, @glicerico, the checkpoint will not be usable as it is, but if you have the label dictionary from your training run, it will still be useful.
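To illustrate why the label dictionary matters: a checkpoint only stores class indices, so if the class order changes between commits, the same index can point to a different dialogue act. Below is a hedged sketch of saving the mapping next to a checkpoint. The function names, the file name `label_dict.json`, and the sample tags are illustrative assumptions, not code from the repository.

```python
import json
import tempfile
from pathlib import Path


def build_label_dict(labels):
    """Map each distinct label to an index, in sorted order so it is reproducible."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def save_label_dict(label_dict, path):
    """Write the label -> index mapping as JSON, e.g. next to the .ckpt file."""
    Path(path).write_text(json.dumps(label_dict, indent=2))


def load_index_to_label(path):
    """Load the mapping and invert it so predicted indices can be decoded."""
    label_dict = json.loads(Path(path).read_text())
    return {idx: label for label, idx in label_dict.items()}


# Illustrative tags only; the real label set comes from the training data.
label_dict = build_label_dict(["sd", "b", "sv", "sd"])

with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / "label_dict.json"  # hypothetical file name
    save_label_dict(label_dict, mapping_path)
    index_to_label = load_index_to_label(mapping_path)

print(index_to_label[1])  # prints "sd"
```

Shipping such a file together with each checkpoint would let anyone decode predictions even after the class order in the code changes.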

PolKul commented 3 years ago

@glicerico, thank you for the checkpoint. But as I understand, as per the @macabdul9 comment, I cannot use it without the "label dictionary", right? If you have that dictionary, maybe you can send it as well?

glicerico commented 3 years ago

Unfortunately, I don't have a label dictionary. I will soon re-train with the fix for issue #5 applied.

glicerico commented 3 years ago

@PolKul , @macabdul9 Here's a checkpoint after the fix to issue #5

https://www.dropbox.com/s/hn3d3c273aiyymo/epoch%3D29-val_accuracy%3D0.751411.ckpt?dl=0

macksjeremy commented 3 years ago

Still an issue for me, even with Kaggle GPUs: training runs at about three quarters of an epoch per hour. I was wondering whether there is a checkpoint available, or a way to grab the fastest checkpoint for Kaggle GPUs.