philschmid / deep-learning-pytorch-huggingface

Inference on CNN validation set takes 2+ hours on p4d.24xlarge machine with 8 A100s, 40GB each #13

Open sverneka opened 1 year ago

sverneka commented 1 year ago

I ran the training code as-is; at the end of each epoch it runs inference on the test set. I found that inference was taking too long, and GPU memory and utilization were maxed out on a p4d.24xlarge with 8 A100 40GB GPUs. Surprisingly, training was much faster than inference! Any idea how to fix this? Thanks!
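If this is the FLAN-T5 summarization script, the slow part is most likely `generate()` during evaluation: generation decodes one token at a time, while a training step is a single forward/backward pass, so per-epoch evaluation can easily dwarf training time. A minimal sketch of the evaluation settings that usually dominate the runtime, assuming the script uses `Seq2SeqTrainer` with `predict_with_generate` (the output dir and values below are illustrative, not the repo's actual config):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-cnn",        # hypothetical output dir
    per_device_eval_batch_size=64,   # raise until GPU memory is the limit
    predict_with_generate=True,
    generation_max_length=64,        # cap summary length; long defaults make eval crawl
    generation_num_beams=1,          # greedy decoding instead of beam search
    evaluation_strategy="epoch",     # or "no", and evaluate once after training
    bf16=True,                       # A100s support bf16; T5 is known to overflow in fp16
)
```

Evaluating only once after training, instead of at every epoch, also avoids paying the generation cost repeatedly.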

sverneka commented 1 year ago

@stas00 Can you please help me with this? Thanks!

stas00 commented 1 year ago

I'm not quite sure why you're tagging me here, as I am not part of this project and have no idea what code you're talking about.

If it's a transformers question, please ask at https://github.com/huggingface/transformers/issues and give the full context of the issue.

Thank you.

Ro0tee commented 1 year ago

The same issue occurs when fine-tuning Flan-T5 with LoRA and bnb int-8 on a summarisation dataset on a single A100 40GB: inference takes a long time while training is very fast. Any solution? Thank you!
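For the int-8 LoRA setup specifically, one hedged workaround is to train in int-8 but run inference in half precision: the LLM.int8 matmuls are much slower at generation time than bf16/fp16 ones. A sketch, assuming a PEFT LoRA adapter saved at a hypothetical path:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

PEFT_MODEL_DIR = "flan-t5-lora-checkpoint"  # hypothetical path to the saved adapter

# Reload the base model in half precision instead of int-8 ...
base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl", torch_dtype=torch.bfloat16, device_map="auto"
)
# ... attach the trained adapter and fold its weights into the base model
model = PeftModel.from_pretrained(base, PEFT_MODEL_DIR)
model = model.merge_and_unload()
model.eval()
```

After `merge_and_unload()` the model is a plain half-precision model, so generation no longer pays the quantized-matmul overhead at all.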

philschmid commented 1 year ago

This doesn't seem like an issue to me. Have you tried running inference after training is done, and adjusting the generation parameters?
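For reference, a hedged sketch of such a standalone inference pass, batched, in half precision, and without gradient tracking (the checkpoint path and `validation_docs` list are hypothetical):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "flan-t5-cnn"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16
).to("cuda")
model.eval()

docs = ["summarize: " + d for d in validation_docs]  # validation_docs: hypothetical list of articles
summaries = []
with torch.inference_mode():  # no autograd bookkeeping during generation
    for i in range(0, len(docs), 64):
        batch = tokenizer(docs[i : i + 64], return_tensors="pt",
                          padding=True, truncation=True, max_length=512).to("cuda")
        out = model.generate(**batch, max_new_tokens=64, num_beams=1)
        summaries.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
```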

sadahanu commented 1 year ago

I'm seeing the same problem, and I get warning messages like:

```
Invalidate trace cache @ step 0: expected module 2, but got module 0
```
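That message comes from DeepSpeed ZeRO-3: the runtime records the order in which modules fetch their partitioned parameters so it can prefetch them, and `generate()` walks the modules in a different order than a training step, so the trace is invalidated and parameters are gathered on demand, which makes generation very slow. One hedged workaround is to run inference outside the ZeRO engine entirely, rebuilding a plain model from the checkpoint with DeepSpeed's own conversion helper (the checkpoint path below is hypothetical):

```python
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
from transformers import AutoModelForSeq2SeqLM

CKPT_DIR = "output/checkpoint-500"  # hypothetical DeepSpeed checkpoint dir

# Rebuild a plain (non-partitioned) model from the ZeRO-3 shards,
# then generate without the ZeRO parameter-gathering machinery.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
model = load_state_dict_from_zero_checkpoint(model, CKPT_DIR)
model.eval()
```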