schmidt-ju / crat-pred


Slow training speed #11

Closed liuyueChang closed 1 year ago

liuyueChang commented 1 year ago

Thank you for your paper and code! I am training the model on my machine; the GPU is an A6000. When I set the batch_size to 128 or 256, there is no improvement in training speed. In total it takes about 1 day to finish the 72 epochs. Can you give me some advice or a possible method to solve this problem?

Here is my environment. If you want me to give more information, please tell me! Thank you very much!

```
pytorch-lightning        1.5.10
pytz                     2022.1
PyYAML                   6.0
requests                 2.27.1
requests-oauthlib        1.3.1
rsa                      4.8
scikit-learn             1.0.2
scipy                    1.8.0
setuptools               59.5.0
shapely                  2.0.1
six                      1.16.0
sklearn                  0.0.post1
tensorboard              2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit   1.8.1
termcolor                2.2.0
threadpoolctl            3.1.0
torch                    1.11.0
torch-geometric          2.0.4
torch-scatter            2.0.9
torch-sparse             0.6.13
torchaudio               0.11.0
torchcde                 0.2.5
torchdiffeq              0.2.3
torchmetrics             0.8.0
torchsde                 0.2.5
torchvision              0.12.0
tqdm                     4.64.0
trampoline               0.1.2
typing_extensions        4.1.1
urllib3                  1.26.8
Werkzeug                 2.1.1
wheel                    0.37.1
yarl                     1.7.2
zipp                     3.8.0
```

liuyueChang commented 1 year ago

Ohhh, there is a very important phenomenon during my training: the GPU utilization increases to about 20% and then drops, and then it stays at 0 for a while. This pattern keeps recurring throughout training.

schmidt-ju commented 1 year ago

Hey :)

Based on your description, my guess is that you are using online preprocessing. Is this the case?
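For readers unfamiliar with the distinction, here is a minimal illustrative sketch of online preprocessing (recomputing features every access) versus loading a cached .pkl file. The class and method names below are hypothetical, not the repo's actual API:

```python
import pickle


class TrajectoryDataset:
    """Illustrative only: preprocess on the fly (online) or load a cached .pkl (offline)."""

    def __init__(self, raw_samples=None, preprocessed_path=None):
        if preprocessed_path is not None:
            # Offline: the data was preprocessed once and cached to disk.
            with open(preprocessed_path, "rb") as f:
                self.data = pickle.load(f)
        else:
            # Online: keep the raw samples and preprocess inside __getitem__.
            self.raw_samples = raw_samples
            self.data = None

    def _preprocess(self, sample):
        # Placeholder for the feature extraction done during online preprocessing.
        return sample

    def __getitem__(self, idx):
        if self.data is not None:
            return self.data[idx]
        return self._preprocess(self.raw_samples[idx])  # recomputed every epoch

    def __len__(self):
        return len(self.data) if self.data is not None else len(self.raw_samples)


if __name__ == "__main__":
    ds = TrajectoryDataset(raw_samples=list(range(8)))  # online variant
    print(ds[0], len(ds))
```

Online preprocessing repeats the feature extraction every epoch, which can keep the GPU waiting on the CPU and produce exactly the utilization pattern described above.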

Julian

liuyueChang commented 1 year ago

Thank you for your answer! I have run the preprocessing script according to your README, and the preprocessing result has been saved in a .pkl file. I am very confused about this phenomenon.

schmidt-ju commented 1 year ago

Are you also using the preprocessed file during training, i.e. `--use_preprocessed=True`?

liuyueChang commented 1 year ago

Yes, I use the `--use_preprocessed=True` option. I also debugged train.py, and it steps into:

```python
if args.use_preprocessed:
    with open(input_preprocessed, 'rb') as f:
        self.data = pickle.load(f)
```

liuyueChang commented 1 year ago

I have solved this problem. num_workers should be set to 0! In my code this parameter was set to 4, and that high value made training very slow. Although I still haven't figured out the reason, I can continue my own work!
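For anyone hitting the same symptom, here is a minimal, self-contained sketch of the workaround. The dataset below is a dummy stand-in, not the repo's actual dataset class, and the exact cause of the slowdown was not confirmed in this thread:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy in-memory dataset standing in for the preprocessed .pkl data.
dummy_data = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 2))

# Workaround described above: num_workers=0 keeps data loading in the main
# process. With num_workers=4, worker startup and inter-process transfer of
# the already in-memory samples can outweigh any benefit, stalling the GPU.
train_loader = DataLoader(
    dummy_data,
    batch_size=128,
    shuffle=True,
    num_workers=0,
    pin_memory=True,
)

for batch in train_loader:
    pass  # training step would go here
```

Since the preprocessed data is already loaded into RAM via pickle, there is little CPU work left for worker processes to parallelize, which is consistent with num_workers=0 being the faster setting here.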

schmidt-ju commented 1 year ago

Awesome! Thanks for letting me know :D