Closed MoustHolmes closed 1 year ago
From the logs it looks like you are using GPU 1, but the utilization rate on GPU 1 is abysmal, which suggests an improperly configured run. Have you tested different configurations, such as learning rate, batch size, etc.?
I would recommend that you terminate the run, as it is not actually using any GPU resources. I have had similar runs where I had to test different configurations.
I second @MortenHolmRep's point above: you are seeing slow training because the GPU is basically not being utilised. This could be due to a too-small batch size, too few workers loading data, expensive CPU operations in the model or in the dataloaders, etc. — or someone else running their training script at the same time as you and doing some of the above.
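To make the batch-size/worker point concrete, here is a minimal sketch of the `DataLoader` settings that most often cause GPU starvation. The dataset and sizes are placeholders, not your actual graphnet pipeline:

```python
# Hedged sketch: DataLoader settings that commonly affect GPU utilisation.
# The dataset here is a stand-in, not the graphnet data pipeline.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=256,   # too-small batches starve the GPU; try larger values
    num_workers=2,    # parallel CPU workers for loading; with 0 the main
                      # process loads data while the GPU sits idle
    pin_memory=True,  # faster host-to-device copies when training on GPU
)

def first_batch_shape(loader):
    """Fetch one batch to sanity-check the configuration."""
    x, _ = next(iter(loader))
    return tuple(x.shape)
```

If raising `batch_size` and `num_workers` does not move GPU utilisation, the bottleneck is more likely CPU-side preprocessing inside the dataset itself.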
I'll be closing this issue as, in its current form, I wouldn't consider this a bug in the graphnet code, but rather a poorly performing training run. I suggest you use the #help channel in Slack to get feedback on improving the GPU utilisation. If you do narrow the problem down to part of the graphnet code, please do open a dedicated issue.
I have been running on graphnet-cleaned data, and training has been much slower than I would expect from a larger dataset. The previous dataset, which wasn't graphnet-cleaned, is 1/10 the size of the current one, but I could run about 10 epochs in roughly an hour. The larger dataset has now been running for almost 24 hours and has only completed half an epoch, which is slower than the size increase alone would explain. A possible culprit is that the pulsemap contains events with no hits due to the cleaning. Beyond that, I have also noticed that training seems to slow down significantly when the task has more than one target.
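One quick way to test the empty-events hypothesis is to count events in the truth table that have no corresponding rows in the cleaned pulsemap table. This is a hedged sketch: the table and column names (`truth`, `SRTCleanedPulses`, `event_no`) are assumptions about the database layout, not a verified graphnet schema — substitute your own:

```python
# Hedged sketch: count events whose cleaned pulsemap is empty.
# Table/column names are assumptions, not the confirmed graphnet schema.
import sqlite3

def count_empty_events(db_path, pulsemap="SRTCleanedPulses", truth="truth"):
    """Return the number of truth events with zero pulses in `pulsemap`."""
    con = sqlite3.connect(db_path)
    try:
        query = f"""
            SELECT COUNT(*) FROM {truth} t
            WHERE NOT EXISTS (
                SELECT 1 FROM {pulsemap} p WHERE p.event_no = t.event_no
            )
        """
        (n_empty,) = con.execute(query).fetchone()
        return n_empty
    finally:
        con.close()
```

If this returns a large fraction of the dataset, filtering those events out before training (or at selection time) would be worth trying before digging further.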
wandb logs data config