pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Unexplainable loss spikes/peaks #8264

Open AdrianSchneble opened 11 months ago

AdrianSchneble commented 11 months ago

🐛 Describe the bug

I'm currently having problems similar to those detailed in issue #1478. Comments on that issue, as well as the linked Stats SE post, mostly depict the occurrence of loss spikes as instabilities that are more or less unavoidable when using GNNs. However, the latest update to either is by now over 3 years old (not counting the tangentially related issue #2004; but that's just slightly more recent, anyway). Furthermore, it was also mentioned that while GNNs encounter some instabilities, there has been "no thorough study about this".

Since I'm seeing similar loss spikes in the project I'm currently working on (unexplainable, as far as my expertise goes), I'm curious whether there have been any updates on this behaviour in the meantime that didn't make it into the old issue(s). I couldn't find any studies on the problem myself, but I'm hoping someone among the PyG maintainers may have more insight. More specifically, I would ultimately like to know whether this is just behaviour I'll have to live with, or whether there is a potential solution by now.

As for my own testing so far: since parts of the discussion suggest the optimizer may be at fault, I have tried using both the Adamax and SGD optimizers, but neither option eliminated the loss spikes. I have not tried the mentioned gradient clipping, but the previous discussion doesn't suggest that it may be of much help, anyway. Side note: I'm aware Python 3.7 is not exactly the latest version of Python, but it's a limitation of the environment I'm working in that I cannot currently change; the same is true for being restricted to CPU-based training.
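For reference, the gradient clipping mentioned above takes only one extra line in a plain PyTorch training loop. The following is a minimal sketch, not code from the project in question: the model, the random data, the learning rate, and the `max_norm` value are all placeholder assumptions. Only `torch.nn.utils.clip_grad_norm_` and `torch.optim.Adamax` are actual PyTorch APIs. It runs on CPU.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in model and data; the real GNN and data loader
# are not shown in this issue.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 8)
y = torch.randn(64, 1)

max_norm = 1.0   # assumed clipping threshold; needs tuning per problem
grad_norms = []  # gradient norm before clipping, one entry per step

for step in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescales all gradients in place so their combined L2 norm is at most
    # max_norm, and returns the combined norm measured before clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    grad_norms.append(float(total_norm))
    optimizer.step()
```

Logging the returned norm alongside the loss may also help with diagnosis: if a loss spike is preceded by a jump in the gradient norm, clipping is more likely to make a difference than if the norm stays flat.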

For illustration, here's an example loss graph visualizing the problem:

[Image: losses_fold_0]

Environment

rusty1s commented 11 months ago

Sorry for the late reply. I think these issues are super hard to track down. From my personal experience, I don't see this at all in my use cases. A few things to try: