mims-harvard / SubGNN

Subgraph Neural Networks (NeurIPS 2020)
https://zitniklab.hms.harvard.edu/projects/SubGNN
MIT License

Memory issues and other bugs #7

Closed (rmwu closed 3 years ago)

rmwu commented 3 years ago

Hello! I was running your codebase and came across several bugs, as well as memory issues.

1) I processed a dataset of my own with 21584 vertices, 342685 edges, and 16734 subgraph labels. Running the code on it with a batch size of 128 failed with an out-of-memory error: "RuntimeError: CUDA out of memory. Tried to allocate 96.11 GiB (GPU 0; 10.92 GiB total capacity; 5.34 MiB already allocated; 10.40 GiB free; 22.00 MiB reserved in total by PyTorch)"

I noticed that most of your experiments are even larger, with more vertices and/or edges, so if I may ask, does this error sound reasonable, and what sort of infrastructure were your experiments performed on? In your experience, what dataset size and/or parameters would be reasonable to run on a single GPU?

2) When I run on the ppi_bp dataset (downloaded from Dropbox) with the provided config JSON, line 45 of gamma.py raises "KeyError: 21114" when computing degrees: "graph_degree_seq = [degree_dict[n-1] for n in nodes]"

3) Line 100 of SubGNN.py raises the error "RuntimeError: Cannot set the device explicitly. Please use module.to(new_device)." This suggests that self.device cannot be assigned to directly. I'm not sure whether this is due to differing package versions, as I installed them myself, but I changed this line to use self.device_num instead.

4) The data preprocessing code saves the GraphSAINT embeddings as "gcn_graphsaint_embeddings.pth", but line 229 of train_config.py loads them as "graphsaint_gcn_embeddings.pth".

Thank you for your help!

EmilyAlsentzer commented 3 years ago

Thanks for using SubGNN! Below are answers to your questions 1 & 4. Let me look into questions 2 & 3 and get back to you.

  1. Memory for SubGNN scales as a function of the number of subgraphs. Your dataset, with roughly 16k subgraphs, is actually larger than the datasets we used in our paper. Are you able to run with a smaller batch size?

  4. Good catch. We recently refactored the code to make it easier for others to use, but clearly missed a few bugs. I've updated the codebase to use graphsaint_gcn_embeddings.pth everywhere.
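On the memory question, one way to act on the smaller-batch-size suggestion is to probe downward until a step fits. The following is a minimal sketch, not part of SubGNN: `find_fitting_batch_size` and `run_step` are hypothetical names, and it assumes the failure surfaces as a RuntimeError whose message contains "out of memory", as in the error quoted in the report.

```python
from typing import Callable


def find_fitting_batch_size(run_step: Callable[[int], None],
                            start: int = 128,
                            min_size: int = 1) -> int:
    """Halve the batch size until run_step(batch_size) stops raising OOM.

    run_step is a hypothetical callable that runs one training step at the
    given batch size; it is not a SubGNN function.
    """
    batch_size = start
    while batch_size >= min_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    raise RuntimeError(f"no batch size >= {min_size} fit in memory")


# Stand-in for a real training step: pretend anything above 16 does not fit.
def fake_step(batch_size: int) -> None:
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory. (simulated)")


chosen = find_fitting_batch_size(fake_step, start=128)
print(chosen)
```

If the large allocation scales with the total number of subgraphs rather than with the batch, shrinking the batch will not help, and the probe will end in the final error instead of returning a size.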

leaf-ygq commented 3 years ago

Hello, thanks for sharing! When I ran on the ppi_bp dataset, I also hit this bug: "KeyError: 17605" when computing degrees in "graph_degree_seq = [degree_dict[n-1] for n in nodes]". Would you mind helping me with it?

EmilyAlsentzer commented 3 years ago

Hi @leaf-ygq and @rmwu, I figured out what was causing the bug. We had uploaded a degree_sequence.txt file to Dropbox that corresponded to an older version of the PPI network. Please re-download the ppi_bp folder from Dropbox, and reopen this issue if you have any more trouble.
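To check whether a local copy is affected, a test along these lines may help. It is a sketch that assumes, based on the expression quoted in the reports, that `degree_dict` maps zero-based node indices to degrees and that `nodes` holds one-based IDs; `missing_degree_keys` is a hypothetical helper, not part of SubGNN.

```python
from typing import Dict, Iterable, List


def missing_degree_keys(degree_dict: Dict[int, int],
                        nodes: Iterable[int]) -> List[int]:
    """Return the node IDs n for which degree_dict[n - 1] would raise KeyError."""
    return sorted({n for n in nodes if (n - 1) not in degree_dict})


# Toy data: degrees are known for indices 0..3 only.
degree_dict = {0: 2, 1: 3, 2: 1, 3: 4}
nodes = [1, 2, 4, 5, 7]

missing = missing_degree_keys(degree_dict, nodes)
print(missing)
```

A non-empty result suggests that the degree file and the graph come from different versions of the network.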

@rmwu - I'm not able to reproduce the bug you mention in (3). It may have resulted from a mismatch in environments. Hopefully it doesn't appear with the updated conda env, but please let me know if you keep running into it.

luoyuanlab commented 3 years ago

For (3), I ran into the same issue. Changing line 99 in SubGNN.py to the following seems to work:

self.to(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

Update: the above is true for pytorch_lightning v1.0.7. For v0.7.1, the original code works.
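The error message in (3) is consistent with `device` being exposed as a property that rejects assignment in the newer version, so that setting it fails while calling `.to(...)` succeeds. The toy class below only imitates that pattern in plain Python; `ToyModule` is hypothetical and is not pytorch_lightning's implementation.

```python
class ToyModule:
    """Imitates a module whose device can be changed only through to()."""

    def __init__(self) -> None:
        self._device = "cpu"

    @property
    def device(self) -> str:
        return self._device

    @device.setter
    def device(self, new_device: str) -> None:
        # Mirrors the message quoted in the report.
        raise RuntimeError(
            "Cannot set the device explicitly. Please use module.to(new_device)."
        )

    def to(self, new_device: str) -> "ToyModule":
        self._device = new_device
        return self


module = ToyModule()
try:
    module.device = "cuda"  # direct assignment is rejected
    assignment_error = None
except RuntimeError as err:
    assignment_error = str(err)

module.to("cuda")  # moving the module through to() is accepted
print(assignment_error, module.device)
```

Under that assumption, code written for the older version that assigns to `self.device` would need to go through `.to(...)` instead, which matches the workaround above.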