twitter-research / neural-sheaf-diffusion

Apache License 2.0

Error occurred while reproducing the code #6

Closed: WanLang0 closed this issue 1 year ago

WanLang0 commented 1 year ago

Hi, your project code is excellent, but I ran into a problem at runtime. When I run sh ./exp/scripts/run_texas.sh, the console reports the following error. Please take a look, thank you!

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
./exp/scripts/run_texas.sh: line 18:   761 Aborted                 (core dumped) python -m exp.run --dataset=texas --d=3 --layers=4 --hidden_channels=20 --left_weights=True --right_weights=True --lr=0.02 --weight_decay=5e-3 --input_dropout=0.0 --dropout=0.7 --use_act=True --model=BundleSheaf --normalised=True --sparse_learner=True --entity="${ENTITY}"

crisbodnar commented 1 year ago

This is most likely because you ran out of memory (see, for instance, https://www.positioniseverything.net/terminate-called-after-throwing-an-instance-of-std_bad_alloc/). Can you check your memory consumption while running the model? It is also hard to help further without more details about your setup. Could you tell me more about the machine you are using?
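One way to check memory consumption, sketched here as an assumption rather than anything the repository provides: run the experiment as a child process and read its peak resident memory with Python's resource module once it exits. Since std::bad_alloc is typically thrown by C++ operator new when a host-side allocation fails, this measures host RAM, not GPU memory. The helper name run_and_report is hypothetical; it also decodes a negative return code into the signal that killed the process (SIGABRT for "Aborted", SIGSEGV for "Segmentation fault"). POSIX only.

```python
import resource
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (returncode, peak_rss_bytes, signal_name).

    Hypothetical helper for illustration; POSIX only (resource is unavailable on Windows).
    """
    proc = subprocess.run(cmd)
    # For RUSAGE_CHILDREN, ru_maxrss is the peak RSS of the largest terminated
    # child this process has waited for, so measure one experiment per run.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    peak_bytes = usage.ru_maxrss * scale
    # A negative return code means the child was killed by that signal number,
    # e.g. -6 (SIGABRT) for "Aborted" or -11 (SIGSEGV) for "Segmentation fault".
    signal_name = None
    if proc.returncode < 0:
        signal_name = signal.Signals(-proc.returncode).name
    return proc.returncode, peak_bytes, signal_name


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is the command to monitor.
    rc, peak, sig = run_and_report(sys.argv[1:])
    print(f"exit code {rc}, signal {sig}, peak host memory {peak / 2**20:.1f} MiB")
```

Saved under a hypothetical name such as monitor.py, it could be invoked from the repository root as python monitor.py python -m exp.run --dataset=texas ... with the remaining flags copied from run_texas.sh.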

WanLang0 commented 1 year ago

Thank you for your reply. My setup is a single Tesla T4 GPU with 15 GB of memory, CUDA 10.2, and Ubuntu 18.04. I have monitored GPU usage, and it stays close to zero. My steps were as follows:

1. Ran conda env create --file=environment_gpu.yml and conda activate nsd. I installed the PyG module manually because of a 404 error.
2. Ran pytest -v . and got two kinds of errors: Aborted (core dumped) and Segmentation fault (core dumped).
3. Ran sh ./exp/scripts/run_texas.sh. The same errors occur, and occasionally terminate called after throwing an instance of 'std::bad_alloc'.
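Because these are hard crashes rather than Python exceptions, one way to narrow down which package is broken is to import each one in a fresh interpreter, so a segmentation fault or abort kills only that child process and the parent can report it. This is a sketch under assumptions: the module list (torch, torch_geometric, torch_scatter, torch_sparse) is a guess at PyG and its usual compiled companions, not read from environment_gpu.yml, and probe_import is a hypothetical helper.

```python
import subprocess
import sys

# Assumed module names; adjust to whatever environment_gpu.yml actually installs.
MODULES = ["torch", "torch_geometric", "torch_scatter", "torch_sparse"]


def probe_import(module):
    """Import `module` in a separate interpreter and classify the outcome.

    Returns ("ok", version), ("crashed", detail) or ("error", last stderr line).
    """
    code = f"import {module}; print(getattr({module}, '__version__', 'unknown'))"
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
    if proc.returncode == 0:
        return "ok", proc.stdout.strip()
    if proc.returncode < 0:
        # Killed by a signal, e.g. 11 (SIGSEGV) or 6 (SIGABRT).
        return "crashed", f"killed by signal {-proc.returncode}"
    # Ordinary failure: the last stderr line usually names the exception.
    lines = proc.stderr.strip().splitlines()
    return "error", lines[-1] if lines else f"exit code {proc.returncode}"


if __name__ == "__main__":
    for name in MODULES:
        status, detail = probe_import(name)
        print(f"{name:16} {status:8} {detail}")
```

A "crashed" entry points at a compiled extension that does not match the installed torch or CUDA build, whereas "error" is usually a plain missing or incompatible package.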

crisbodnar commented 1 year ago

OK. What exactly was the 404 error? conda env create --file=environment_gpu.yml should just work. Did you install the same version of PyG as the one in environment_gpu.yml? Installing from the file with conda also performs dependency checking and may change certain versions to keep the packages compatible, so it is not equivalent to installing PyG separately.
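To check whether the installed packages match environment_gpu.yml, one rough approach is to read the simple version pins out of the file and compare them with the versions recorded in installed package metadata. This is a hedged sketch: it handles only single-line pins of the form "- name=version", "- name=version=build" or "- name==version" rather than full YAML, and conda package names do not always match the installed distribution names (conda's pytorch shows up as torch, and python itself has no such entry), so some entries will be reported as missing even when installed. read_pins, installed_versions and compare are hypothetical helpers.

```python
import re
import sys
from importlib import metadata

# Matches list entries such as "- python=3.8", "- numpy=1.21.2=py38h" or
# "- torch-geometric==2.0.3"; anything else (including "- pip:") is ignored.
PIN = re.compile(r"^\s*-\s*([A-Za-z0-9_.\-]+)\s*(==|=)\s*([^\s=#]+)")


def read_pins(text):
    """Return {package name: pinned version} for simple pins in an environment file."""
    pins = {}
    for line in text.splitlines():
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(3)
    return pins


def installed_versions(names):
    """Look up installed distribution versions; names that are not found are omitted."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


def compare(pins, installed):
    """Classify each pin as ok, missing, or mismatched.

    A conda-style pin such as "3.8" is treated as a prefix, so 3.8.12 counts as ok.
    """
    report = {}
    for name, want in pins.items():
        have = installed.get(name)
        if have is None:
            report[name] = "missing"
        elif have == want or have.startswith(want + "."):
            report[name] = "ok"
        else:
            report[name] = f"mismatch: pinned {want}, found {have}"
    return report


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_pins.py environment_gpu.yml
    with open(sys.argv[1]) as f:
        pins = read_pins(f.read())
    for name, status in compare(pins, installed_versions(pins)).items():
        print(f"{name:24} {status}")
```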

I did test the repo on multiple machines (macOS and Ubuntu), and other people have managed to run it in the last few days. Our automated tests (https://github.com/twitter-research/neural-sheaf-diffusion/actions) also install the dependencies and run all the tests, and they currently pass. So my guess is that something was not installed properly or that there is some other problem with your setup.

If you could tell me how I can reproduce this error, I can look into it.

WanLang0 commented 1 year ago

OK, it runs successfully now. The previous errors were caused by an incorrect PyG installation. Thank you for your help.

crisbodnar commented 1 year ago

Glad you managed to get it working. Will close the issue now, but let me know if you encounter other issues.