Open zhangyahu1 opened 3 years ago
I am now running the code without GPUs. After several epochs of training, I get this error:
Traceback (most recent call last):
  File "train.py", line 93, in <module>
    train_valid()
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/config.py", line 1032, in wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/utils.py", line 48, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/config.py", line 1009, in wrapper
    return fn(*new_args, **new_kwargs)
  File "train.py", line 87, in train_valid
    valid_freq=val_freq, reduce='mean')
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch/monitor.py", line 932, in run_training
    raise ValueError('NaN or Inf encountered. Training failed!')
ValueError: NaN or Inf encountered. Training failed!
I would appreciate it if you can give me some advice to solve this problem.
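One way to narrow this down is to log the loss at every iteration and find the first step at which it stops being finite. The following is only a minimal sketch: the helper name `first_non_finite` and the sample values are made up for illustration and are not part of neuralnet-pytorch.

```python
import math


def first_non_finite(values):
    """Return (step, value) for the first NaN/Inf entry, or None if all are finite."""
    for step, value in enumerate(values):
        number = float(value)  # also accepts single-element tensors
        if not math.isfinite(number):
            return step, number
    return None


# Hypothetical per-iteration losses, for illustration only.
losses = [0.93, 0.71, 0.64, float("inf"), float("nan")]
print(first_non_finite(losses))
```

Knowing the step at which the loss first blows up makes it easier to inspect the batch and the learning rate around that point.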
Hi @zhangyahu1. Could you please give me more details about your conda and PyTorch environments? The error comes from TORCH_CHECK, which is not available in early versions of PyTorch.
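The environment details asked for here can be collected with a few lines of Python. This is a sketch, not an official tool: the function name `report_versions` is made up, and it assumes the packages expose a `__version__` attribute, which is why missing values are reported as "unknown".

```python
import importlib
import platform


def report_versions(modules=("torch", "torchvision", "neuralnet_pytorch")):
    """Map each module name to its version string, or to a note if it cannot be imported."""
    versions = {"python": platform.python_version()}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # a broken install may fail in ways other than ImportError
            versions[name] = "import failed: {}".format(exc)
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in report_versions().items():
    print("{}: {}".format(name, version))
```

Pasting the output into the issue gives the interpreter and package versions in one place.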
Hi @justanhduc, my environment is:
pytorch 1.5.1
torchvision 0.6.1
cudatoolkit 10.1
python 3.6
The code runs now, but I get the following error:
    raise ValueError('NaN or Inf encountered. Training failed!')
ValueError: NaN or Inf encountered. Training failed!
Hi @zhangyahu1. Are you able to run on GPU now? I think I used PyTorch 1.7 for this code. Could you please try again with that version?
Thanks, @justanhduc! I will try running the code with PyTorch 1.7.
Hi @justanhduc, it works now with PyTorch 1.7. However, the code only runs well on a subset of the data: when all of the data is used, it raises 'CUDA out of memory' within one epoch. I wonder whether this is because memory is not released during training.
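A common cause of memory that grows over an epoch in PyTorch training loops is keeping the loss tensors themselves in a running total, which also keeps their autograd graphs alive. Whether that is what happens in this code has not been checked; the sketch below only shows the usual workaround of converting each loss to a plain Python number before accumulating it. The function name `mean_loss` is hypothetical.

```python
def mean_loss(batch_losses):
    """Average per-batch losses without holding on to tensors or their graphs."""
    total = 0.0
    count = 0
    for loss in batch_losses:
        # float() works for Python numbers and for single-element tensors,
        # and the result no longer references the computation graph.
        total += float(loss)
        count += 1
    return total / count if count else float("nan")


# Plain floats stand in for per-batch loss values here.
print(mean_loss([1.0, 2.0, 3.0]))
```

If memory still runs out with this pattern in place, reducing the batch size is the next thing to try.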
It seems I was using the wrong version of neuralnet-pytorch. I then downloaded the right version and ran: python setup.py install --cuda-ext. However, I get the following error when I run the code:
Traceback (most recent call last):
  File "train.py", line 16, in <module>
    import neuralnet_pytorch.gin_nnt as gin
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch-1.0.0+unknown-py3.6-linux-x86_64.egg/neuralnet_pytorch/__init__.py", line 38, in <module>
    import neuralnet_pytorch.ext as ext
ImportError: /home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch-1.0.0+unknown-py3.6-linux-x86_64.egg/neuralnet_pytorch/ext.cpython-36m-x86_64-linux-gnu.so: undefined symbol: PyThread_tss_create
I would appreciate it if you can give me some suggestions.
I also tried running:
pip install "neuralnet-pytorch[gin] @ git+git://github.com/justanhduc/neuralnet-pytorch.git@6bda19fdc57f176cb82f58d287602f4ccf4cfc23" --global-option="--cuda-ext"
This also produces an error.
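On the undefined symbol error above: PyThread_tss_create belongs to the thread-specific storage C API that was added in Python 3.7, while the environment here is Python 3.6. That suggests the extension was compiled against headers from a newer Python than the one loading it, though this is a guess from the symbol name alone. A quick check of what the running interpreter exports, using only the standard library (the helper name `interpreter_exports` is made up):

```python
import ctypes
import sys


def interpreter_exports(symbol):
    """Return True if the running CPython process exposes the given C-API symbol."""
    # ctypes.pythonapi raises AttributeError for symbols it cannot resolve,
    # so hasattr() turns the lookup into a simple yes/no answer.
    return hasattr(ctypes.pythonapi, symbol)


print("Python", sys.version.split()[0])
print("PyThread_tss_create exported:", interpreter_exports("PyThread_tss_create"))
```

If this reports that the symbol is missing, rebuilding the extension inside the same conda environment that runs train.py, or moving the environment to Python 3.7 or later, would be worth trying.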