Training process - Githubissues

zhangyahu1 commented 3 years ago

I also tried running pip install "neuralnet-pytorch[gin] @ git+git://github.com/justanhduc/neuralnet-pytorch.git@6bda19fdc57f176cb82f58d287602f4ccf4cfc23" --global-option="--cuda-ext"

It fails with the following error:

ERROR: Command errored out with exit status 1:
command: /home/yzhang4/anaconda3/envs/graphxx/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cuda-ext install --record /tmp/pip-record-jzi_zomv/install-record.txt --single-version-externally-managed --compile --install-headers /home/yzhang4/anaconda3/envs/graphxx/include/python3.6m/neuralnet-pytorch
cwd: /tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/
Complete output (114 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/_version.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/version.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/metrics.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/monitor.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/bpd.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/dist_chamfer.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/dist_emd.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/pc2vox.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/gin_nnt
copying neuralnet_pytorch/gin_nnt/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/gin_nnt
copying neuralnet_pytorch/gin_nnt/external_configurables.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/gin_nnt
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/adain.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/aggregation.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/blocks.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/points.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/resizing.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/abstract.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/convolution.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/normalization.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
copying neuralnet_pytorch/optim/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
copying neuralnet_pytorch/optim/adabound.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
copying neuralnet_pytorch/optim/lookahead.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
copying neuralnet_pytorch/optim/nadam.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/activation_utils.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/cv_utils.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/misc_utils.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/tensor_utils.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/zoo
copying neuralnet_pytorch/zoo/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/zoo
copying neuralnet_pytorch/zoo/resnet.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/zoo
copying neuralnet_pytorch/zoo/vgg.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/zoo
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim/lr_scheduler
copying neuralnet_pytorch/optim/lr_scheduler/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim/lr_scheduler
copying neuralnet_pytorch/optim/lr_scheduler/inverse_lr.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim/lr_scheduler
copying neuralnet_pytorch/optim/lr_scheduler/warm_restart.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim/lr_scheduler
UPDATING build/lib.linux-x86_64-3.6/neuralnet_pytorch/_version.py
set build/lib.linux-x86_64-3.6/neuralnet_pytorch/_version.py to '1.0.0+fancy.144.g6bda19f'
running build_ext
building 'neuralnet_pytorch.ext' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/neuralnet_pytorch
creating build/temp.linux-x86_64-3.6/neuralnet_pytorch/extensions
creating build/temp.linux-x86_64-3.6/neuralnet_pytorch/extensions/csrc
gcc -pthread -B /home/yzhang4/anaconda3/envs/graphxx/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Ineuralnet_pytorch/extensions/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/TH -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/yzhang4/anaconda3/envs/graphxx/include/python3.6m -c neuralnet_pytorch/extensions/csrc/bindings.cpp -o build/temp.linux-x86_64-3.6/neuralnet_pytorch/extensions/csrc/bindings.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from neuralnet_pytorch/extensions/include/bpd.h:2:0,
from neuralnet_pytorch/extensions/csrc/bindings.cpp:1:
/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/torch.h:7:2: warning: #warning "Including torch/torch.h for C++ extensions is deprecated. Please include torch/extension.h" [-Wcpp]
   #warning \
   ^~~
gcc -pthread -B /home/yzhang4/anaconda3/envs/graphxx/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Ineuralnet_pytorch/extensions/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/TH -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/yzhang4/anaconda3/envs/graphxx/include/python3.6m -c neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp -o build/temp.linux-x86_64-3.6/neuralnet_pytorch/extensions/csrc/chamfer_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from neuralnet_pytorch/extensions/include/chamfer_cuda.h:2:0,
from neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:1:
/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/torch.h:7:2: warning: #warning "Including torch/torch.h for C++ extensions is deprecated. Please include torch/extension.h" [-Wcpp]
   #warning \
   ^~~
In file included from neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:3:0:
neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp: In function ‘std::vector chamfer_forward(at::Tensor, at::Tensor)’:
neuralnet_pytorch/extensions/include/utils.h:6:3: error: ‘TORCH_CHECK’ was not declared in this scope
   TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
   ^
neuralnet_pytorch/extensions/include/utils.h:10:3: note: in expansion of macro ‘CHECK_CUDA’
   CHECK_CUDA(x); \
   ^~~~~~
neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:16:3: note: in expansion of macro ‘CHECK_INPUT’
   CHECK_INPUT(xyz1);
   ^~~
neuralnet_pytorch/extensions/include/utils.h:6:3: note: suggested alternative: ‘AT_CHECK’
   TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
   ^
neuralnet_pytorch/extensions/include/utils.h:10:3: note: in expansion of macro ‘CHECK_CUDA’
   CHECK_CUDA(x); \
   ^~~~~~
neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:16:3: note: in expansion of macro ‘CHECK_INPUT’
   CHECK_INPUT(xyz1);
neuralnet_pytorch/extensions/include/utils.h:10:3: note: in expansion of macro ‘CHECK_CUDA’
   CHECK_CUDA(x); \
   ^~~~~~
neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:28:3: note: in expansion of macro ‘CHECK_INPUT’
   CHECK_INPUT(xyz1);
   ^~~
error: command 'gcc' failed with exit status 1

Rolling back uninstall of neuralnet-pytorch
Moving to /home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/neuralnet_pytorch-1.0.0+fancy.166.gcbb0c5a-py3.6.egg-info
from /home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/~euralnet_pytorch-1.0.0+fancy.166.gcbb0c5a-py3.6.egg-info
Moving to /home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/neuralnet_pytorch/
from /home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/~euralnet_pytorch
ERROR: Command errored out with exit status 1: /home/yzhang4/anaconda3/envs/graphxx/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cuda-ext install --record /tmp/pip-record-jzi_zomv/install-record.txt --single-version-externally-managed --compile --install-headers /home/yzhang4/anaconda3/envs/graphxx/include/python3.6m/neuralnet-pytorch Check the logs for full command output.

zhangyahu1 commented 3 years ago

I am now running the code without GPUs. After several epochs of training, I get this error:

Traceback (most recent call last):
  File "train.py", line 93, in <module>
    train_valid()
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/config.py", line 1032, in wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/utils.py", line 48, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/config.py", line 1009, in wrapper
    return fn(*new_args, **new_kwargs)
  File "train.py", line 87, in train_valid
    valid_freq=val_freq, reduce='mean')
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch/monitor.py", line 932, in run_training
    raise ValueError('NaN or Inf encountered. Training failed!')
ValueError: NaN or Inf encountered. Training failed!

I would appreciate it if you could give me some advice on solving this problem.
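For anyone hitting the same message: the traceback shows it is raised from run_training in neuralnet_pytorch/monitor.py, presumably once a monitored value stops being finite. A minimal sketch of how one might find the first iteration where the loss goes bad is below. The function name and the sample loss values are hypothetical; in a real run the values would come from something like loss.item().

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/Inf loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Hypothetical per-iteration loss values, e.g. collected with loss.item().
history = [0.92, 0.47, float("inf"), float("nan")]
bad_step = first_non_finite(history)
print(bad_step)  # 2
```

Once the failing step is known, PyTorch's torch.autograd.set_detect_anomaly(True) can help trace which backward operation produced the NaN, at the cost of slower training.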

justanhduc commented 3 years ago

Hi @zhangyahu1. Could you please give me more details about your conda and PyTorch environments? The error comes from TORCH_CHECK, which is not available in early versions of PyTorch.

zhangyahu1 commented 3 years ago

Hi @justanhduc, my environment is:

pytorch 1.5.1
torchvision 0.6.1
cudatoolkit 10.1
python 3.6

The code runs now, but I get the following error:

    raise ValueError('NaN or Inf encountered. Training failed!')
ValueError: NaN or Inf encountered. Training failed!

justanhduc commented 3 years ago

Hi @zhangyahu1. Are you able to run on GPU now? I think I used PyTorch 1.7 for this code. Could you please try again?
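A quick way to compare an installed PyTorch version with the 1.7 mentioned above is sketched here. The helper names are hypothetical, and the version string is passed in as plain text so the snippet runs without PyTorch installed; in practice one would pass torch.__version__.

```python
import re


def version_tuple(version):
    """Parse '1.5.1' or '1.7.0+cu101' into a tuple of ints such as (1, 5, 1)."""
    release = re.match(r"\d+(?:\.\d+)*", version)
    if release is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in release.group(0).split("."))


def meets_minimum(installed, minimum="1.7"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("1.5.1"))        # False
print(meets_minimum("1.7.0+cu101"))  # True
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "1.10" sorting before "1.7".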

zhangyahu1 commented 3 years ago

Thanks, @justanhduc! I will try running the code with PyTorch 1.7.

zhangyahu1 commented 3 years ago

Hi @justanhduc, it works now with PyTorch 1.7. However, the code only runs well on a small subset of the data. When I use all the data, it raises 'CUDA out of memory' within one epoch. I wonder whether memory is not being released during training.
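One way to test the "memory is not released" hypothesis is to record how much allocated memory changes after each training step. The sketch below is only an illustration: memory_growth, leaky_step and fake_allocated are hypothetical names, and the leak is simulated with byte buffers so the snippet runs without a GPU. With PyTorch, torch.cuda.memory_allocated could be passed as get_allocated.

```python
def memory_growth(step_fn, get_allocated, num_steps):
    """Run step_fn num_steps times and return the change in allocated bytes after each step."""
    deltas = []
    previous = get_allocated()
    for _ in range(num_steps):
        step_fn()
        current = get_allocated()
        deltas.append(current - previous)
        previous = current
    return deltas


# Stand-in for a training step that leaks: it keeps every result alive.
kept = []


def leaky_step():
    kept.append(bytearray(1024))


def fake_allocated():
    return sum(len(buf) for buf in kept)


deltas = memory_growth(leaky_step, fake_allocated, num_steps=3)
print(deltas)  # [1024, 1024, 1024]
```

Steady positive deltas would point to something being retained between iterations; a common culprit in PyTorch code is accumulating the loss tensor itself instead of loss.item(). If the deltas stay near zero, the out-of-memory error more likely comes from a single batch or sample that is simply too large.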

zhangyahu1 commented 3 years ago

It seems I was using the wrong version of neuralnet-pytorch. I then downloaded the right version and ran python setup.py install --cuda-ext. However, running the code now raises the following error:

Traceback (most recent call last):
  File "train.py", line 16, in <module>
    import neuralnet_pytorch.gin_nnt as gin
  File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch-1.0.0+unknown-py3.6-linux-x86_64.egg/neuralnet_pytorch/__init__.py", line 38, in <module>
    import neuralnet_pytorch.ext as ext
ImportError: /home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch-1.0.0+unknown-py3.6-linux-x86_64.egg/neuralnet_pytorch/ext.cpython-36m-x86_64-linux-gnu.so: undefined symbol: PyThread_tss_create

I would appreciate it if you could give me some suggestions.
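A note on the symbol itself: PyThread_tss_create belongs to the thread-specific storage C API that CPython added in version 3.7 (PEP 539), so a Python 3.6 interpreter does not provide it. One plausible reading, not confirmed in this thread, is that the extension or a library it links against was built for a newer Python than the 3.6 environment loading it. A minimal sketch for checking whether the running interpreter exports a given C API symbol (the helper name is hypothetical):

```python
import ctypes
import sys


def interpreter_exports(symbol):
    """Return True if the running CPython exposes the given C API symbol."""
    # ctypes.pythonapi raises AttributeError for symbols it cannot resolve,
    # which hasattr() turns into False.
    return hasattr(ctypes.pythonapi, symbol)


print(sys.version_info[:2], interpreter_exports("PyThread_tss_create"))
```

If this prints False inside the environment that fails to import neuralnet_pytorch.ext, rebuilding the extension inside that same environment, or moving the environment to a newer Python, would be the things to try first.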

justanhduc / graphx-conv

Training process #14
