Hi,
PyTorch changed substantially between v0.3 and v0.4, and I'm migrating the code to support those changes. In the meantime, I recommend staying on PyTorch 0.3.1.
Your GPU needs CUDA > 9.0, so please install PyTorch 0.3.1 with CUDA 9.1 using:
pip uninstall torch torchvision
pip install https://download.pytorch.org/whl/cu91/torch-0.3.1-cp36-cp36m-linux_x86_64.whl
More information about previous PyTorch versions is available on the PyTorch page.
(p3p) home@home-lnx:~/Desktop/programs/P2PaLA$ python P2PaLA.py --config config_BL_only.txt --tr_data ./data/train --te_data ./data/test --log_comment "_foo"
2019-01-21 15:37:56,527 - optparse - INFO - Reading configuration from config_BL_only.txt
2019-01-21 15:37:56,529 - P2PaLA - INFO - Working on training stage...
2019-01-21 15:37:56,529 - P2PaLA - WARNING - tensorboardX is not installed, display logger set to OFF.
2019-01-21 15:37:56,529 - P2PaLA - INFO - Preprocessing data from ./data/train
Traceback (most recent call last):
File "P2PaLA.py", line 1262, in <module>
main()
File "P2PaLA.py", line 528, in main
y_gen = nnG(x)
File "/home/home/.conda/envs/p3p/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/home/Desktop/programs/P2PaLA/nn_models/models.py", line 94, in forward
return self.model(input_x)
File "/home/home/.conda/envs/p3p/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/home/Desktop/programs/P2PaLA/nn_models/models.py", line 184, in forward
return F.log_softmax(self.model(input_x), dim=1)
File "/home/home/.conda/envs/p3p/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/home/.conda/envs/p3p/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/home/home/.conda/envs/p3p/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/home/.conda/envs/p3p/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
self.padding, self.dilation, self.groups)
File "/home/home/.conda/envs/p3p/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
I don't think the issue is related to your Ubuntu version, but you do need to install the right combination of CUDA and PyTorch. If you have CUDA 9.1 and Python 3.6 installed, the command I posted before should work. If you have another combination, such as CUDA 9.0 or Python 2.7, you need to find the matching PyTorch build (on the PyTorch website).
I just tested it using Python 3.5 and CUDA 9.1 on a GTX 1080 and a TITAN X, and it works (I don't have an RTX card to test it on).
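To make the "right combination" point concrete, here is a minimal sketch that splits the wheel URL quoted earlier in this thread into its parts and compares the Python tag with the running interpreter. The helper names `parse_wheel_url` and `interpreter_tag` are made up for illustration, and the parsing assumes the simple five-part wheel filename seen here (no optional build tag) and a CPython interpreter.

```python
import sys

WHEEL_URL = (
    "https://download.pytorch.org/whl/cu91/"
    "torch-0.3.1-cp36-cp36m-linux_x86_64.whl"
)


def parse_wheel_url(url):
    """Split a wheel URL into the tags that must match the local setup.

    Hypothetical helper: assumes the last URL directory is the CUDA tag
    and the filename has exactly five dash-separated fields.
    """
    parts = url.rstrip("/").split("/")
    cuda_tag = parts[-2]                 # "cu91" means built against CUDA 9.1
    stem = parts[-1][: -len(".whl")]
    name, version, py_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "cuda": cuda_tag,
        "python": py_tag,                # "cp36" means CPython 3.6
        "abi": abi_tag,
        "platform": platform_tag,
    }


def interpreter_tag():
    """Return the CPython tag of the running interpreter, e.g. 'cp36'."""
    return "cp%d%d" % sys.version_info[:2]


if __name__ == "__main__":
    info = parse_wheel_url(WHEEL_URL)
    print(info)
    if info["python"] != interpreter_tag():
        print("Wheel targets %s but this interpreter is %s"
              % (info["python"], interpreter_tag()))
```

If the tags do not match, pip will normally refuse the wheel as unsupported on the platform. The CUDA tag is not checked by pip, so it has to be compared with the installed toolkit by hand.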
Same error, even after installing Cuda 9.1
(p3p) home@home-lnx:~/Desktop/programs/P2PaLA$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 410.48 Thu Sep 6 06:36:33 CDT 2018
GCC version: gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
(p3p) home@home-lnx:~/Desktop/programs/P2PaLA$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
hmmmm.... it seems that RTX cards don't support Cuda 9.1, that's weird.
Will you consider supporting Cuda 10 via Pytorch 1?
I'm migrating the code to support those changes.
Yes, my goal is to migrate all the code to the latest version of PyTorch, but I'm a bit short of time right now and I don't think I will release a new version in the next couple of weeks. Thanks for spotting the issue with the new GPUs. I will try to migrate the code as soon as possible.
In the meanwhile, you can use the tool for inference using the pre-trained model available on CPU (just add the option --gpu -1).
Hoping that you will support CUDA 10. Thank you.
home@home-lnx:~/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce RTX 2070"
CUDA Driver Version / Runtime Version 10.0 / 9.1
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7951 MBytes (8337227776 bytes)
MapSMtoCores for SM 7.5 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 7.5 is undefined. Default to use 64 Cores/SM
(36) Multiprocessors, ( 64) CUDA Cores/MP: 2304 CUDA Cores
GPU Max Clock rate: 1815 MHz (1.81 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 46 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 9.1, NumDevs = 1
Result = PASS
@lquirosd Can you share this information, which versions are you using:
Yes, the software has been tested on several configurations:
cuDNN: 5, 6, 7
CUDA: 8, 9
PyTorch: 0.3*
Python: 2.7, 3.5, 3.6
OS: Ubuntu 16.04 for training; Ubuntu 16.04 and Mac OS 10.13 for testing
My current set-up is
>>> sys.version
'3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n[GCC 7.3.0]'
>>> torch.__version__
'0.3.1'
>>> torch.version.cuda
'8.0.61'
>>> torch.backends.cudnn.version()
7005
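For anyone else reporting their set-up, the same values can be collected in one go. This is only a sketch: the function name `environment_report` is made up for illustration, and it returns just the Python version when PyTorch is not importable.

```python
import sys


def environment_report():
    """Collect the version details asked for in this thread.

    Hypothetical helper; uses the same attributes as the interactive
    session above (torch.__version__, torch.version.cuda,
    torch.backends.cudnn.version()).
    """
    report = {"python": sys.version.split()[0], "torch": None}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return report
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda
    report["cudnn"] = torch.backends.cudnn.version()
    report["gpu_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print("%s: %s" % (key, value))
```

Note that `torch.version.cuda` is the CUDA version the PyTorch binary was built against, which can differ from the toolkit that `nvcc --version` reports.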
The problem seems to be that RTX cards have compute capability 7.5 and are only supported from CUDA 10 onward, which the Nvidia forums confirmed to me.
@lquirosd Will you consider upgrading to Pytorch 1.0 ? Note: CUDA 10 support for compute capability 3.0 – 7.5 (Kepler, Maxwell, Pascal, Volta, Turing)
Hi, did you change the "batch_size" parameter to fit your card? The default is 8 images per mini-batch, but the RTX 2070 has only 8 GB of memory. I think it will support a maximum mini-batch of 4 images or so. Can you please run an experiment using a smaller mini-batch?
@lquirosd
This is not a memory issue. RTX cards (Turing) are first supported in CUDA 10, and PyTorch 1.0 supports CUDA 10, 9, and 8, so the only solution is to upgrade the code to PyTorch 1.0.
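The version argument can be written down as a small lookup. The sketch below is a simplification: the table only lists the first CUDA release that supports a few recent compute capabilities (Pascal 6.x with CUDA 8.0, Volta 7.0 with CUDA 9.0, Turing 7.5 with CUDA 10.0), and the names `MIN_CUDA_FOR_CAPABILITY` and `runtime_supports` are made up for illustration. On a machine with a working PyTorch install, `torch.cuda.get_device_capability()` should give the capability tuple.

```python
# First CUDA release with support for each compute capability.
# Deliberately incomplete: only the architectures relevant to this thread.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal (e.g. GTX 1080)
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing (e.g. RTX 2070)
}


def parse_version(text):
    """Turn a 'major.minor' string such as '9.1' into a tuple (9, 1)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def runtime_supports(capability, cuda_runtime):
    """Return True if the CUDA runtime is new enough for the capability.

    Raises KeyError for capabilities missing from the table above.
    """
    required = MIN_CUDA_FOR_CAPABILITY[capability]
    return parse_version(cuda_runtime) >= required


if __name__ == "__main__":
    # Values taken from the deviceQuery output earlier in the thread:
    # capability 7.5, runtime 9.1.
    print(runtime_supports((7, 5), "9.1"))
    print(runtime_supports((7, 5), "10.0"))
```

With the values from the deviceQuery output (capability 7.5, runtime 9.1) the check fails, while the same runtime passes for a capability 6.1 card such as the GTX 1080.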
I just released a new branch for PyTorch 1.0:
git clone --single-branch --branch PyTorch-v1.0 https://github.com/lquirosd/P2PaLA.git
Please note that this branch is not fully tested, so some bugs may remain. I ran some tests on PyTorch 1.0.0, CUDA 9.0, and cuDNN 7401, but CUDA 10 is untested.
May you find peace in your life. Thank you
On my Linux mint 19.1 using an RTX 2070
When trying to recognize using the default installation:
So I installed latest torch and torchvision:
Then ran recognition:
Now the problem is when trying to train