phillipi / pix2pix

Image-to-image translation with conditional adversarial nets
https://phillipi.github.io/pix2pix/

THCudaCheck FAIL : out of memory on train.lua and test.sh #99

Open ashleyjamesbrown opened 7 years ago

ashleyjamesbrown commented 7 years ago

New iMac with a fresh install, and I'm having some problems installing pix2pix / torch (I had to downgrade the CLT to get install-deps and other commands running). Just when I think a train command is going to work, I get the error below, which I'm sure is GPU related; it also appears in the ./test.sh output.


THCudaCheck FAIL file=/Users/ashleyjamesbrown/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory

stack traceback:
    [C]: in function 'resize'
    ...shleyjamesbrown/torch/install/share/lua/5.1/nn/utils.lua:11: in function 'torch_Storage_type'
    ...shleyjamesbrown/torch/install/share/lua/5.1/nn/utils.lua:57: in function 'recursiveType'
    ...hleyjamesbrown/torch/install/share/lua/5.1/nn/Module.lua:160: in function 'type'
    ...mesbrown/torch/install/share/lua/5.1/nngraph/gmodule.lua:258: in function 'cuda'
    train.lua:190: in main chunk
    [C]: in function 'dofile'
    ...rown/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x0106db9330

I'm only trying to train on 10 images on the GPU, so I doubt it's actually running out of memory?

I've recently tried a clean and update in the torch directory.

I installed CUDA, built a few of the samples with make, and ran them successfully, so I'm sure CUDA itself is installed OK.
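
For what it's worth, one way to sanity-check whether the card is genuinely out of memory is to query free GPU memory from a Torch shell before training. A minimal sketch, assuming a standard cutorch install (the device index 1 is an assumption for a single-GPU machine):

    -- run inside `th`: print free vs. total memory on GPU 1
    require 'cutorch'
    local free, total = cutorch.getMemoryUsage(1)
    print(string.format('free: %.0f MB / total: %.0f MB', free / 2^20, total / 2^20))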

System: macOS 10.12.5, 3.1 GHz Core i7, GeForce GT 750M 1024 MB

Xcode 8.3.3 CLT installed and switched to 8.2

CUDA 8.0.83 GPU Driver Version: 10.17.5 355.10.05.45f01

Installed cuDNN 5.1 for CUDA 8 (cuDNN 6 is on the machine, but linking against it didn't work, so it's in a backup folder).

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:46_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

149 0 0xffffff7f83889000 0x2000 0x2000 com.nvidia.CUDA (1.1.0) DD792765-CA28-395A-8593-D6837F05C4FF <4 1>

I've been through a lot of Google searching and tried various things, but nothing has come up trumps.

If this should be in the torch issues instead, let me know and I'll remove / move it.

ashleyjamesbrown commented 7 years ago
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 750M"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073283072 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            926 MHz (0.93 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 750M
Result = PASS
ashleyjamesbrown commented 7 years ago
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GT 750M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         4997.3

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         9886.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         45863.0

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
junyanz commented 7 years ago

I believe 1 GB of GPU memory might not be enough for training the model. You can try training on lower-resolution images (e.g. loadSize=143, fineSize=128).
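
A minimal sketch of what that lower-resolution run might look like, following the facades example from the README (the dataset path and experiment name are placeholders):

    DATA_ROOT=./datasets/facades name=facades_lowres which_direction=BtoA \
    loadSize=143 fineSize=128 gpu=1 th train.lua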

ashleyjamesbrown commented 7 years ago

@junyanz When running ./test.sh from the torch directory, though, I get similar errors thrown up on certain models it tries to test? Yeah, the CUDA examples all run fine, so does torch just need more GPU memory then?

I'd be pretty sad if the new machine I just bought with an NVIDIA GPU wasn't good enough. I already had a Mac running the CPU commands, but it was taking days, so I purchased another and it's no better.

I tried your suggestion, but this also failed, even setting the sizes as low as 10.

I wonder if it's because I have cuDNN 5.1, but if I change up to cuDNN 6 then I get binding errors?

ashleyjamesbrown commented 7 years ago

If I change to use cuDNN 6 then I get no errors in ./test.sh, but I do get a bindings error instead.

Found Environment variable CUDNN_PATH = /Users/ashleyjamesbrown/cuda6/lib/libcudnn.6.dylib
/Users/ashleyjamesbrown/torch/install/bin/luajit: ...hleyjamesbrown/torch/install/share/lua/5.1/cudnn/ffi.lua:1618: These bindings are for CUDNN 5.x (5005 <= cudnn.version > 6000) , while the loaded CuDNN is version: 6021
Are you using an older or newer version of CuDNN?
stack traceback:
    [C]: in function 'error'
    ...hleyjamesbrown/torch/install/share/lua/5.1/cudnn/ffi.lua:1618: in main chunk
    [C]: in function 'require'
    ...leyjamesbrown/torch/install/share/lua/5.1/cudnn/init.lua:4: in main chunk
    [C]: at 0x0101aaae10
    [C]: at 0x0101a2e330
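
For reference, the 5.x bindings pick up the library through the CUDNN_PATH variable, so one workaround is to point it back at the cuDNN 5.1 dylib. A rough sketch (the 5.1 path is a placeholder for wherever the backed-up library actually lives):

    # placeholder path: point the 5.x bindings at the cuDNN 5.1 dylib instead of 6.x
    export CUDNN_PATH=/Users/ashleyjamesbrown/cuda/lib/libcudnn.5.dylib
    th -e "require 'cudnn'; print(cudnn.version)"   # should print a 5.x version if it loads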
ashleyjamesbrown commented 7 years ago

Thought I would update. I fixed the binding errors with the torch fix from soumith, using the v6 branch. I still get GPU errors running the test.sh script inside torch, yet I can compile the CUDA examples and run them fine; I cannot run the GPU for training.

I ran a Geekbench check and it came up with 2 GPUs (Intel Iris Pro and NVIDIA GeForce), so I tried altering the gpu=1 line in case it was trying to select the Iris, but that gave me an error that it didn't have a GPU.
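
For anyone hitting the same binding error, the fix was roughly checking out the cuDNN 6 (R6) branch of soumith's cudnn.torch and rebuilding the rock; a sketch, assuming the branch and rockspec names are unchanged:

    # rebuild the cudnn bindings from the cuDNN 6-compatible branch (names assumed)
    git clone -b R6 https://github.com/soumith/cudnn.torch.git
    cd cudnn.torch
    luarocks make cudnn-scm-1.rockspec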

In the end I stayed with CPU training, which took much, much longer. I'll come back to trying the GPU again in the future. I also just saw the CUDA 9 release, so, as with this stuff, things keep moving; I'll try again in a month or so.