lzhengning / SubdivNet

Subdivision-based Mesh Convolutional Networks.
MIT License

How many GPUs do you utilize? #7

Closed liang3588 closed 3 years ago

liang3588 commented 3 years ago

name: manifold40
Train 0: 0%| | 0/2049 [00:00<?, ?it/s]
Compiling Operators(1/1) used: 19.2s eta: 0s
Compiling Operators(1/1) used: 13s eta: 0s
Compiling Operators(1/1) used: 13s eta: 0s
Compiling Operators(1/1) used: 13.8s eta: 0s
Compiling Operators(56/56) used: 78.5s eta: 0s
Compiling Operators(44/44) used: 60.4s eta: 0s
Compiling Operators(1/1) used: 16s eta: 0s
Train 0: 0%| | 1/2049 [03:56<134:34:05, 236.55s/it]
Compiling Operators(2/2) used: 19.3s eta: 0s
Compiling Operators(1/1) used: 14.4s eta: 0s
Train 0: 1%|▋ | 21/2049 [04:52<1:14:44, 2.21s/it]
[w 0624 09:21:26.817385 20 cudnn_conv_Tx:float32Ty:float32Tw:float32XFORMAT:abcdWFORMAT:oihwYFORMAT:abcdJ...hash:798cb5ed49dadaa2op.cc:200] forward algorithm cache is full
[w 0624 09:21:27.423196 20 cudnn_conv_backward_w_Tx:float32Ty:float32Tw:float32XFORMAT:abcdWFORMAT:oihw__YFOR...hash:c41e3d43aa5d4cf7_op.cc:199] backward w algorithm cache is full
Train 0: 1%|▊ | 24/2049 [04:56<56:14, 1.67s/it]
[w 0624 09:21:31.279631 20 cudnn_conv_backward_x_Tx:float32Ty:float32Tw:float32XFORMAT:abcdWFORMAT:oihw__YFOR...hash:74f24b7a5fa4fe17_op.cc:201] backward x algorithm cache is full
Train 0: 65%|███████████████████████████████████████████▌ | 1332/2049 [1:21:43<51:18, 4.29s/it]

The training process consumes too much time. I use a single GPU (12 GB) for training, and an hour later Train 0 (the first epoch) is still not finished. I noticed that you mentioned that using multiple GPUs could speed up training, so I would like to know how many GPUs you use and how long the training takes.

Thank you very much!

lzhengning commented 3 years ago

There are 10 randomly remeshed variants of each shape in the downloaded data, named 'Manifold40-MAPS-96-3'. So there are many more iterations in an epoch than there would be if the data were augmented on the fly.
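
For a rough sense of scale, here is an illustrative back-of-the-envelope (the shape count and batch size below are hypothetical, not the actual repo configuration):

```python
# Illustrative only; num_shapes and batch_size are placeholders.
num_shapes         = 9840   # hypothetical train-split size
variants_per_shape = 10     # Manifold40-MAPS-96-3 ships 10 remeshings per shape
batch_size         = 48     # hypothetical batch size

iters_per_epoch = (num_shapes * variants_per_shape) // batch_size
print(iters_per_epoch)      # ~2050, on the order of the 2049 iterations in the log above
```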

As for multiple GPUs, I usually use 2 TITAN RTXs.

amiltonwong commented 3 years ago

Hi, @liang3588 ,

I also encountered a situation similar to yours, and I can see that the GPU is not actually working (both the GPU temperature and the power usage drop to low values, which is abnormal during training). You can check it with the command nvidia-smi -l 2 when you are stuck.

amiltonwong commented 3 years ago

As a supplement, here is the nvidia-smi output during the stuck situation:

Thu Jun 24 00:00:42 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:01:00.0  On |                  N/A |
| 27%   41C    P8    10W / 250W |   6801MiB / 12194MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1049      G   /usr/lib/xorg/Xorg                           410MiB |
|    0      2083      G   compiz                                       176MiB |
|    0      3667      G   ...equest-channel-token=115321121760856269   104MiB |
|    0      5258      G   /tmp/.mount_Link tc5LG7H/usr/bin/meshlab      12MiB |
|    0     31541      C   python3.7                                    373MiB |
+-----------------------------------------------------------------------------+

@lzhengning , any suggestion ? Thanks~

lzhengning commented 3 years ago

Hi, @amiltonwong

Could you please provide your running logs?

amiltonwong commented 3 years ago

Hi, @lzhengning ,

The download link for the running log.

lzhengning commented 3 years ago

@amiltonwong, the tensorboard log file shows that 95 iterations took 52 seconds. This seems to be a normal training speed.

But I am also puzzled that your GPU utilization is abnormally low. When jittor is imported in Python, it prints information about the GPU environment to the screen. Could you copy the screen output from running the training script?
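
For reference, the banner can be reproduced with a couple of lines (a minimal sketch, assuming a standard Jittor install):

```python
import jittor as jt        # the compiler / nvcc / cudnn discovery lines are printed here

print("has_cuda:", jt.has_cuda)
jt.flags.use_cuda = 1      # prints "CUDA enabled." when a usable GPU is found
```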

lzhengning commented 3 years ago

@liang3588, I checked the training script again on a single TITAN RTX (24 GB). It takes about 20 minutes to finish an epoch. The GPU memory consumed is about 18.7 GB, and nvidia-smi shows that the Volatile GPU-Util is about 90%.

I do not know how much memory your GPU has, but if it is less than 18.7 GB, that may be why your program is slower than expected. One feature of Jittor is unified memory, which allows the program to make use of CPU memory when GPU memory is fully occupied. This feature avoids crashing due to OOM, but using CPU memory is also slower. So if you run out of GPU memory, you can try to decrease the batch size.
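
For reference, in Jittor the batch size is usually set on the dataset via set_attrs. A minimal sketch (the dataset and field names here are placeholders, not SubdivNet's actual classes):

```python
import numpy as np
from jittor.dataset import Dataset

class ToyMeshDataset(Dataset):
    """Placeholder dataset, only to show where batch_size is configured."""
    def __init__(self, n_samples=1024, batch_size=16):
        super().__init__()
        # total_len is required; lower batch_size here if GPU memory runs out
        self.set_attrs(total_len=n_samples, batch_size=batch_size, shuffle=True)

    def __getitem__(self, idx):
        # random features / label, just to keep the sketch self-contained
        return np.random.rand(13, 4096).astype("float32"), idx % 40

dataset = ToyMeshDataset(batch_size=8)   # e.g. halve the batch size and retry
```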

amiltonwong commented 3 years ago

@lzhengning , you're right. The extra memory was contributed by the CPU, hence the very slow performance I got. The issue is fixed after setting batch_size to a lower value. BTW, one should increase the batch size incrementally to check whether this issue occurs. Does Jittor have an option to disable using extra CPU memory and just report an out-of-GPU-memory error when the consumption on the GPU exceeds the limit?

lzhengning commented 3 years ago

@amiltonwong , you can set the environment variable use_cuda_managed_allocator=0 to turn off unified memory. This is experimental, so be cautious when using it.
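
For example (a sketch; the variable needs to be in the environment before jittor is imported, since the startup log above prints "Load use_cuda_managed_allocator: 0" during initialization):

```python
import os
os.environ["use_cuda_managed_allocator"] = "0"   # turn off unified memory

import jittor as jt                              # must come after the env var is set
jt.flags.use_cuda = 1
```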

Instead, I would rather suggest checking nvidia-smi after several iterations to make sure the GPU memory is not fully occupied. Unified memory is useful when other programs are using the GPU while I am training, and this mechanism can also handle variable input sizes, as in Manifold40. Anyway, being slower is better than crashing.
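
If it helps, a small helper along these lines could be called every few hundred iterations (a hypothetical snippet, not part of SubdivNet; the nvidia-smi query flags are standard):

```python
import subprocess

def gpu_memory_mib(gpu_index=0):
    """Return (used, total) GPU memory in MiB as reported by nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]).decode()
    used, total = map(int, out.splitlines()[gpu_index].split(","))
    return used, total

used, total = gpu_memory_mib()
if used > 0.95 * total:
    print("GPU memory is nearly full; consider lowering the batch size")
```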

amiltonwong commented 3 years ago

Hi, @lzhengning, thanks for your suggestion. I turned off unified memory by setting the env variable use_cuda_managed_allocator=0. However, when I increase batch_size to 4 (GPU usage is around 4 GB), I encounter the following error after completing the first epoch:

(cuda11.2_jittor) root@milton-LabPC:/media/root/mdata/data/code14/SubdivNet# sh scripts/shrec11-split10/train.sh
[i 0628 06:10:28.976273 60 compiler.py:869] Jittor(1.2.3.47) src: /root/anaconda3/envs/cuda11.2_jittor/lib/python3.7/site-packages/jittor
[i 0628 06:10:28.981315 60 compiler.py:870] g++ at /usr/bin/g++(7.5.0)
[i 0628 06:10:28.981453 60 compiler.py:871] cache_path: /root/.cache/jittor/default/g++
[i 0628 06:10:28.987231 60 compiler.py:817] Found nvcc(11.2.67) at /usr/local/cuda-11.2/bin/nvcc
[i 0628 06:10:29.405254 60 __init__.py:286] Found gdb(7.11.1) at /usr/bin/gdb.
[i 0628 06:10:29.449550 60 __init__.py:286] Found addr2line(2.26.1) at /usr/bin/addr2line.
[i 0628 06:10:29.496426 60 compiler.py:951] py_include: -I/root/anaconda3/envs/cuda11.2_jittor/include/python3.7m -I/root/anaconda3/envs/cuda11.2_jittor/include/python3.7m
[i 0628 06:10:29.518145 60 compiler.py:953] extension_suffix: .cpython-37m-x86_64-linux-gnu.so
[i 0628 06:10:29.732812 60 compiler.py:1086] OS type:ubuntu OS key:ubuntu
[i 0628 06:10:29.733696 60 __init__.py:178] Total mem: 62.82GB, using 16 procs for compiling.
[i 0628 06:10:29.896199 60 cuda_managed_allocator.cc:15] Load use_cuda_managed_allocator: 0
[i 0628 06:10:30.053064 60 jit_compiler.cc:21] Load cc_path: /usr/bin/g++
[i 0628 06:10:30.053086 60 jit_compiler.cc:24] Load nvcc_path: /usr/local/cuda-11.2/bin/nvcc
[i 0628 06:10:30.053254 60 init.cc:55] Found cuda archs: [86,]
[i 0628 06:10:30.119039 60 compile_extern.py:444] mpicc not found, distribution disabled.
[i 0628 06:10:30.232331 60 compile_extern.py:20] found /usr/local/cuda-11.2/include/cublas.h
[i 0628 06:10:30.240068 60 compile_extern.py:20] found /usr/local/cuda-11.2/lib64/libcublas.so
[i 0628 06:10:30.240183 60 compile_extern.py:20] found /usr/local/cuda-11.2/lib64/libcublasLt.so.11
[i 0628 06:10:30.879192 60 compile_extern.py:20] found /usr/local/cuda-11.2/include/cudnn.h
[i 0628 06:10:30.913814 60 compile_extern.py:20] found /usr/local/cuda-11.2/lib64/libcudnn.so.8
[i 0628 06:10:30.913917 60 compile_extern.py:20] found /usr/local/cuda-11.2/lib64/libcudnn_ops_infer.so.8
[i 0628 06:10:30.949406 60 compile_extern.py:20] found /usr/local/cuda-11.2/lib64/libcudnn_ops_train.so.8
[i 0628 06:10:30.958938 60 compile_extern.py:20] found /usr/local/cuda-11.2/lib64/libcudnn_cnn_infer.so.8
[i 0628 06:10:31.175849 60 compile_extern.py:20] found /usr/local/cuda-11.2/lib64/libcudnn_cnn_train.so.8
[i 0628 06:10:31.237961 60 compiler.py:667] handle pyjt_include/root/anaconda3/envs/cuda11.2_jittor/lib/python3.7/site-packages/jittor/extern/cuda/cudnn/inc/cudnn_warper.h
[i 0628 06:10:31.892025 60 compile_extern.py:20] found /usr/local/cuda-11.2/include/curand.h
[i 0628 06:10:31.937643 60 compile_extern.py:20] found /usr/local/cuda-11.2/lib64/libcurand.so
[i 0628 06:10:31.976521 60 cuda_flags.cc:26] CUDA enabled.
name:  shrec11-split10
Train 0:   0%|                                                                  | 0/1575 [00:00<?, ?it/s]
Compiling Operators(1/1) used: 4.12s eta:    0s 

Compiling Operators(1/1) used: 2.93s eta:    0s 

Compiling Operators(1/1) used: 3.76s eta:    0s 

Compiling Operators(1/1) used:  4.1s eta:    0s 

Compiling Operators(56/56) used: 29.7s eta:    0s 

Compiling Operators(43/43) used: 22.6s eta:    0s 

Compiling Operators(1/1) used: 3.87s eta:    0s 
Train 0:   0%|                                                       | 1/1575 [01:22<35:53:32, 82.09s/it]
Compiling Operators(2/2) used: 4.31s eta:    0s 

Compiling Operators(1/1) used: 4.14s eta:    0s 
Train 0: 100%|███████████████████████████████████████████████████████| 1575/1575 [04:24<00:00,  5.94it/s]
train acc =  0.29095238095238096
Test 0: 100%|████████████████████████████████████████████████████████| 1575/1575 [01:56<00:00, 13.49it/s]
test acc =  0.4176190476190476
test acc [voted] =  0.45
Traceback (most recent call last):
  File "train_cls.py", line 185, in <module>
    test(net, test_dataset, writer, epoch, args)
  File "/root/anaconda3/envs/cuda11.2_jittor/lib/python3.7/site-packages/jittor/__init__.py", line 257, in inner
    ret = func(*args, **kw)
  File "train_cls.py", line 86, in test
    net.save(os.path.join('checkpoints', name, f'vacc-{vacc:.4f}.pkl'))
  File "/root/anaconda3/envs/cuda11.2_jittor/lib/python3.7/site-packages/jittor/__init__.py", line 967, in save
    params_dict[p.name()] = p.data
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.data)).

Types of your inputs are:
 self   = Var,

The function declarations are:
 inline DataView data()

Failed reason:[f 0628 06:16:58.722908 60 helper_cuda.h:126] CUDA error at /root/anaconda3/envs/cuda11.2_jittor/lib/python3.7/site-packages/jittor/src/mem/allocator.cc:110  code=1( cudaErrorInvalidValue ) cudaMemcpy(a.ptr, var->mem_ptr, var->size, cudaMemcpyDeviceToHost)

Does something special happen after each epoch? Could you give some hints to fix this issue? Thanks~

lzhengning commented 3 years ago

The last line of the log says that there is a problem when copying GPU variables to the CPU. I guess this is related to switching off unified memory. Does it crash if you do not set the env variable?

amiltonwong commented 3 years ago

@lzhengning , you're right. If I don't switch off unified memory, it runs smoothly without error. Perhaps the option to switch off unified memory does not always function properly.