Hi @butterluo, I found nothing wrong in the messages you posted. As far as I know, the retrieval eval is indeed slow. You can debug with a small subset of the test set (e.g., 50 pairs), or print something inside _run_on_single_gpu() to see whether the program is still running.
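As a hedged sketch of the first suggestion (the dataset and loader objects below are placeholders, not the repo's actual ones): shrink the eval set to about 50 samples before building the dataloader so eval_epoch() returns quickly, and print a per-batch heartbeat so a true hang is distinguishable from a slow eval.

```python
# Minimal sketch, assuming a small torch Dataset stands in for the real test set.
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

full_test_dataset = TensorDataset(torch.randn(1000, 8))    # placeholder for the real test set
small_test = Subset(full_test_dataset, list(range(50)))    # keep only the first 50 pairs
loader = DataLoader(small_test, batch_size=16, shuffle=False, num_workers=0)

for step, (batch,) in enumerate(loader):
    # flush=True keeps the heartbeat visible even if stdout is buffered
    print(f"eval step {step}, batch shape {tuple(batch.shape)}", flush=True)
```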
I tried last night, but nothing printed out, and the log was still hanging on the last line 'NCCL INFO comm 0x7f7eb8003010 rank 1 nranks 2 cudaDev 1 busId b9000 - Init COMPLETE' for the whole night.
How long does the eval_epoch() function take to run in your experience?
Different test datasets have different time costs; on average, less than half an hour.
I have no idea about your problem right now. One option is to comment out lines 406-441 and only call _run_on_single_gpu() temporarily.
I also want to make sure that your modules = nn.parallel.replicate(model, device_ids) has valid device_ids and correspondingly valid GPUs. Thanks.
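A minimal, hedged way to verify that before the replicate call (the device_ids value and the nn.Linear model are placeholders; only standard torch.cuda and torch.nn.parallel calls are used):

```python
# Sanity check: every entry in device_ids must refer to a GPU visible to this process.
import torch
import torch.nn as nn

device_ids = [0, 1]                                    # example value; indices are post-CUDA_VISIBLE_DEVICES
assert torch.cuda.is_available(), "CUDA is not available in this process"
assert max(device_ids) < torch.cuda.device_count(), "device_ids exceed the visible GPU count"

for i in device_ids:
    print(i, torch.cuda.get_device_name(i))            # confirm each GPU is reachable

model = nn.Linear(8, 8).cuda(device_ids[0])            # placeholder model on the first device
modules = nn.parallel.replicate(model, device_ids)     # should return one replica per device
print("replicated onto", len(modules), "devices")
```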
When I was using more than one GPU, I set 'export CUDA_VISIBLE_DEVICES=3,4', and the device_ids passed into 'nn.parallel.replicate(model, device_ids)' was '[0,1]'. Is there anything wrong with that?
But when I ran it with 1 GPU, everything was OK.
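With CUDA_VISIBLE_DEVICES=3,4 a process normally sees exactly two devices renumbered as 0 and 1, so [0,1] is what that remapping produces. A short, hedged snippet to confirm the remapping on the machine in question (the variable must be set before the first CUDA call):

```python
# Check the CUDA_VISIBLE_DEVICES remapping: physical GPUs 3 and 4 should show up
# to this process as logical devices 0 and 1.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"             # set before torch initializes CUDA

import torch
print("visible device count:", torch.cuda.device_count())     # expect 2
for i in range(torch.cuda.device_count()):
    print("logical", i, "->", torch.cuda.get_device_name(i))
```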
I have no idea about this bug at the moment. I tested on P40, P100, and V100, and all of them work well. Can you tell me your GPU model and PyTorch version?
Python version: 3.7 (64-bit runtime)
GPU: Tesla V100-PCIE-32GB
CUDA runtime version: 10.0.130
torch==1.8.1
Feel free to reopen if there is any progress on this issue.
I ran main_task_retrieval.py as the README said. After an epoch finished and eval_epoch() in main_task_retrieval.py started running, the program invoked parallel_apply() inside eval_epoch() and hung at the line 'modules = nn.parallel.replicate(model, device_ids)' in the parallel_apply() function in util.py.
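For context, a hedged sketch of the replicate/scatter/apply/gather pattern such a parallel_apply() helper typically follows (not the repo's exact util.py code; the name and signature below are assumptions):

```python
# Sketch of a typical data-parallel apply helper built from torch.nn.parallel primitives.
import torch
import torch.nn as nn

def parallel_apply_sketch(model, inputs, device_ids):
    replicas = nn.parallel.replicate(model, device_ids)       # the line reported to hang
    scattered = nn.parallel.scatter(inputs, device_ids)       # split the batch across GPUs
    outputs = nn.parallel.parallel_apply(replicas[:len(scattered)], scattered)
    return nn.parallel.gather(outputs, device_ids[0])         # collect results on the first GPU
```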
At that moment, with NCCL debugging turned on via 'export NCCL_DEBUG=INFO', the messages shown below were printed:
210109:155288 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
210109:155287 [0] NCCL INFO Channel 00/02 : 0 1
210109:155288 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
210109:155287 [0] NCCL INFO Channel 01/02 : 0 1
210109:155288 [1] NCCL INFO Setting affinity for GPU 5 to ffff,f00000ff,fff00000
210109:155287 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
210109:155287 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
210109:155287 [0] NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
210109:155287 [0] NCCL INFO Channel 00 : 0[66000] -> 1[b9000] via direct shared memory
210109:155288 [1] NCCL INFO Channel 00 : 1[b9000] -> 0[66000] via direct shared memory
210109:155287 [0] NCCL INFO Channel 01 : 0[66000] -> 1[b9000] via direct shared memory
210109:155288 [1] NCCL INFO Channel 01 : 1[b9000] -> 0[66000] via direct shared memory
210109:155287 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
210109:155287 [0] NCCL INFO comm 0x7f7e14003240 rank 0 nranks 2 cudaDev 0 busId 66000 - Init COMPLETE
210109:155288 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
210109:155288 [1] NCCL INFO comm 0x7f7eb8003010 rank 1 nranks 2 cudaDev 1 busId b9000 - Init COMPLETE