Hi @butterluo, I found nothing wrong in the messages you posted. As far as I know, the retrieval eval is indeed slow. You can debug with a small subset of the test set (e.g., 50 pairs), or print something inside _run_on_single_gpu() to see whether the program is still running.
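As a hedged sketch of the first suggestion (the dataset and loader objects below are placeholders, not the repo's actual ones): shrink the eval set to about 50 samples before building the dataloader so eval_epoch() returns quickly, and print a per-batch heartbeat so a true hang is distinguishable from a slow eval.

```python
# Minimal sketch, assuming a small torch Dataset stands in for the real test set.
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

full_test_dataset = TensorDataset(torch.randn(1000, 8))    # placeholder for the real test set
small_test = Subset(full_test_dataset, list(range(50)))    # keep only the first 50 pairs
loader = DataLoader(small_test, batch_size=16, shuffle=False, num_workers=0)

for step, (batch,) in enumerate(loader):
    # flush=True keeps the heartbeat visible even if stdout is buffered
    print(f"eval step {step}, batch shape {tuple(batch.shape)}", flush=True)
```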
I tried last night, but nothing printed out, and the log was still hanging on the last line 'NCCL INFO comm 0x7f7eb8003010 rank 1 nranks 2 cudaDev 1 busId b9000 - Init COMPLETE' for the whole night.
How long does the eval_epoch() function take to run in your experience?
Different test datasets have different time costs; on average, less than half an hour.
I have no idea about your problem right now. One option is to comment out lines 406-441 and only call _run_on_single_gpu() temporarily.
I also want to make sure that your modules = nn.parallel.replicate(model, device_ids) has valid device_ids and correspondingly valid GPUs. Thanks.
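A minimal, hedged way to verify that before the replicate call (the device_ids value and the nn.Linear model are placeholders; only standard torch.cuda and torch.nn.parallel calls are used):

```python
# Sanity check: every entry in device_ids must refer to a GPU visible to this process.
import torch
import torch.nn as nn

device_ids = [0, 1]                                    # example value; indices are post-CUDA_VISIBLE_DEVICES
assert torch.cuda.is_available(), "CUDA is not available in this process"
assert max(device_ids) < torch.cuda.device_count(), "device_ids exceed the visible GPU count"

for i in device_ids:
    print(i, torch.cuda.get_device_name(i))            # confirm each GPU is reachable

model = nn.Linear(8, 8).cuda(device_ids[0])            # placeholder model on the first device
modules = nn.parallel.replicate(model, device_ids)     # should return one replica per device
print("replicated onto", len(modules), "devices")
```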
When I was using more than one GPU, I set 'export CUDA_VISIBLE_DEVICES=3,4', and the device_ids passed into 'nn.parallel.replicate(model, device_ids)' was '[0,1]'. Is there anything wrong with that?
But when I ran it with 1 GPU, everything was OK.
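With CUDA_VISIBLE_DEVICES=3,4 a process normally sees exactly two devices renumbered as 0 and 1, so [0,1] is what that remapping produces. A short, hedged snippet to confirm the remapping on the machine in question (the variable must be set before the first CUDA call):

```python
# Check the CUDA_VISIBLE_DEVICES remapping: physical GPUs 3 and 4 should show up
# to this process as logical devices 0 and 1.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"             # set before torch initializes CUDA

import torch
print("visible device count:", torch.cuda.device_count())     # expect 2
for i in range(torch.cuda.device_count()):
    print("logical", i, "->", torch.cuda.get_device_name(i))
```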
I have no idea about this bug at the moment. I tested on P40, P100, and V100, and all of them work well. Can you tell me your GPU model and PyTorch version?
Python version: 3.7 (64-bit runtime)
GPU: Tesla V100-PCIE-32GB
CUDA runtime version: 10.0.130
torch==1.8.1
Feel free to reopen if there is any progress on this issue.
I ran main_task_retrieval.py as the README said. After an epoch finished and eval_epoch() in main_task_retrieval.py started running, the program invoked parallel_apply() inside eval_epoch() and hung at the line 'modules = nn.parallel.replicate(model, device_ids)' in the parallel_apply() function in util.py.
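For context, a hedged sketch of the replicate/scatter/apply/gather pattern such a parallel_apply() helper typically follows (not the repo's exact util.py code; the name and signature below are assumptions):

```python
# Sketch of a typical data-parallel apply helper built from torch.nn.parallel primitives.
import torch
import torch.nn as nn

def parallel_apply_sketch(model, inputs, device_ids):
    replicas = nn.parallel.replicate(model, device_ids)       # the line reported to hang
    scattered = nn.parallel.scatter(inputs, device_ids)       # split the batch across GPUs
    outputs = nn.parallel.parallel_apply(replicas[:len(scattered)], scattered)
    return nn.parallel.gather(outputs, device_ids[0])         # collect results on the first GPU
```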
At that moment, with NCCL debugging turned on via 'export NCCL_DEBUG=INFO', the messages shown below were printed:
210109:155288 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
210109:155287 [0] NCCL INFO Channel 00/02 : 0 1
210109:155288 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
210109:155287 [0] NCCL INFO Channel 01/02 : 0 1
210109:155288 [1] NCCL INFO Setting affinity for GPU 5 to ffff,f00000ff,fff00000
210109:155287 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
210109:155287 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
210109:155287 [0] NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
210109:155287 [0] NCCL INFO Channel 00 : 0[66000] -> 1[b9000] via direct shared memory
210109:155288 [1] NCCL INFO Channel 00 : 1[b9000] -> 0[66000] via direct shared memory
210109:155287 [0] NCCL INFO Channel 01 : 0[66000] -> 1[b9000] via direct shared memory
210109:155288 [1] NCCL INFO Channel 01 : 1[b9000] -> 0[66000] via direct shared memory
210109:155287 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
210109:155287 [0] NCCL INFO comm 0x7f7e14003240 rank 0 nranks 2 cudaDev 0 busId 66000 - Init COMPLETE
210109:155288 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
210109:155288 [1] NCCL INFO comm 0x7f7eb8003010 rank 1 nranks 2 cudaDev 1 busId b9000 - Init COMPLETE