ERROR train.py: Default process group is not initialized

loretoparisi commented 4 years ago

I get this error when training on a single GPU, at the point where the function distributed() is called to disable tqdm.

To avoid this, I simply wrapped distributed() like this:

def distributed():
    try:
        return dist.is_available() and dist.is_initialized()
    except Exception:
        return False

loretoparisi commented 4 years ago

[UPDATE] Even with the fix above in distributed(), I still get the same error, because after evaluation torch.distributed is called again in:

def _all_reduce_dict(d, device):
    # wrap in tensor and use reduce to gpu0 tensor
    output_d = {}
    for (key, value) in sorted(d.items()):
        tensor_input = torch.tensor([[value]]).to(device)
        torch.distributed.all_reduce(tensor_input)
        output_d[key] = tensor_input.item()
    return output_d

Here torch.distributed.all_reduce(tensor_input) fails, so I changed it to:

def _all_reduce_dict(d, device):
    # wrap in tensor and use reduce to gpu0 tensor
    output_d = {}
    for (key, value) in sorted(d.items()):
        tensor_input = torch.tensor([[value]]).to(device)
        if distributed():
            torch.distributed.all_reduce(tensor_input)
        output_d[key] = tensor_input.item()
    return output_d
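
For reference, the two changes above can be combined into one self-contained sketch. This is only an illustration under a few assumptions, not the repository's code: the name all_reduce_dict, the default device="cpu", and the early return in the single-process case are choices made here. The early return is a small variation on the snippet above, which still routes each value through a tensor even when nothing is reduced.

```python
def distributed():
    """Return True only when a default process group has been initialized."""
    try:
        import torch.distributed as dist
    except ImportError:
        # Assumption for this sketch: without PyTorch there is no distributed run.
        return False
    return dist.is_available() and dist.is_initialized()


def all_reduce_dict(d, device="cpu"):
    """Sum each value across ranks; return the values unchanged otherwise."""
    if not distributed():
        # Single process: nothing to reduce, so skip the tensor round trip.
        return {key: d[key] for key in sorted(d)}

    import torch
    import torch.distributed as dist

    output_d = {}
    for key, value in sorted(d.items()):
        tensor_input = torch.tensor([[value]]).to(device)
        dist.all_reduce(tensor_input)  # the default reduction op is SUM
        output_d[key] = tensor_input.item()
    return output_d
```

In a single-process run, all_reduce_dict({"loss": 0.5}) simply returns {"loss": 0.5}. With the NCCL backend, device would need to be a CUDA device.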

jongwook commented 4 years ago

Can you paste the output of the following bash command, so we can check your system information?

curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python -

I suspect two possibilities:

Your NCCL installation is incomplete: try this to (re)install it
You're on Windows: we don't plan to support Windows at this point.

loretoparisi commented 4 years ago

@jongwook thank you very much! I ran it with both python and python3:

ubuntu@deepblue:~$ curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12461  100 12461    0     0  27635      0 --:--:-- --:--:-- --:--:-- 27691
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: Could not collect

Python version: 2.7
Is CUDA available: N/A
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080

Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5

Versions of relevant libraries:
[pip] numpy==1.15.4
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.3                      199  
[conda] mkl-service               1.1.2            py37he904b0f_5  
[conda] mkl_fft                   1.0.10           py37ha843d7b_0  
[conda] mkl_random                1.0.2            py37hd81dba3_0
ubuntu@deepblue:~$ curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python3 -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12461  100 12461    0     0  56586      0 --:--:-- --:--:-- --:--:-- 56384
Collecting environment information...
PyTorch version: 1.3.1
Is debug build: No
CUDA used to build PyTorch: 10.1.243

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: Could not collect

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080

Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5

Versions of relevant libraries:
[pip] numpy==1.15.4
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.3                      199  
[conda] mkl-service               1.1.2            py37he904b0f_5  
[conda] mkl_fft                   1.0.10           py37ha843d7b_0  
[conda] mkl_random                1.0.2            py37hd81dba3_0

jongwook commented 4 years ago

I still suspect the NCCL install is the culprit; I realized that the script doesn't check the NCCL version, which can be checked with torch.cuda.nccl.version().
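
A rough sketch of such a check follows. The helper name nccl_report and its dictionary keys are made up for this example; the PyTorch calls it relies on are torch.distributed.is_available(), torch.distributed.is_nccl_available(), and torch.cuda.nccl.version(). Every lookup is guarded, on the assumption that a build without NCCL may raise rather than return a value.

```python
def nccl_report():
    """Collect NCCL-related facts without raising if something is missing."""
    report = {
        "torch": None,
        "distributed_available": False,
        "nccl_backend_available": False,
        "nccl_version": None,
    }
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not importable from this interpreter.
        return report

    report["torch"] = torch.__version__
    report["distributed_available"] = bool(dist.is_available())
    if report["distributed_available"]:
        report["nccl_backend_available"] = bool(dist.is_nccl_available())

    try:
        from torch.cuda import nccl
        # Assumption: this may fail on builds compiled without NCCL.
        report["nccl_version"] = nccl.version()
    except Exception:
        pass
    return report
```

Printing the result with the same interpreter that runs train.py (python3 in the output above) shows whether the NCCL backend is usable there.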

Can you run the experiments fine with your proposed changes? I could incorporate them into this repo at some point, but this repo (like most other OpenAI repos) is in archive status and updates are not our priority.

loretoparisi commented 4 years ago

@jongwook yes, I can make it work with the changes I have made so far. I will investigate NCCL further. Thanks.

openai / gpt-2-output-dataset
