Closed dzorlu closed 3 years ago
Thanks for all the info. Could you run accelerate test and paste the output here?
Thanks for the fast response
Running: accelerate-launch /usr/local/lib/python3.6/dist-packages/accelerate/test_utils/test_script.py --config_file=None
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: NO
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda
stdout: Use FP16 precision: False
stdout:
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
Test is a success! You are ready for your distributed training!
Ok, so it looks like your config is not recognized (it doesn't launch 2 processes). So the problem is here, not in your training script.
Are you sure it's the one in ~/.cache/huggingface/accelerate/default_config.yaml? You don't have some environment variable that changes the cache directory in any way?
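For comparison, a config file at that path that launches two processes would look roughly like the sketch below (a sketch only; the exact field names depend on the accelerate version):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

The key fields here are distributed_type: MULTI_GPU and num_processes: 2; with a single-process config (or with the config not found at all), accelerate falls back to one process on one device, which matches the test output above.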
This seems to be a false alarm, the process now sees both GPUs. Thank you for the quick turnaround. Can't wait to use the library more. Deniz
Closing the issue then, but feel free to reopen if you get the problem again!
Hi- Thanks for the great library, Sylvain!
The config file looks as follows:
The relevant part of the code is as follows:
The script utilizes a single GPU, though there are 2 GPUs.
Launching the script in the command line:
The print statement
print(accelerator.device)
returns the following (happy to add more debugging). Any help is appreciated. Thank you!
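For reference, a minimal debugging sketch along these lines (not the script from this issue) uses the Accelerator attributes device, num_processes, and process_index to confirm how many processes were actually launched:

from accelerate import Accelerator

accelerator = Accelerator()

# When the multi-GPU config is picked up and the script is started with
# `accelerate launch`, each process prints its own device (cuda:0, cuda:1, ...)
# and num_processes reports 2; if the config is ignored, a single process
# prints cuda and num_processes stays at 1.
print(accelerator.device)
print(accelerator.num_processes)
print(accelerator.process_index)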