borishanzju closed 1 year ago
Hi,
I am not sure which Docker container you are running there (what is 393fa1440720?).
It appears to me that whatever Docker container you are running has some fundamental NVIDIA driver issues that need to be resolved (this appears to be common enough if you search for it on Google).
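As a rough way to check whether the container can see the driver at all, something like the sketch below could be run inside it. It assumes nvidia-smi would be on the PATH when the NVIDIA runtime is set up correctly; the helper name check_nvidia_smi is made up for illustration and is not part of Jukebox or jukemir.

```python
import shutil
import subprocess


def check_nvidia_smi(timeout_s=10.0):
    """Return (found, output) describing whether nvidia-smi is usable.

    Hypothetical helper for illustration only.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False, "nvidia-smi not found on PATH"
    try:
        # `nvidia-smi -L` lists the GPUs that the driver can see.
        proc = subprocess.run(
            [exe, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    return proc.returncode == 0, proc.stdout.strip()


if __name__ == "__main__":
    found, output = check_nvidia_smi()
    print("driver visible:", found)
    print(output)
```

If this reports that the driver is not visible, the problem is most likely in how the container is started or in the host driver setup rather than in main.py.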
Also, as a heads up, Jukebox's multi-GPU code is not strictly necessary, so if it keeps causing you issues down the line, you could probably get away with removing the rank, local_rank, device = setup_dist_from_mpi() line (the one shown in your traceback) from main.py and replacing device with "cuda".
Hope this helps!
I am using the image from docker pull jukemir/representations_jukebox.
How can I modify main.py? The container runs automatically, and I cannot get a shell inside it.
I see, that is odd. Worst-case scenario, you can just replace the device variable with "cpu", and it should work (though I have not tested it), at least on the CPU. That said, this has not happened to me with the official Docker image on a machine with a GPU.
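A minimal sketch of that fallback, assuming you want the script to use the GPU when PyTorch can see one and the CPU otherwise. The function name pick_device is hypothetical; it is not something main.py defines.

```python
def pick_device(preferred="cuda"):
    """Return `preferred` if it is usable, otherwise fall back to "cpu".

    Hypothetical helper for illustration only.
    """
    try:
        import torch
    except ImportError:
        # torch ships in the image; this branch only keeps the sketch
        # importable on machines without it.
        return "cpu"
    if preferred.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return preferred


# Would stand in for the hard-coded "cuda" / "cpu" string in main.py.
device = pick_device()
print("using device:", device)
```

Keep in mind that other parts of Jukebox may still assume CUDA, so picking "cpu" here is not guaranteed to be enough on its own.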
You can modify main.py by adding the --entrypoint bash flag to the docker run command. That gives you a shell in which you can edit whatever files you want. You may have to apt install nano or vim to do this. Note that the original entrypoint is python main.py, so after making your changes you can run python main.py inside that shell.
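For reference, the full command could be assembled roughly as follows. This is only a sketch: the host paths are placeholders, the /input and /output mount points are taken from the command in the original report, and build_shell_command is a made-up helper name.

```python
import shlex


def build_shell_command(image, input_dir, output_dir):
    """Build a `docker run` argument list that opens bash instead of main.py.

    Hypothetical helper for illustration only.
    """
    return [
        "docker", "run", "-it", "--rm",
        "--entrypoint", "bash",        # override the image's default entrypoint
        "-v", "{}:/input".format(input_dir),
        "-v", "{}:/output".format(output_dir),
        image,
    ]


cmd = build_shell_command(
    "jukemir/representations_jukebox",
    "/path/to/input",    # placeholder host directory
    "/path/to/output",   # placeholder host directory
)
# Print a copy-pasteable command line.
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a terminal, or the list can be passed to subprocess.run if you prefer to launch it from Python.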
If you want to make your changes permanent, you could automate those changes within a new Dockerfile that reads something like:
FROM jukemir/representations_jukebox
RUN # patch main.py command here
and then run docker build -t juke_modified . and, when you run the docker run command, replace the name of the image with juke_modified.
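One possible shape for the patch step is a small Python script that the RUN line invokes. The sketch below is an assumption about what that could look like, not something shipped with jukemir. It rewrites the setup_dist_from_mpi() line in place and, as an extra precaution of my own, keeps rank and local_rank defined as 0 in case later code reads them. The names patch_source and patch_file are hypothetical.

```python
import re
from pathlib import Path

# Matches the line shown in the traceback, with any leading indentation.
_PATTERN = re.compile(
    r"^([ \t]*)rank, local_rank, device = setup_dist_from_mpi\(\)[ \t]*$",
    re.MULTILINE,
)


def patch_source(source, device="cuda"):
    """Return `source` with the MPI setup line replaced by fixed values."""
    def _replacement(match):
        indent = match.group(1)
        # Assumption: a single process, so rank and local_rank are both 0.
        return '{}rank, local_rank, device = 0, 0, "{}"'.format(indent, device)

    patched, count = _PATTERN.subn(_replacement, source)
    if count == 0:
        raise ValueError("setup_dist_from_mpi() line not found; nothing patched")
    return patched


def patch_file(path, device="cuda"):
    """Patch the file at `path` in place."""
    target = Path(path)
    target.write_text(patch_source(target.read_text(), device))
```

Inside the image, main.py lives under /code (judging by the shell prompt in this thread), so the RUN step would end up calling patch_file("/code/main.py").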
When I modified main.py according to your comment, I got the same error:
root@9284416e805e:/code# python main.py
0%| | 0/2858 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 165, in
I see. You could try getting rid of the .cuda() call in the file /code/jukebox/jukebox/vqvae/bottleneck.py as well. I believe there are a couple of other places in the Jukebox code that explicitly assume CUDA is available.
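To find those other places, a quick scan along these lines might help. It is just a plain text search for the literal string .cuda() in Python files; find_cuda_calls is a made-up name, and the search will not catch other CUDA assumptions such as device="cuda" arguments.

```python
from pathlib import Path


def find_cuda_calls(root, needle=".cuda()"):
    """Return (path, line number, line text) for each line containing `needle`.

    Hypothetical helper for illustration only.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # skip unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Running find_cuda_calls("/code/jukebox/jukebox") inside the container and printing the results should list the candidate lines to edit.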
But now I get a new error:
Traceback (most recent call last):
File "main.py", line 165, in
Yeah, this is one of Jukebox's multi-node/GPU pieces that you can probably also remove, in the file /code/jukebox/jukebox/make_models.py.
Get rid of what?
I solved the problem, but I do not have the checkpoint. Can you share it?
The checkpoint should download automatically when you run that code (see https://github.com/openai/jukebox/blob/08efbbc1d4ed1a3cef96e08a931944c8b4d63bb3/jukebox/make_models.py#L34).
When I run the command docker run -it --rm -v xxx/video_trim_audio/:/input -v /xxx/jukemir/wav_jukebox/:/output 393fa1440720
I get the following error:
Traceback (most recent call last):
  File "main.py", line 151, in
    rank, local_rank, device = setup_dist_from_mpi()
  File "/code/jukebox/jukebox/utils/dist_utils.py", line 46, in setup_dist_from_mpi
    return _setup_dist_from_mpi(master_addr, backend, port, n_attempts, verbose)
  File "/code/jukebox/jukebox/utils/dist_utils.py", line 93, in _setup_dist_from_mpi
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 292, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 196, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 101, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx