p-lambda / jukemir

Perform transfer learning for MIR using Jukebox!
MIT License

AssertionError: Found no NVIDIA driver on your system #9

Closed borishanzju closed 1 year ago

borishanzju commented 1 year ago

When I run the command docker run -it --rm -v xxx/video_trim_audio/:/input -v /xxx/jukemir/wav_jukebox/:/output 393fa1440720

I get this error:

    Traceback (most recent call last):
      File "main.py", line 151, in <module>
        rank, local_rank, device = setup_dist_from_mpi()
      File "/code/jukebox/jukebox/utils/dist_utils.py", line 46, in setup_dist_from_mpi
        return _setup_dist_from_mpi(master_addr, backend, port, n_attempts, verbose)
      File "/code/jukebox/jukebox/utils/dist_utils.py", line 93, in _setup_dist_from_mpi
        torch.cuda.set_device(local_rank)
      File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 292, in set_device
        torch._C._cuda_setDevice(device)
      File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 196, in _lazy_init
        _check_driver()
      File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 101, in _check_driver
        http://www.nvidia.com/Download/index.aspx""")
    AssertionError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

rodrigo-castellon commented 1 year ago

Hi, I am not sure which docker container you are running there (what is 393fa1440720?)

It appears to me that whatever Docker container you are running has some fundamental NVIDIA driver issues that need to be resolved (this appears to be common enough if you search for it on Google).

Also, as a heads up, Jukebox's multi-GPU code is not strictly necessary here, so if it keeps causing you issues down the line, you could probably remove the rank, local_rank, device = setup_dist_from_mpi() line (shown in your traceback) from main.py and set device to "cuda" instead.
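Before editing any code, it may help to confirm whether the container can see an NVIDIA driver at all. The following is a minimal sketch, not part of jukemir; the helper name nvidia_driver_visible is hypothetical, and it assumes that nvidia-smi is on the PATH whenever the GPU is exposed to the container, which may not hold for every setup.

```python
import shutil
import subprocess


def nvidia_driver_visible(timeout_s: float = 10.0) -> bool:
    """Return True if `nvidia-smi -L` can be found and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not on PATH, so the driver is probably not exposed here.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"],  # "-L" lists the GPUs that the driver can see
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("NVIDIA driver visible:", nvidia_driver_visible())
```

If this prints False inside the container but True on the host, the problem is more likely in how the container was started than in the Jukebox code.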

Hope this helps!

borishanzju commented 1 year ago

I am using the image from docker pull jukemir/representations_jukebox.

borishanzju commented 1 year ago

How can I modify main.py? The container runs automatically and I cannot get a shell inside it.

rodrigo-castellon commented 1 year ago

I see, that is odd. This has not happened to me with the official Docker image on a machine with a GPU. Worst-case scenario, you can replace the device variable with "cpu"; I have not tested it, but it should at least run on the CPU.

You can modify main.py by adding the --entrypoint bash flag to the docker run command. Then, it will give you a shell and you will be able to modify whatever files you want. You may have to apt install nano or vim to do this. Notice that the original entrypoint is python main.py, so after making your changes you can run python main.py inside of that shell.
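As a rough illustration of the command described above, this sketch assembles a docker run invocation with --entrypoint bash. The function name and the example paths are hypothetical. The optional --gpus argument is an assumption that goes beyond this thread: it is a standard Docker flag (Docker 19.03+ with the NVIDIA Container Toolkit on the host), but whether it resolves the error reported here has not been established.

```python
import shlex
from typing import List, Optional


def docker_shell_command(image: str, input_dir: str, output_dir: str,
                         gpus: Optional[str] = None) -> List[str]:
    """Build a `docker run` argv that opens bash instead of the default entrypoint."""
    cmd = [
        "docker", "run", "-it", "--rm",
        "-v", f"{input_dir}:/input",    # mount points used in the original command
        "-v", f"{output_dir}:/output",
        "--entrypoint", "bash",         # overrides the image's `python main.py`
    ]
    if gpus is not None:
        # Assumption: the host has the NVIDIA Container Toolkit installed.
        cmd += ["--gpus", gpus]
    cmd.append(image)                   # the image name must come after all options
    return cmd


# Hypothetical host paths; substitute your own.
argv = docker_shell_command("jukemir/representations_jukebox",
                            "/path/to/input", "/path/to/output")
print(" ".join(shlex.quote(part) for part in argv))
```

The printed line can be pasted into a terminal, or the list can be passed to subprocess.run directly.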

If you want to make your changes permanent, you could automate those changes within a new Dockerfile that reads something like:

FROM jukemir/representations_jukebox

RUN # patch main.py command here

Then run docker build -t juke_modified . and, in your docker run command, replace the image name with juke_modified.
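One possible way to script the main.py change, so that it can be repeated inside a Dockerfile RUN step, is a small text patch. This is only a sketch: patch_device is a hypothetical helper, and it assumes main.py contains the assignment exactly as quoted in the traceback.

```python
import re

# Matches the assignment shown in the traceback, with any leading indentation.
DIST_SETUP = re.compile(
    r"^(?P<indent>[ \t]*)rank,\s*local_rank,\s*device\s*=\s*"
    r"setup_dist_from_mpi\(\)[ \t]*$",
    re.MULTILINE,
)


def patch_device(source: str, device: str = "cuda") -> str:
    """Replace the setup_dist_from_mpi() assignment with a fixed device string."""
    def replacement(match):
        # Keep the original indentation so the patched file stays valid Python.
        return '{}device = "{}"'.format(match.group("indent"), device)

    patched, count = DIST_SETUP.subn(replacement, source)
    if count == 0:
        raise ValueError("setup_dist_from_mpi() assignment not found")
    return patched


example = "rank, local_rank, device = setup_dist_from_mpi()\nprint(device)\n"
print(patch_device(example, "cpu"))
```

To apply it to the real file, read /code/main.py into a string, pass it through patch_device, and write the result back. Raising on a missing match keeps the build from silently succeeding if the upstream file changes.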

borishanzju commented 1 year ago

When I modified main.py according to your comment, I got the same error:

    root@9284416e805e:/code# python main.py
      0%|          | 0/2858 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "main.py", line 165, in <module>
        setup_hparams(vqvae, dict(sample_length=1048576)), "cpu"
      File "/code/jukebox/jukebox/make_models.py", line 92, in make_vqvae
        **block_kwargs)
      File "/code/jukebox/jukebox/vqvae/vqvae.py", line 79, in __init__
        self.bottleneck = Bottleneck(l_bins, emb_width, mu, levels)
      File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 189, in __init__
        self.level_blocks.append(level_block(level))
      File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 186, in <lambda>
        level_block = lambda level: BottleneckBlock(l_bins, emb_width, mu)
      File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 13, in __init__
        self.reset_k()
      File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 20, in reset_k
        self.register_buffer('k', t.zeros(self.k_bins, self.emb_width).cuda())
      File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 196, in _lazy_init
        _check_driver()
      File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 101, in _check_driver
        http://www.nvidia.com/Download/index.aspx""")
    AssertionError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

rodrigo-castellon commented 1 year ago

I see. You could try getting rid of the .cuda() in the file /code/jukebox/jukebox/vqvae/bottleneck.py as well. I believe there are a couple of other places in the Jukebox code that explicitly assume CUDA is available.
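To locate the other places that assume CUDA, a simple text search over the source tree may be enough. The sketch below uses a hypothetical helper, find_cuda_calls, and assumes the Jukebox sources live under /code/jukebox as the tracebacks suggest. It only reports lines containing ".cuda(" and changes nothing.

```python
import os
from typing import Iterator, Tuple


def find_cuda_calls(root: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line number, stripped line) for each line containing '.cuda('."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if ".cuda(" in line:
                        yield path, lineno, line.strip()


# Path taken from the tracebacks; prints nothing if the directory does not exist.
for path, lineno, text in find_cuda_calls("/code/jukebox"):
    print("{}:{}: {}".format(path, lineno, text))
```

Each reported line then needs a manual decision. For the one in the traceback, dropping the trailing .cuda() leaves the tensor on the CPU.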

borishanzju commented 1 year ago

But now I get a new error:

    Traceback (most recent call last):
      File "main.py", line 165, in <module>
        setup_hparams(vqvae, dict(sample_length=1048576)), "cpu"
      File "/code/jukebox/jukebox/make_models.py", line 95, in make_vqvae
        restore_model(hps, vqvae, hps.restore_vqvae)
      File "/code/jukebox/jukebox/make_models.py", line 55, in restore_model
        checkpoint = load_checkpoint(checkpoint_path)
      File "/code/jukebox/jukebox/make_models.py", line 29, in load_checkpoint
        if dist.get_rank() % 8 == 0:
      File "/code/jukebox/jukebox/utils/dist_adapter.py", line 23, in get_rank
        return _get_rank()
      File "/code/jukebox/jukebox/utils/dist_adapter.py", line 65, in _get_rank
        return dist.get_rank()
      File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 564, in get_rank
        _check_default_pg()
      File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 193, in _check_default_pg
        "Default process group is not initialized"
    AssertionError: Default process group is not initialized

rodrigo-castellon commented 1 year ago

Yeah, this is one of Jukebox's multi-node/GPU things that you can probably also get rid of, in the file /code/jukebox/jukebox/make_models.py.
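The traceback points at the dist.get_rank() call in load_checkpoint, which fails because no process group was initialized. Instead of deleting the check outright, one option is to fall back to rank 0 when running as a single process. This is a sketch under assumptions: safe_rank is a hypothetical helper, and it expects to be given a module such as torch.distributed that provides is_available(), is_initialized(), and get_rank().

```python
def safe_rank(dist_module) -> int:
    """Return the process rank, or 0 when no process group has been set up."""
    if dist_module.is_available() and dist_module.is_initialized():
        return dist_module.get_rank()
    # Single-process run: behave like the first (and only) worker.
    return 0
```

With torch installed, the failing condition could then be written as safe_rank(torch.distributed) % 8 == 0, which is true in a single-process run, so the checkpoint branch still executes.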

borishanzju commented 1 year ago

get rid of what?

borishanzju commented 1 year ago

I solved the problem, but I do not have the checkpoint. Can you share it?

rodrigo-castellon commented 1 year ago

The checkpoint should download automatically when you run that code (see https://github.com/openai/jukebox/blob/08efbbc1d4ed1a3cef96e08a931944c8b4d63bb3/jukebox/make_models.py#L34).
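For readers unfamiliar with the pattern, an automatic download usually means "fetch the file once into a local cache, then reuse it". The sketch below illustrates that general idea only. It is not Jukebox's implementation; fetch_once is a hypothetical helper, and the URL and cache path must be supplied by the caller.

```python
import os
import shutil
import urllib.request


def fetch_once(url: str, local_path: str) -> str:
    """Download url to local_path unless the file already exists; return the path."""
    if os.path.exists(local_path):
        return local_path  # already cached, nothing to do
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    tmp_path = local_path + ".part"
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete checkpoint.
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp_path, local_path)
    return local_path
```

If a patched copy of the code skips the branch that triggers the download, the checkpoint will never be fetched, so it is worth checking that the download branch still runs after the distributed-rank check has been edited.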