maraoz opened 4 years ago
given the pain of getting the correct version of nvidia drivers on a system all lined up, i wonder if a docker image of this repo would help. (or alternatively, run this project inside the tf docker image)
https://www.tensorflow.org/install/docker
setting docker up for gpu access requires some extra steps (see step 2 in the link above), but was pretty straightforward.
I'm having the same issue. I agree a docker image would be great. Does anyone know if NCCL is a dependency?
@carchrae I think running a docker image on an EC2 instance, if you don't have CUDA on a Mac, is the way to go right?
@diffractometer - sounds right.
if you do have a cuda supported nvidia card (but not the cuda libs installed) you could probably still run the docker + gpu extensions locally.
otherwise, if you deploy one of these amis it sounds like you get a gpu-enabled docker pre-installed. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html
@carchrae awesome, I'll check that out. My other friend said I should just concentrate on getting it running locally in a jupyter notebook
i suspect you'd hit the same error in jupyter as it seems the code requires cuda/nccl
https://github.com/openai/jukebox/blob/master/jukebox/utils/dist_utils.py#L42
i guess this is really a documentation bug (a common one): the code requires an nvidia gpu. looking at the other issues getting reported, i think you also need a gpu with a lot of ram. i am yet to try it on my card (it has only 6gb ram)
Ah, dang. That makes sense, especially given the results. I'll keep poking...
hmm - or maybe not? and there is a cpu only flag. (i bet it'll be damn slow tho!)
ah, yeah I saw that line and tried to change it, but every time I ran it the error still bubbled up: `Distributed package doesn't have NCCL built in`
did you get any more useful error from this output?
print(f"Caught error during NCCL init (attempt {attempt_idx} of {n_attempts}): {e}")
also, love the comment on the next line
sleep(1 + (0.01 * mpi_rank)) # Sleep to avoid thundering herd
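That retry-with-staggered-sleep pattern is worth a closer look. Here is a minimal sketch of the idea (the function and parameter names are illustrative, not jukebox's actual API), with a short base sleep so it runs quickly:

```python
import time

def init_with_retries(init_fn, n_attempts=5, rank=0, base_sleep=0.01):
    """Retry a flaky initialization. Higher ranks sleep slightly longer
    before retrying, so workers don't all hammer the master at the same
    moment (the "thundering herd" the comment refers to)."""
    last_err = None
    for attempt_idx in range(n_attempts):
        try:
            return init_fn()
        except RuntimeError as e:
            last_err = e
            print(f"Caught error during init (attempt {attempt_idx} of {n_attempts}): {e}")
            time.sleep(base_sleep * (1 + rank))  # staggered back-off per rank
    raise RuntimeError(f"init failed after {n_attempts} attempts") from last_err
```

The per-rank sleep is the whole trick: every worker retries at a slightly different time instead of all at once.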
so, i went through the install steps, and the sample seems to work for me (it is still running/downloading stuff)
my system: ubuntu 18.04, cuda lib installed 10.2.89, gtx 1060 w/ 6gb.
tom@saturn:~/projects/learning/jukebox$ python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
Using cuda True
{'name': 'sample_5b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 6, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /home/tom/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /home/tom/projects/learning/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /home/tom/projects/learning/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce
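For anyone puzzled by the odd-looking sample length in that log, the arithmetic is just flooring the requested length (in raw audio samples) to a multiple of 128:

```python
# Reproduce the "Setting sample length to 881920" line from the log above.
sr = 44100                        # --sr flag
seconds = 20                      # --sample_length_in_seconds flag
requested = seconds * sr          # 882000 raw audio samples
sample_length = (requested // 128) * 128
print(sample_length)              # 881920
print(sample_length / sr)         # 19.998185941043083 seconds
```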
The project does require a GPU to run; it could work on CPU, but that hasn't been tested and will almost surely be very slow.
@maraoz The NCCL error you see is in initialising torch.distributed, which technically isn't needed for sampling but is unfortunately still present in the code. Maybe initialise it with a different backend, e.g. setup_dist_from_mpi(backend="gloo"), or remove distributed/mpi altogether as done here https://github.com/openai/jukebox/issues/36#issuecomment-624084129
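A sketch of that backend choice (hedged: `setup_dist_from_mpi` is jukebox's helper in `jukebox/utils/dist_utils.py`; the selection function here is illustrative, not part of the repo):

```python
def pick_backend(cuda_available: bool) -> str:
    """NCCL only works with NVIDIA GPUs; gloo runs on CPU-only machines."""
    return "nccl" if cuda_available else "gloo"

# Then, roughly (illustrative call site, check dist_utils.py in your checkout):
#   import torch
#   setup_dist_from_mpi(backend=pick_backend(torch.cuda.is_available()))
```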
@prafullasd @carchrae thank you for your input. Looks like I need to spend a couple of days working on my tooling before I can attempt a build, so I'm going to familiarize myself with notebooks. If there's any way I can help with a docker build in the meantime, testing at least haha ;) lmk
Wowzers... this seems a little more complicated than I thought. I have the hardware and ram and even the choice of iOS vs Mac.... but I don’t code and I usually don’t pirate so I’m “overencumbered” by this foggy paranoia about this entire thing.
@prafullasd how do you initialize with the gloo backend? Is that option passed to sample.py?
@Jimmiexjames I'm having good luck using the colab notebook, at least just getting it running. I ended up using the paid plan to stop timeouts.
I made a jukebox docker image after proving that my local 2080 ti wasn't going to cut it for training. I have only had the opportunity to test it on vast.ai with a 1070 and then 2x V100s but both sampling and training seem to be working. You can spin up a vast instance for less than a dollar an hour and start messing around with it using btrude/jukebox-docker:latest
as your image. IMO this is the easiest way to get going with this project from a hobbyist perspective. There are some minor tweaks to be made to the image but overall it works straight out of the box on vast so if anyone tries it please let me know if it is working for you (especially outside of vast).
@btrude thanks for the update. I'm not sure where to actually find that docker by your description, can you share please? Cheers.
https://hub.docker.com/r/btrude/jukebox-docker I also added btrude/jukebox-docker:apex
for faster training
Hi, noob programmer here -- can I run this on a vast.ai server? How?
Buy credits -> Create -> Edit Image & Config -> Scroll to last option and click the right-hand Select, then type the name of either of my images into the prompt -> Allocate ~30gb of disk space (at least 15+ of that is for the models which get downloaded each time) and click the bottom most Select button
After that you're on your own 🥼
Thank you so much for your help!
I was able to get that far on my own, but now I'm pretty stuck. Do I enter the code in Terminal on my Mac, with something that points the computation at Vast.ai? (I installed all the libraries except CUDA, since I don't have a GPU, and I get an NCCL error.) Or do I enter some code on the Vast.ai server when prompted? Or is it a combination of the two?
I am not necessarily expecting a stranger to just help me out of the blue so I understand if you don't reply. Or even if you want to tell me what to google, that would be helpful.
I'm pretty lost here! Sorry and thank you.
Best
BW
I should say, I installed the libraries on my Mac. I haven't installed anything on the Vast.ai server yet since I don't know how or even if I'm supposed to.
Just clarifying. Thank you! Best wishes
When using vast.ai or just docker on its own you are using virtual machines that have little or no connection to your local machine. So in this case you don't need to install drivers or any software other than ssh in order to connect to a vast instance and run the code (meaning that you can safely remove nvidia related software from your mac, and you will already have ssh as it is built into macOS). Also, the entire point of docker images is that you should not have to install anything and can just begin using them immediately after they are loaded (unless you have some specific need like I outline below). Picking up from my instructions above you should do the following:
Follow this guide https://help.github.com/en/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent through to this page https://help.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account but instead of putting the ssh key into your github account put it into the ssh key box on this page: https://vast.ai/console/account/
Create a vast instance (you need 16GB of vram for anything more than 1b_lyrics with n_samples ~9, so go for a p100 or v100 if that's what you care about; otherwise a 2080 ti/1080 ti will be cheapest, and remember at least 30gb of disk space or you will get errors and nothing will work) and go to https://vast.ai/console/instances/ and wait for it to spin up. Sometimes you have to click the start button quite a few times before it will actually begin (maybe this isn't necessary and it will just do it on its own?). My images should be cached, so if you don't see the blue button transition to "Connect" within a few minutes then it's most likely broken and you should destroy and start over until you are given the "Connect" option after it says the image has successfully loaded (I bring this up because sometimes the instances fail to load and it's not obvious through the UI; this will probably save someone time talking to customer support/wasting credits).
Click "Connect" and a modal will pop up with an ssh command. Copy that command into your terminal, type "yes" when prompted, and then cd /opt/jukebox/ as vast does not take you to the docker image workdir for whatever reason. You should now be connected to the vast instance inside a tmux session. tmux allows your processes to stay running even after you have disconnected from the instance, which is potentially important depending on how long you intend to use it for. See https://tmuxcheatsheet.com/ for important tmux commands, or just do ctrl+b, d to detach from the session, then type exit to exit ssh when you are done. Generally when I detach from an instance I just run nvidia-smi in a separate terminal window (you can connect to the instance in multiple windows using the original ssh command as many times as you need) to determine if the process has finished or not (when the gpu utilization has gone to zero), but if you need to reattach, follow the instructions in the cheat sheet.
In order to pass your own dataset, prompt, or original code, or to recover any samples you made, you will have to use scp (which should also be built into macOS). Take the ssh command provided to you by vast, e.g.: ssh -p 16090 root@ssh5.vast.ai -L 8080:localhost:8080 and pass the relevant info to scp like:
scp -P 16090 root@ssh5.vast.ai:/opt/jukebox/path/to/file.wav ~/path/on/my/local/mac
So if you wanted to transfer a file from the default example in this repo's readme to your desktop it would look like this:
scp -P 16090 root@ssh5.vast.ai:/opt/jukebox/sample_5b/level_0/item_0.wav ~/Desktop
depending on which specific file, or just:
scp -r -P 16090 root@ssh5.vast.ai:/opt/jukebox/sample_5b/ ~/Desktop
if you want to transfer an entire directory. You can also go in the opposite direction if you need to send things to the instance like:
scp -r -P 16090 ~/Desktop/my_audio_dataset/ root@ssh5.vast.ai:/opt/jukebox/
Anyone just messing with sampling should note that the metadata in sample.py is hard-coded, so you may want to install nano (apt-get install nano), then run nano jukebox/sample.py and arrow down (nano is a command line text editor) to line 188 and change the defaults to whatever you want (see here: https://github.com/openai/jukebox/tree/master/jukebox/data/ids for the default options; v3=1b, v2=5b). Ctrl + x, y to save and exit nano.
btrude, first of all, you are amazing. I got everything working through vast.ai and your docker image for the 1B model from your instructions above. When I get a p100 instance going and try to use the 5B model, it says Killed while trying to load the priors. I read that the notebook handles this differently than sample.py, and you can solve the problem by either increasing swap or editing the code in sample.py to match the notebook regarding how it loads the priors. Did you run into this problem with the 5B model on vast.ai? I tried changing the size of my swap but it seems the recommended way of doing that is through the docker config. Vast.ai only has a 1G swap by default, and it doesn't seem like you can change it once connected.
After it says "Killed", what does echo $? say? If it's 137 then yeah, you're out of memory and need to pick an instance with more memory. I don't think I've ever had OOM problems though; the only time I ever saw "Killed" was when I didn't allocate enough disk space.
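A quick sanity check of that exit code, for the curious: 137 = 128 + 9, i.e. the process died from signal 9 (SIGKILL), which is exactly what the kernel OOM killer sends:

```shell
# Simulate a process killed by SIGKILL and inspect the shell's exit status.
sh -c 'kill -9 $$'   # $$ is the subshell's own PID
echo $?              # prints 137 (128 + signal number 9)
```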
Yes, it was a problem with too little memory. I was able to get it all working by finding an instance with enough memory. Cheers!
> Wowzers... this seems a little more complicated than I thought. I have the hardware and ram and even the choice of iOS vs Mac.... but I don’t code and I usually don’t pirate so I’m “overencumbered” by this foggy paranoia about this entire thing.
I use an iOS, I know.
I followed the Install instructions and then ran the sampling command and got:
I tried googling around about this NCCL error (I have no idea what NCCL is), but couldn't find any solutions. Any idea on how to fix this? Thanks!