maraoz opened 4 years ago
given the pain of getting the correct version of nvidia drivers on a system all lined up, i wonder if a docker image of this repo would help. (or alternatively, run this project inside the tf docker image)
https://www.tensorflow.org/install/docker
setting docker up for gpu access requires some extra steps (see step 2 in the link above), but was pretty straightforward.
I'm having the same issue. I agree a docker image would be great. Does anyone know if NCCL is a dependency?
@carchrae I think running a docker image on an EC2 instance, if you don't have CUDA on a Mac, is the way to go right?
@diffractometer - sounds right.
if you do have a cuda supported nvidia card (but not the cuda libs installed) you could probably still run the docker + gpu extensions locally.
otherwise, if you deploy one of these amis it sounds like you get a gpu-enabled docker pre-installed. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html
@carchrae awesome, I'll check that out. My other friend said I should just concentrate on getting it running locally in a jupyter notebook
i suspect you'd hit the same error in jupyter as it seems the code requires cuda/nccl
https://github.com/openai/jukebox/blob/master/jukebox/utils/dist_utils.py#L42
i guess this is really a documentation bug (a common one): the code requires an nvidia gpu. looking at the other issues getting reported, i think you also need a gpu with a lot of ram. i am yet to try it on my card (it has only 6gb ram)
Ah, dang. That makes sense, especially given the results. I'll keep poking...
hmm - or maybe not? and there is a cpu only flag. (i bet it'll be damn slow tho!)
ah, yeah I saw that line and tried to change it, but every time I ran it the error still bubbled up: `Distributed package doesn't have NCCL built in`
did you get any more useful error from this output?
print(f"Caught error during NCCL init (attempt {attempt_idx} of {n_attempts}): {e}")
also, love the comment on the next line
sleep(1 + (0.01 * mpi_rank)) # Sleep to avoid thundering herd
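That retry-with-staggered-sleep pattern is worth a closer look. Here is a minimal sketch of the idea (the function and parameter names are illustrative, not jukebox's actual API), with a short base sleep so it runs quickly:

```python
import time

def init_with_retries(init_fn, n_attempts=5, rank=0, base_sleep=0.01):
    """Retry a flaky initialization. Higher ranks sleep slightly longer
    before retrying, so workers don't all hammer the master at the same
    moment (the "thundering herd" the comment refers to)."""
    last_err = None
    for attempt_idx in range(n_attempts):
        try:
            return init_fn()
        except RuntimeError as e:
            last_err = e
            print(f"Caught error during init (attempt {attempt_idx} of {n_attempts}): {e}")
            time.sleep(base_sleep * (1 + rank))  # staggered back-off per rank
    raise RuntimeError(f"init failed after {n_attempts} attempts") from last_err
```

The per-rank sleep is the whole trick: every worker retries at a slightly different time instead of all at once.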
so, i went through the install steps, and the sample seems to work for me (it is still running/downloading stuff)
my system: ubuntu 18.04, cuda lib installed 10.2.89, gtx 1060 w/ 6gb.
tom@saturn:~/projects/learning/jukebox$ python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
Using cuda True
{'name': 'sample_5b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 6, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /home/tom/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /home/tom/projects/learning/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /home/tom/projects/learning/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce
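For anyone puzzled by the odd-looking sample length in that log, the arithmetic is just flooring the requested length (in raw audio samples) to a multiple of 128:

```python
# Reproduce the "Setting sample length to 881920" line from the log above.
sr = 44100                        # --sr flag
seconds = 20                      # --sample_length_in_seconds flag
requested = seconds * sr          # 882000 raw audio samples
sample_length = (requested // 128) * 128
print(sample_length)              # 881920
print(sample_length / sr)         # 19.998185941043083 seconds
```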
The project does require a GPU to run; it could work on CPU, but that hasn't been tested and will almost surely be very slow.
@maraoz The NCCL error you see is in initialising torch.distributed, which technically isn't needed for sampling but is unfortunately still present in the code. Maybe initialise it with a different backend, e.g. setup_dist_from_mpi(backend="gloo"), or remove distributed/mpi altogether as done here https://github.com/openai/jukebox/issues/36#issuecomment-624084129
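A sketch of that backend choice (hedged: `setup_dist_from_mpi` is jukebox's helper in `jukebox/utils/dist_utils.py`; the selection function here is illustrative, not part of the repo):

```python
def pick_backend(cuda_available: bool) -> str:
    """NCCL only works with NVIDIA GPUs; gloo runs on CPU-only machines."""
    return "nccl" if cuda_available else "gloo"

# Then, roughly (illustrative call site, check dist_utils.py in your checkout):
#   import torch
#   setup_dist_from_mpi(backend=pick_backend(torch.cuda.is_available()))
```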
@prafullasd @carchrae thank you for your input. Looks like I need to spend a couple of days working on my tooling before I can attempt a build, so I'm going to familiarize myself with notebooks. If there's any way I can help with a docker build in the meantime, testing at least haha ;) lmk
Wowzers... this seems a little more complicated than I thought. I have the hardware and ram and even the choice of iOS vs Mac.... but I don’t code and I usually don’t pirate so I’m “overencumbered” by this foggy paranoia about this entire thing.
@prafullasd how do you initialize with the gloo backend? Is that option passed to sample.py?
@Jimmiexjames I'm having good luck using the colab notebook, at least just getting it running. I ended up using the paid plan to stop timeouts.
I made a jukebox docker image after proving that my local 2080 ti wasn't going to cut it for training. I have only had the opportunity to test it on vast.ai with a 1070 and then 2x V100s but both sampling and training seem to be working. You can spin up a vast instance for less than a dollar an hour and start messing around with it using btrude/jukebox-docker:latest
as your image. IMO this is the easiest way to get going with this project from a hobbyist perspective. There are some minor tweaks to be made to the image but overall it works straight out of the box on vast so if anyone tries it please let me know if it is working for you (especially outside of vast).
@btrude thanks for the update. I'm not sure where to actually find that docker by your description, can you share please? Cheers.
https://hub.docker.com/r/btrude/jukebox-docker I also added btrude/jukebox-docker:apex
for faster training
Hi, noob programmer here -- can I run this on a vast.ai server? How?
Buy credits -> Create -> Edit Image & Config -> Scroll to last option and click the right-hand Select, then type the name of either of my images into the prompt -> Allocate ~30gb of disk space (at least 15+ of that is for the models which get downloaded each time) and click the bottom most Select button
After that you're on your own 🥼
Thank you so much for your help!
I was able to get that far on my own, but now I'm pretty stuck. Do I enter the code in Terminal on my Mac, with something that points the computation at Vast.ai? (I installed all the libraries except CUDA, since I don't have a GPU, and I get an NCCL error.) Or do I enter some code on the Vast.ai server when prompted? Or is it a combination of the two?
I am not necessarily expecting a stranger to just help me out of the blue so I understand if you don't reply. Or even if you want to tell me what to google, that would be helpful.
I'm pretty lost here! Sorry and thank you.
Best
BW
I should say, I installed the libraries on my Mac. I haven't installed anything on the Vast.ai server yet since I don't know how or even if I'm supposed to.
Just clarifying. Thank you! Best wishes
When using vast.ai or just docker on its own you are using virtual machines that have little or no connection to your local machine. So in this case you don't need to install drivers or any software other than ssh in order to connect to a vast instance and run the code (meaning that you can safely remove nvidia related software from your mac, and you will already have ssh as it is built into macOS). Also, the entire point of docker images is that you should not have to install anything and can just begin using them immediately after they are loaded (unless you have some specific need like I outline below). Picking up from my instructions above you should do the following:
Follow this guide https://help.github.com/en/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent through to this page https://help.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account but instead of putting the ssh key into your github account put it into the ssh key box on this page: https://vast.ai/console/account/
Create a vast instance (you need 16GB of vram for anything more than 1b_lyrics with n_samples ~9, so go for a p100 or v100 if that's what you care about; otherwise a 2080 ti/1080 ti will be cheapest, and remember at least 30gb of disk space or you will get errors and nothing will work) and go to https://vast.ai/console/instances/ and wait for it to spin up. Sometimes you have to click the start button quite a few times before it will actually begin (maybe this isn't necessary and it will just do it on its own?). My images should be cached, so if you don't see the blue button transition to "Connect" within a few minutes then it's most likely broken and you should destroy and start over until you are given the "Connect" option after it says the image has successfully loaded (I bring this up because sometimes the instances fail to load and it's not obvious through the UI; this will probably save someone time talking to customer support/wasting credits).
Click "Connect" and a modal will pop up with an ssh command. Copy that command into your terminal, type "yes" when prompted, and then cd /opt/jukebox/ as vast does not take you to the docker image workdir for whatever reason. You should now be connected to the vast instance inside a tmux session. tmux allows your processes to stay running even after you have disconnected from the instance, which is potentially important depending on how long you intend to use it for. See https://tmuxcheatsheet.com/ for important tmux commands, or just do ctrl+b, d to detach from the session, then type exit to exit ssh when you are done. Generally when I detach from an instance I just run nvidia-smi in a separate terminal window (you can connect to the instance in multiple windows using the original ssh command as many times as you need) to determine if the process has finished or not (when the gpu utilization has gone to zero), but if you need to reattach, follow the instructions in the cheat sheet.
In order to pass your own dataset, prompt, or original code, or to recover any samples you made, you will have to use scp (which should also be built into macOS). Take the ssh command provided to you by vast, e.g.: ssh -p 16090 root@ssh5.vast.ai -L 8080:localhost:8080 and pass the relevant info to scp like:
scp -P 16090 root@ssh5.vast.ai:/opt/jukebox/path/to/file.wav ~/path/on/my/local/mac
So if you wanted to transfer a file from the default example in this repo's readme to your desktop it would look like this:
scp -P 16090 root@ssh5.vast.ai:/opt/jukebox/sample_5b/level_0/item_0.wav ~/Desktop
depending on which specific file, or just:
scp -r -P 16090 root@ssh5.vast.ai:/opt/jukebox/sample_5b/ ~/Desktop
if you want to transfer an entire directory. You can also go in the opposite direction if you need to send things to the instance like:
scp -r -P 16090 ~/Desktop/my_audio_dataset/ root@ssh5.vast.ai:/opt/jukebox/
Anyone just messing with sampling should note that the metadata in sample.py is hard-coded, so you may want to install nano (apt-get install nano), then run nano jukebox/sample.py and arrow down (nano is a command line text editor) to line 188 and change the defaults to whatever you want (see here: https://github.com/openai/jukebox/tree/master/jukebox/data/ids for the default options; v3=1b, v2=5b). Ctrl + x, y to save and exit nano.
btrude, first of all, you are amazing. I got everything working through vast.ai and your docker image for the 1B model from your instructions above. When I get a p100 instance going and try to use the 5B model, it says Killed while trying to load the priors. I read that the notebook handles this differently than sample.py, and you can solve the problem by either increasing swap or editing the code in sample.py to match the notebook regarding how it loads the priors. Did you run into this problem with the 5B model on vast.ai? I tried changing the size of my swap but it seems the recommended way of doing that is through the docker config. Vast.ai only has a 1G swap by default, and it doesn't seem like you can change it once connected.
After it says "Killed", what does echo $? say? If it's 137 then yeah, you're out of memory and need to pick an instance with more memory. I don't think I've ever had OOM problems though; the only time I ever saw "Killed" was when I didn't allocate enough disk space.
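A quick sanity check of that exit code, for the curious: 137 = 128 + 9, i.e. the process died from signal 9 (SIGKILL), which is exactly what the kernel OOM killer sends:

```shell
# Simulate a process killed by SIGKILL and inspect the shell's exit status.
sh -c 'kill -9 $$'   # $$ is the subshell's own PID
echo $?              # prints 137 (128 + signal number 9)
```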
Yes, it was a problem with too little memory. I was able to get it all working by finding an instance with enough memory. Cheers!
> Wowzers... this seems a little more complicated than I thought. I have the hardware and ram and even the choice of iOS vs Mac.... but I don’t code and I usually don’t pirate so I’m “overencumbered” by this foggy paranoia about this entire thing.
I use an iOS, I know.
I followed the Install instructions and then ran the sampling command and got:
I tried googling around about this NCCL error (I have no idea what NCCL is), but couldn't find any solutions. Any idea on how to fix this? Thanks!