PiotrMi opened this issue 4 years ago
Excessive GPU usage on Colab is causing the code to be killed.
Do you have an idea which parameters I can tweak to reduce the load on the GPU? I tried fewer samples and shorter sample lengths.
I get what seems to be the same issue, though. This is with 2× 1080 Ti on Ubuntu:
```
0: Loading prior in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /home/taj/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /home/taj/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:1, Cond downsample:4, Raw to tokens:32, Sample length:262144
Downloading from gce
Restored from /home/taj/.cache/jukebox-assets/models/5b/prior_level_1.pth.tar
0: Loading prior in eval mode
Killed
```
> Do you have an idea which parameters I can tweak to reduce the load on the GPU? I tried lower samples and lower sample lengths
Reduce n_samples to 2-4 and use model=1b_lyrics; it works fine in Colab.
> I get this seems to be the same issue though. This is with 2 1080 TI on Ubuntu
How much RAM do you have in your system (not GPU)? I had this problem, but managed to run it after I added more swap to my system memory.
My low-end specs: Ubuntu 18, 16GB RAM + 16GB Swap, 1x1070 (8GB). I can run 1b_lyrics with n_samples=4 (barely).
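If you want to check how much headroom you actually have before tweaking swap, a small Linux-only sketch can read it from `/proc/meminfo`. The helper name `meminfo_gb` is hypothetical, not part of Jukebox:

```python
# Hypothetical helper: report total RAM and swap on Linux by parsing /proc/meminfo.
import os

def meminfo_gb():
    """Return {'MemTotal': ..., 'SwapTotal': ...} in GB, or None off Linux."""
    if not os.path.exists("/proc/meminfo"):
        return None
    kb = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            kb[key] = int(rest.split()[0])  # /proc/meminfo reports values in kB
    return {k: round(kb[k] / 1024**2, 1) for k in ("MemTotal", "SwapTotal")}

print(meminfo_gb())
```

On a setup like the one above you would expect roughly `{'MemTotal': 16.0, 'SwapTotal': 16.0}`; if `SwapTotal` is near zero, adding swap is the first thing to try.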
> Do you have an idea which parameters I can tweak to reduce the load on the GPU? I tried lower samples and lower sample lengths

> reduce the n_samples to 2 - 4 and model: 1b_lyrics it works fine in colab.
Tried this out today. Unfortunately it still stopped execution. Not sure whether many people were using Colab at the time. I think I had 16 GB RAM available.
Are you running `sample.py` or `jukebox/interacting_with_jukebox.ipynb`? The notebook seems to be newer and needs less memory.

`sample.py` needs more system RAM because it loads all three priors at once. The code in the Jupyter notebook loads only the lvl2 prior, draws samples, deallocates memory, and only afterwards loads the lvl1 and lvl0 upsamplers.

Also, the code in the notebook has different hyperparameters than `sample.py`, so you might want to play around with sample length, total length, etc. I had to figure this out yesterday.
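The notebook's staged approach can be sketched with stdlib stand-ins. The names `run_stage` and the loader lambdas below are hypothetical placeholders, not the Jukebox API; with real PyTorch models you would also call `torch.cuda.empty_cache()` after dropping the reference:

```python
# Sketch of the notebook's staged pattern: keep only one model alive at a time.
import gc

def run_stage(load_fn, work_fn, state):
    model = load_fn()             # e.g. load only the level-2 prior
    state = work_fn(model, state) # draw samples / upsample
    del model                     # drop the last reference before the next stage
    gc.collect()                  # with PyTorch, also torch.cuda.empty_cache()
    return state

# Dummy stages standing in for: top-level prior, lvl1 upsampler, lvl0 upsampler.
stages = [
    (lambda: "top_prior",      lambda m, s: s + [m]),
    (lambda: "upsampler_lvl1", lambda m, s: s + [m]),
    (lambda: "upsampler_lvl0", lambda m, s: s + [m]),
]

state = []
for load_fn, work_fn in stages:
    state = run_stage(load_fn, work_fn, state)
print(state)  # each stage ran with only its own model in memory
```

This is why the notebook fits in less RAM than `sample.py`: peak usage is one prior plus the intermediate codes, not all three priors at once.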
@ttbrunner Could you please share your Colab parameters? Are you running successfully locally or hosted? I managed to get a local setup working (I changed "5b_lyrics" to "1b_lyrics"). I have an RTX 2060 and 16 GB RAM: https://ghostbin.co/paste/no6d6
but at this step:

```python
zs = [t.zeros(hps.n_samples, 0, dtype=t.long, device='cuda') for _ in range(len(priors))]
zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
```
I get the error:

```
RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 5.76 GiB total capacity; 4.50 GiB already allocated; 12.19 MiB free; 4.82 GiB reserved in total by PyTorch)
```
@basiclaser Hey, your card has 6 GB of video RAM; mine has 8 GB. That might be the problem. I guess you can try n_samples=1, but if that doesn't work, then I guess you can't run it :(
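The "keep lowering n_samples until it fits" advice can be automated. This is a hypothetical helper, not part of Jukebox; with real PyTorch you would catch the CUDA out-of-memory error instead of `MemoryError`:

```python
# Hypothetical helper: halve the batch size after an out-of-memory failure.
def sample_with_backoff(sample_fn, n_samples, min_n=1):
    n = n_samples
    while n >= min_n:
        try:
            return sample_fn(n)  # e.g. run the top-prior sampling step
        except MemoryError:      # with PyTorch, catch torch.cuda.OutOfMemoryError
            n = n // 2           # retry with a smaller batch
    raise MemoryError(f"even n_samples={min_n} does not fit")

# Toy demo: pretend anything above 1 sample runs out of memory.
def fake_sample(n):
    if n > 1:
        raise MemoryError
    return [0.0] * n

print(sample_with_backoff(fake_sample, 4))  # falls back 4 -> 2 -> 1, prints [0.0]
```

In practice memory already reserved by a failed attempt may not be released cleanly, so restarting the process with the smaller n_samples is often more reliable than retrying in-place.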
@ttbrunner Yeah, I gave it a shot: lots of crashing and then memory allocation errors :) Thanks for your advice though. Are most people running this thing entirely in the cloud? I'm curious about the costs of using it in a hosted way.
Has anyone solved this for standalone Linux PCs with a GPU? We are running out of memory and crashing right off the bat. I have a 1080 Ti and can't get it working at any n_samples.
Yes, I can run it on Ubuntu 18 with a 1070. To sum up:

It's possible to save the intermediate results to disk with `torch.save(zs, PATH)` and continue later. On a low-end machine, I would create two separate scripts: one for drawing from the top prior and one for the upsampling. Sampling the top prior takes just a couple of minutes, and you can already listen to it and decide whether you want to upsample it.
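A minimal sketch of that two-script split, with save/load helpers. For real Jukebox tensors you would use `torch.save(zs, path)` and `torch.load(path)`; `pickle` stands in here so the sketch stays dependency-free, and the helper and file names are hypothetical:

```python
# Sketch: persist intermediate codes so sampling and upsampling can run
# as separate processes. Real Jukebox code would use torch.save/torch.load.
import os
import pickle

def save_codes(zs, path):
    with open(path, "wb") as f:
        pickle.dump(zs, f)

def load_codes(path):
    if not os.path.exists(path):
        return None  # no top-prior run yet
    with open(path, "rb") as f:
        return pickle.load(f)

# Script 1 (top prior) would end with:   save_codes(zs, "top_level_codes.pkl")
# Script 2 (upsampling) would start with: zs = load_codes("top_level_codes.pkl")
```

Since each script exits between stages, the OS reclaims all of the first model's memory before the upsamplers load, which is the most robust version of the load-then-free pattern on a low-RAM machine.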
Mine crashes as soon as I run `rank, local_rank, device = setup_dist_from_mpi()`.
Tried running in Google Colab.
Notebook link: https://colab.research.google.com/drive/1qvJ2YCaB2LYbERgqe_I9gHaLyFB-E7o6
Tried 5b_lyrics, 1b_lyrics, and 1b_lyrics with fewer samples or shorter lengths, but it just stops.
5b_lyrics:

```shell
python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
```
output:
```
Using cuda True
{'name': 'sample_5b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 6, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/prior_level_0.pth.tar
0: Loading prior in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:1, Cond downsample:4, Raw to tokens:32, Sample length:262144
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/prior_level_1.pth.tar
0: Loading prior in eval mode
^C
```
1b_lyrics:

```shell
python jukebox/sample.py --model=1b_lyrics --name=sample_1b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125
```
output:
```
Using cuda True
{'name': 'sample_1b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 16, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/prior_level_0.pth.tar
0: Loading prior in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:1, Cond downsample:4, Raw to tokens:32, Sample length:262144
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/prior_level_1.pth.tar
0: Loading prior in eval mode
Creating cond. autoregress with prior bins [79, 2048], dims [384, 6144], shift [ 0 79]
input shape 6528
input bins 2127
Self copy is False
^C
```
Mind you, I never tried to interrupt the script, so I don't know where the ^C comes from.