openai / jukebox

Code for the paper "Jukebox: A Generative Model for Music"
https://openai.com/blog/jukebox/

sample.py just stops running without any clear error #19

Open PiotrMi opened 4 years ago

PiotrMi commented 4 years ago

Tried running in Google Colab.

Notebook link: https://colab.research.google.com/drive/1qvJ2YCaB2LYbERgqe_I9gHaLyFB-E7o6

Tried 5b_lyrics, 1b_lyrics, and 1b_lyrics with fewer samples or shorter lengths, but it just stops.

5b-lyrics:

```
python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
```

Output:

```
Using cuda True
{'name': 'sample_5b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 6, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/prior_level_0.pth.tar
0: Loading prior in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:1, Cond downsample:4, Raw to tokens:32, Sample length:262144
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/prior_level_1.pth.tar
0: Loading prior in eval mode
^C
```

1b-lyrics:

```
python jukebox/sample.py --model=1b_lyrics --name=sample_1b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125
```

Output:

```
Using cuda True
{'name': 'sample_1b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 16, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/prior_level_0.pth.tar
0: Loading prior in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /content/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:1, Cond downsample:4, Raw to tokens:32, Sample length:262144
Downloading from gce
Restored from /root/.cache/jukebox-assets/models/5b/prior_level_1.pth.tar
0: Loading prior in eval mode
Creating cond. autoregress with prior bins [79, 2048], dims [384, 6144], shift [ 0 79]
input shape 6528
input bins 2127
Self copy is False
^C
```

Mind you, I never tried to interrupt the script, so I don't know where the ^C is coming from.

maddy023 commented 4 years ago

Colab is closing the code because it exceeds the GPU usage limits.

PiotrMi commented 4 years ago

Do you have an idea which parameters I can tweak to reduce the load on the GPU? I tried lower samples and lower sample lengths

BlueProphet commented 4 years ago

I get this too; it seems to be the same issue. This is with 2× 1080 Ti cards on Ubuntu:

```
0: Loading prior in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /home/taj/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /home/taj/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:1, Cond downsample:4, Raw to tokens:32, Sample length:262144
Downloading from gce
Restored from /home/taj/.cache/jukebox-assets/models/5b/prior_level_1.pth.tar
0: Loading prior in eval mode
Killed
```

maddy023 commented 4 years ago

> Do you have an idea which parameters I can tweak to reduce the load on the GPU? I tried lower samples and lower sample lengths

Reduce n_samples to 2-4 and use model 1b_lyrics; it works fine in Colab.

ttbrunner commented 4 years ago

> I get this seems to be the same issue though. This is with 2 1080 TI on Ubuntu

How much RAM do you have in your system (not GPU)? I had this problem, but managed to run it after I added more swap to my system memory.

My low-end specs: Ubuntu 18, 16GB RAM + 16GB Swap, 1x1070 (8GB). I can run 1b_lyrics with n_samples=4 (barely).

PiotrMi commented 4 years ago

> Do you have an idea which parameters I can tweak to reduce the load on the GPU? I tried lower samples and lower sample lengths
>
> reduce the n_samples to 2 - 4 and model: 1b_lyrics it works fine in colab.

Tried this out today. Unfortunately it still stopped execution. I'm not sure how many people were using Colab at the moment. I think I had 16 GB RAM available.

ttbrunner commented 4 years ago

Are you running sample.py or jukebox/interacting_with_jukebox.ipynb? The notebook seems to be new and needs less memory.

sample.py needs more system RAM because it loads all three priors at once. The code in the jupyter notebook loads only the lvl2 prior, draws samples, deallocates memory, and only afterwards loads the lvl1 and lvl0 upsamplers.

Also, the code in the notebook has different hyperparams than sample.py, so you might want to play around with sample length, total length, etc. I had to figure this out yesterday.
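The staged approach described above can be sketched roughly like this. Note this is a simplified illustration, not Jukebox's actual API: `load_prior` and `sample_with` are hypothetical stand-ins, and with PyTorch you would additionally call `torch.cuda.empty_cache()` after each `del`:

```python
import gc

def staged_sampling(load_prior, sample_with, level_names):
    """Load one prior at a time, sample with it, and free it before
    loading the next, so only one model is ever resident in memory."""
    results = {}
    for name in level_names:
        prior = load_prior(name)            # e.g. the top-level prior first
        results[name] = sample_with(prior)  # draw samples at this level
        del prior                           # drop the only reference
        gc.collect()                        # reclaim memory before the next load
    return results
```

This is why the notebook's peak memory stays close to the size of a single prior, while sample.py's peak is the sum of all three.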

made-by-chris commented 4 years ago

@ttbrunner could you please share your Colab parameters? Are you running successfully locally or hosted? I managed to get a local setup working (I changed "5b_lyrics" to "1b_lyrics"). I have an RTX 2060 and 16 GB RAM: https://ghostbin.co/paste/no6d6

but at this step:

```python
zs = [t.zeros(hps.n_samples, 0, dtype=t.long, device='cuda') for _ in range(len(priors))]
zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
```

I get the error:

```
RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 5.76 GiB total capacity; 4.50 GiB already allocated; 12.19 MiB free; 4.82 GiB reserved in total by PyTorch)
```
ttbrunner commented 4 years ago

@basiclaser Hey, your card has 6GB video RAM, mine has 8GB. That might be the problem. I guess you can try n_samples=1, but if it doesn't work then I guess you can't run it :(

made-by-chris commented 4 years ago

@ttbrunner yeah, I gave it a shot: lots of crashing and then memory-allocation errors :) Thanks for your advice though. Are most people running this thing entirely in the cloud? I'm curious about the costs of using it in a hosted way.

stevedipaola commented 4 years ago

Has anyone solved this for those of us with GPUs on standalone Linux PCs? We are running out of memory and crashing right off the bat. I have a 1080 Ti and can't get it working at any n_samples.

ttbrunner commented 4 years ago

Yes, I can run it on Ubuntu 18 with a 1070. To sum up:

It's possible to save the intermediate results to disk with torch.save(zs, PATH) and continue later. On a low-end machine I would create two separate scripts: one for drawing from the top prior and one for the upsampling. Sampling the top prior takes just a couple of minutes; you can already listen to it and decide whether you want to upsample it.
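That save-and-resume split might look roughly like this. The names `draw_top_level_tokens` and `upsample` are hypothetical placeholders for the actual sampling calls, and stdlib `pickle` stands in for `torch.save`/`torch.load` so the sketch is self-contained:

```python
import pickle

CHECKPOINT = "zs_top_level.pkl"

def sample_top_prior(draw_top_level_tokens):
    """Script 1: sample the top-level prior only and persist the tokens."""
    zs = draw_top_level_tokens()
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(zs, f)                  # torch.save(zs, PATH) in Jukebox
    return zs

def upsample_later(upsample):
    """Script 2: reload the saved tokens and run the upsamplers."""
    with open(CHECKPOINT, "rb") as f:
        zs = pickle.load(f)                 # torch.load(PATH) in Jukebox
    return upsample(zs)
```

Running the two scripts in separate processes means the top prior's memory is fully released before the upsamplers are ever loaded.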

terrafying commented 4 years ago

Mine crashes as soon as I run `rank, local_rank, device = setup_dist_from_mpi()`.