openai / jukebox

Code for the paper "Jukebox: A Generative Model for Music"
https://openai.com/blog/jukebox/

Reuse pre-trained VQ-VAE throws RuntimeError #99

Open ObscuraDK opened 4 years ago

ObscuraDK commented 4 years ago

Hi there. I am trying to reuse the pre-trained VQ-VAE with 118 audio files (44.1 kHz, 16-bit) on a 1080 Ti.

Executing this: mpiexec -n 1 python jukebox/train.py --hps=vqvae,small_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_prior --sample_length=1048576 --bs=4 --aug_shift --aug_blend --audio_files_dir='/home/vertigo/jukebox/learning' --labels=False --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000

I get this:

Using cuda True
0: Found 118 files. Getting durations
0: self.sr=44100, min: 24, max: inf
0: Keeping 118 of 118 files
{'l2': 0.010829454215347525, 'l1': 0.07307693362236023, 'spec': 4325.51904296875}
Creating Data Loader
0: Train 859 samples. Test 96 samples
0: Train sampler: <torch.utils.data.distributed.DistributedSampler object at 0x7f29043f8110>
0: Train loader: 214
Downloading from gce
Restored from /home/vertigo/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Parameters VQVAE:0
Level:2, Cond downsample:None, Raw to tokens:128, Sample length:1048576
0: Converting to fp16 params
0: Loading prior in train mode
Parameters Prior:161862656
{'dynamic': True, 'loss_scale': 65536.0, 'max_loss_scale': 16777216.0, 'scale_factor': 1.0027764359010778, 'scale_window': 1, 'unskipped': 0, 'overflow': False}
Using CPU EMA
Logging to logs/pretrained_vqvae_small_prior
0/214 [00:08<?, ?it/s]
Traceback (most recent call last):
  File "jukebox/train.py", line 341, in <module>
    fire.Fire(run)
  File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/train.py", line 325, in run
    train_metrics = train(distributed_model, model, opt, shd, scalar, ema, logger, metrics, data_processor, hps)
  File "jukebox/train.py", line 241, in train
    opt.step(scale=clipped_grad_scale(grad_norm, hps.clip, scale))
  File "/home/vertigo/jukebox/jukebox/utils/fp16.py", line 213, in step
    group["weight_decay"],
  File "/home/vertigo/jukebox/jukebox/utils/fp16.py", line 29, in adam_step
    p.add_(exp_avg/denom + weight_decay*p.float(), alpha=-step_size)
RuntimeError: output with backend CUDA and dtype Half doesn't match the desired backend CUDA and dtype Float
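For context, the failing line updates the parameter tensor in place with an fp32 result while the parameter itself is fp16, and in-place ops cannot implicitly cast Float down to Half. A minimal sketch of the mismatch (hypothetical shapes and values; assumes a CUDA device, and note that newer PyTorch versions word the error differently):

import torch

# fp16 parameter updated in place, fp32 optimizer state (hypothetical values)
p = torch.zeros(4, device='cuda', dtype=torch.float16)
exp_avg = torch.randn(4, device='cuda', dtype=torch.float32)
denom = torch.rand(4, device='cuda', dtype=torch.float32) + 1e-8
weight_decay, step_size = 0.01, 0.0

# The update is fp32, but p is fp16: add_ refuses to cast the Float result
# down to Half, raising the RuntimeError quoted above.
update = exp_avg / denom + weight_decay * p.float()
try:
    p.add_(update, alpha=-step_size)
except RuntimeError as e:
    print(e)

# One possible workaround (an assumption, not the repo's official fix):
# cast the fp32 update back to the parameter's dtype before the in-place add.
p.add_(update.to(p.dtype), alpha=-step_size)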

prafullasd commented 4 years ago

Can you print the data types of all variables in this step, as well as the inputs to the function adam_step?
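For example (a hypothetical debug line, not code from the repo), one could drop something like this into jukebox/utils/fp16.py just before the failing p.add_(...) call:

# Hypothetical debug print inside adam_step, before the in-place add:
print(f"p: {p.dtype}, exp_avg: {exp_avg.dtype}, denom: {denom.dtype}, "
      f"weight_decay*p.float(): {(weight_decay * p.float()).dtype}, "
      f"step_size: {step_size}")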

ObscuraDK commented 4 years ago

Hi prafullasd.

Where do I find the step you would like to see the variables from? The variables from adam_step (line 29 in fp16.py) are pasted below:

A side note: I tried running without the 'all_fp16' parameter and got a little further.

Thank you for your help.

exp_avg: tensor([[-1.2535e-05, -7.5248e-07, -1.6771e-05, ..., 8.8499e-06, -5.9975e-06, -9.2271e-06]], device='cuda:0')
denom: tensor([[3.9739e-06, 2.4796e-07, 5.3135e-06, ..., 2.8086e-06, 1.9066e-06, 2.9279e-06]], device='cuda:0')

weight_decay*p.float(): tensor([[ 1.2296e-05, -7.0621e-05, -4.8056e-05, ..., -1.3472e-04, 8.2957e-06, 6.4212e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[ 3.3316e-06, -6.2995e-06, 4.2114e-07, ..., -1.0788e-05, 2.5579e-06, -9.8463e-06], [-2.7420e-06, 2.7076e-06, -4.6100e-06, ..., 1.7860e-06, 7.7512e-08, 1.2060e-06], [ 9.8920e-07, -7.2325e-06, 5.6269e-06, ..., -5.2558e-06, 1.7770e-06, -1.1786e-05], ..., [-4.0166e-06, 3.6385e-06, -6.9851e-06, ..., 3.4009e-06, -7.2514e-07, 1.7762e-06], [-2.6218e-06, 2.4748e-06, -4.4383e-06, ..., 1.9369e-06, -8.6355e-09, 1.2277e-06], [-4.2350e-06, 5.1437e-08, -4.2196e-06, ..., -2.2541e-07, -1.8059e-06, -3.9884e-07]], device='cuda:0')
denom: tensor([[1.0635e-06, 2.0021e-06, 1.4318e-07, ..., 3.4214e-06, 8.1887e-07, 3.1237e-06], [8.7708e-07, 8.6623e-07, 1.4678e-06, ..., 5.7479e-07, 3.4512e-08, 3.9138e-07], [3.2281e-07, 2.2971e-06, 1.7894e-06, ..., 1.6720e-06, 5.7193e-07, 3.7370e-06], ..., [1.2802e-06, 1.1606e-06, 2.2189e-06, ..., 1.0855e-06, 2.3931e-07, 5.7167e-07], [8.3908e-07, 7.9259e-07, 1.4135e-06, ..., 6.2250e-07, 1.2731e-08, 3.9825e-07], [1.3492e-06, 2.6266e-08, 1.3444e-06, ..., 8.1282e-08, 5.8107e-07, 1.3613e-07]], device='cuda:0')

weight_decay*p.float(): tensor([[ 4.8816e-05, 5.4078e-05, 1.6971e-04, ..., -4.7752e-05, -1.0746e-04, 1.4871e-04], [-1.7761e-04, 1.6470e-05, 6.5703e-05, ..., -1.8517e-04, 1.9216e-04, -1.2899e-04], [-4.5714e-05, -2.5560e-04, 9.7563e-05, ..., 1.4601e-04, -3.1741e-05, -2.3758e-04], ..., [ 2.4744e-04, -1.1324e-04, 2.1962e-05, ..., -1.7920e-04, 1.2483e-04, -1.2064e-05], [ 5.4863e-05, -1.9226e-05, 6.8406e-05, ..., -8.6350e-05, -1.7733e-04, 2.1887e-04], [ 2.1376e-04, -6.3089e-05, -1.3905e-04, ..., -3.5342e-05, -2.2315e-05, -9.9548e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[-1.2535e-05, -7.5248e-07, -1.6771e-05, ..., 8.8499e-06, -5.9975e-06, -9.2271e-06], [ 9.2020e-07, 4.9647e-07, -1.0082e-05, ..., 1.0352e-06, 5.9144e-07, -8.2584e-06], [ 5.7311e-07, 1.9648e-06, -8.5980e-06, ..., 6.2198e-07, 7.5082e-07, -1.0802e-05], ..., [-1.9820e-07, -6.6953e-08, -6.9843e-07, ..., -1.0760e-08, -2.1478e-07, -1.3698e-07], [ 1.2750e-07, -1.2141e-07, -4.7240e-07, ..., 2.8985e-07, -5.4958e-07, -2.6231e-07], [-1.2559e-07, -4.0557e-07, 2.3975e-07, ..., 3.0800e-07, 7.6356e-07, 3.8858e-07]], device='cuda:0')
denom: tensor([[3.9739e-06, 2.4796e-07, 5.3135e-06, ..., 2.8086e-06, 1.9066e-06, 2.9279e-06], [3.0099e-07, 1.6700e-07, 3.1981e-06, ..., 3.3734e-07, 1.9703e-07, 2.6215e-06], [1.9123e-07, 6.3132e-07, 2.7289e-06, ..., 2.0669e-07, 2.4743e-07, 3.4260e-06], ..., [7.2678e-08, 3.1172e-08, 2.3086e-07, ..., 1.3403e-08, 7.7919e-08, 5.3316e-08], [5.0320e-08, 4.8394e-08, 1.5939e-07, ..., 1.0166e-07, 1.8379e-07, 9.2950e-08], [4.9716e-08, 1.3825e-07, 8.5815e-08, ..., 1.0740e-07, 2.5146e-07, 1.3288e-07]], device='cuda:0')

weight_decay*p.float(): tensor([[ 7.7831e-06, -7.0554e-05, 5.0016e-06, ..., 1.0374e-04, 1.8247e-04, 2.1118e-04], [-1.0152e-04, 1.1799e-05, 1.6475e-05, ..., -1.5426e-05, 3.7702e-05, 8.4314e-05], [ 1.5354e-04, 2.9751e-05, 4.2645e-05, ..., 3.1049e-05, 5.8405e-05, 2.2491e-05], ..., [-5.2624e-05, -1.5191e-04, -5.7607e-05, ..., -1.5991e-06, -3.6916e-05, -7.8471e-05], [ 4.3725e-05, 4.3821e-05, 2.3359e-05, ..., 2.9689e-05, -5.9945e-05, 1.8269e-04], [-8.5394e-05, 3.6782e-05, -2.0110e-05, ..., -1.5885e-05, -1.1521e-05, 4.9064e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[ 2.9516e-07, -5.2174e-07, 9.7740e-07, ..., -3.3729e-05, -2.4057e-06, 3.5328e-05], [ 2.9855e-06, 1.1829e-06, -3.5897e-06, ..., 3.8940e-06, 3.0810e-06, -1.3207e-04], [-9.1298e-07, -9.2131e-07, 2.4368e-06, ..., -9.2853e-06, 8.0324e-06, 4.7875e-05], ..., [ 1.1707e-06, 7.9413e-07, -2.2969e-06, ..., -5.0602e-06, 6.1309e-06, -4.5671e-05], [-1.1185e-06, -5.1674e-07, 2.8389e-06, ..., -2.6390e-05, 8.6911e-07, 6.6641e-05], [ 6.2823e-08, -7.5040e-08, -6.5752e-07, ..., -1.1183e-07, -9.9606e-06, 2.5981e-05]], device='cuda:0')
denom: tensor([[1.0334e-07, 1.7499e-07, 3.1908e-07, ..., 1.0676e-05, 7.7076e-07, 1.1182e-05], [9.5410e-07, 3.8406e-07, 1.1452e-06, ..., 1.2414e-06, 9.8430e-07, 4.1775e-05], [2.9871e-07, 3.0134e-07, 7.8059e-07, ..., 2.9463e-06, 2.5501e-06, 1.5149e-05], ..., [3.8019e-07, 2.6113e-07, 7.3634e-07, ..., 1.6102e-06, 1.9488e-06, 1.4452e-05], [3.6369e-07, 1.7341e-07, 9.0774e-07, ..., 8.3552e-06, 2.8484e-07, 2.1084e-05], [2.9866e-08, 3.3730e-08, 2.1793e-07, ..., 4.5364e-08, 3.1598e-06, 8.2259e-06]], device='cuda:0')

weight_decay*p.float(): tensor([[ 2.3758e-04, 1.2268e-04, -1.3771e-04, ..., -2.8629e-05, 4.6849e-06, 9.8801e-05], [-1.5091e-04, -2.2690e-04, -5.1804e-05, ..., 2.1317e-04, 1.9440e-04, -1.8677e-04], [-4.9248e-05, 1.4137e-04, 9.7418e-06, ..., 8.4877e-06, 9.6817e-05, 5.9166e-05], ..., [ 2.5085e-04, -4.0321e-05, -1.9638e-04, ..., 2.2888e-06, 9.4833e-05, 6.6032e-05], [ 1.0468e-04, -2.8305e-04, -1.6296e-04, ..., 2.0859e-04, -9.9716e-05, -6.4049e-05], [-1.3451e-04, -1.6785e-04, 4.7226e-05, ..., 7.4501e-05, -8.6288e-05, 1.0010e-04]], device='cuda:0')
step_size: 0.0
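One detail the pasted output already reveals: none of these tensors print a dtype suffix, and PyTorch's repr omits the dtype only for the default fp32. So exp_avg, denom, and weight_decay*p.float() are all fp32, which points at p itself being the Half tensor in the failing add_. A quick illustration (hypothetical values):

import torch

a = torch.zeros(2)                      # default dtype is fp32
b = torch.zeros(2, dtype=torch.float16)
print(a)  # tensor([0., 0.])                       -> no suffix for the default fp32
print(b)  # tensor([0., 0.], dtype=torch.float16)  -> non-default dtypes are shown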

worosom commented 3 years ago

I was also running into this issue. Then I installed apex, and now it all works. It seems the "standard" (non-apex) implementation of the adam_step function doesn't work with fp16 models.
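That would fit the code structure: the pure-Python adam_step in jukebox/utils/fp16.py appears to be a fallback path taken only when apex's fused CUDA kernel cannot be imported. A rough sketch of that pattern (illustrative only; the exact import and names in the repo may differ):

# Sketch of the fused-kernel-or-fallback pattern (names are illustrative):
try:
    import fused_adam_cuda  # extension built when apex is installed with --cuda_ext
    HAVE_FUSED_ADAM = True
except ImportError:
    HAVE_FUSED_ADAM = False  # fall back to the pure-Python adam_step

print("using apex fused adam kernel" if HAVE_FUSED_ADAM
      else "using python adam_step fallback")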