magenta / magenta-demos

Demonstrations of Magenta Models
Apache License 2.0

nsynth generate.py stuck, doesn't use GPU, and generates 31GB of long .wav files #86

Closed. MarkTension closed this issue 4 years ago.

MarkTension commented 4 years ago

Hi all!

I'm trying to run generate.py with the following settings:

    {
      "instruments": [["cupoo", "bush"],
                      ["cu", "kemphur"]],
      "checkpoint_dir": "./wavenet-ckpt",
      "pitches": [24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84],
      "resolution": 9,
      "final_length": 64000,
      "gpus": 1,
      "batch_size_embeddings": 32,
      "batch_size_generate": 256,
      "name": "synth_1"
    }

The soxi output of my input .wav files looks fine:

Input File     : 'bush_48.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:04.17 = 66730 samples ~ 312.797 CDDA sectors
File Size      : 134k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
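
(Those numbers are internally consistent; a quick arithmetic check, nothing nsynth-specific:)

```python
samples = 66730
sample_rate = 16000
bytes_per_sample = 2  # 16-bit signed PCM, mono

print(samples / sample_rate)               # ~4.17 s, matching the reported duration
print(samples * bytes_per_sample / 1000)   # ~133 kB of audio data; with the header that's the reported 134k
```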

But for some reason the generation won't complete. I get as far as producing the embedding files, but at the interpolation stage I waited over 12 hours for a single batch to generate and then cancelled it. I'm running on an NVIDIA GeForce GTX 1080.

When checking the working directory I see a lot of generated .wav files (31 GB) in the batch0 folder, but each of them is ~34 minutes long. That can't be right, can it?

I think the main problem is that I'm getting a lot of TensorFlow warnings about most resources being placed on the CPU, and nvidia-smi shows no GPU activity while it runs. I've tried on two computers and both give the same warnings.

2020-05-05 08:47:40.259450: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Assign: CPU
Add: CPU
Const: CPU
RandomUniform: CPU
Sub: CPU
VariableV2: CPU
Mul: CPU
Identity: CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
  skip_30/biases/Initializer/random_uniform/shape (Const)
  skip_30/biases/Initializer/random_uniform/min (Const)
  skip_30/biases/Initializer/random_uniform/max (Const)
  skip_30/biases/Initializer/random_uniform/RandomUniform (RandomUniform)
  skip_30/biases/Initializer/random_uniform/sub (Sub)
  skip_30/biases/Initializer/random_uniform/mul (Mul)
  skip_30/biases/Initializer/random_uniform (Add)
  skip_30/biases (VariableV2) /device:GPU:0
  skip_30/biases/Assign (Assign) /device:GPU:0
  skip_30/biases/read (Identity) /device:GPU:0
  save/Assign_233 (Assign) /device:GPU:0

Any ideas of what's going wrong? Or ideas on how to debug? Thanks in advance!

jesseengel commented 4 years ago

Hi, it does seem that the reason it takes so long is that it's running on the CPU. You might want to confirm that your TensorFlow install is able to use the GPU on your machine outside of this script. The long files, however, are very strange. @JCBrouwer do you have any thoughts on what could be going on?

MarkTension commented 4 years ago

Thanks @jesseengel. tf.test.is_gpu_available() did report that a GPU is available, but I'll give it a more thorough check by actually running a model on it.
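
Something along these lines is what I have in mind (just a sketch for TF 1.x; the matmul test is illustrative, not part of the nsynth code):

```python
import tensorflow as tf  # TF 1.x, as used by the nsynth wavenet model

# Quick availability check.
print("GPU available:", tf.test.is_gpu_available())

# More thorough: pin a small op to the GPU and log device placement.
with tf.Graph().as_default():
    with tf.device("/device:GPU:0"):
        a = tf.random_uniform([1024, 1024])
        b = tf.matmul(a, a)
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        sess.run(b)  # raises a placement error if the GPU isn't actually usable
```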

Maybe relevant: there were also some errors in the generate.py code itself. np.linspace kept throwing errors about its third argument having to be an integer (it was a float); maybe a NumPy/Python version issue? I'm on numpy 1.18.4 and Python 3.6. I temporarily silenced it by rounding that argument to an int with np.int(), but could the cause of the overly long files be a similar issue? I tried on Ubuntu and Windows and both needed that modification.
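
The cast looks something like this (variable names are made up, not the actual ones in generate.py):

```python
import numpy as np

start, stop = 0.0, 1.0
num_steps = 9.0  # recent NumPy requires num to be an integer

# grid = np.linspace(start, stop, num_steps)  # TypeError on numpy >= 1.18

# Casting only the third argument keeps the float start/stop values intact:
grid = np.linspace(start, stop, int(round(num_steps)))
```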

JCBrouwer commented 4 years ago

I was able to reproduce the issue as well, ~~but also managed to get it working correctly~~.

The issue for me was indeed related to executing on the CPU. To get it working I uninstalled magenta and made sure only magenta-gpu was installed. Also, my tf.test.is_gpu_available() showed an error because cuDNN wasn't installed. That's super easy to fix if you use Anaconda; a plain "conda install cudnn" was enough.

~~The 30-minute wavs are related to the way the files are gradually filled in, 10000 samples at a time, by nsynth_generate. I think it's just the metadata of the half-finished audio being incorrect; once they're completely done rendering they should show the correct length.~~

A good way to test whether you're running on the GPU correctly is to run the nsynth_generate command manually (generate.py is running this under the hood for you):

nsynth_generate \
--checkpoint_path=/home/hans/code/magenta/magenta/models/nsynth/wavenet/wavenet-ckpt/model.ckpt-200000 \
--source_path=/home/hans/code/magenta-demos/nsynth/working_dir/embeddings/interp/batch0 \
--save_path=/home/hans/code/magenta-demos/nsynth/working_dir/audio/batch0 \
--sample_length=80000 \
--batch_size=256 \
--log=DEBUG \
--gpu_number=0

(You'll have to edit the paths to match your own.) On my 1080 Ti I'm generating around 100 samples per second, while on the CPU it was taking a LOT longer. While it's running you can take a look at htop and nvtop to see what's actually being used.

BTW, I also ran into the float error in np.linspace, and just casting the third argument is fine. As long as the second one isn't cast it shouldn't break anything (I believe it needs to stay a float to prevent some off-by-one errors related to audio file placement in multigrids).

JCBrouwer commented 4 years ago

OK, scratch the part about the long files. The run I left on overnight is still generating; for some reason it isn't stopping after it has generated sample_length samples. The first 5 seconds (i.e. 80000 samples) of audio sound correct, but the files are still 43 minutes long.

I can take a better look through generate.py over the next couple days, but I feel like this might be something in nsynth_generate.

Do embeddings of length 156 sound about right for 5 seconds of audio, @jesseengel?

MarkTension commented 4 years ago

@JCBrouwer Thank you. I'll make sure to fix my GPU problem first, and for the time being I'll manually stop the generation once it reaches the 4-second sample length I need.
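
If stopping early still leaves over-long files on disk, something like this should trim them back to the intended length (paths are hypothetical; 64000 is the final_length from my settings, i.e. 4 s at 16 kHz):

```python
import glob
import wave

TARGET_SAMPLES = 64000  # final_length from the settings above (4 s at 16 kHz)

# Hypothetical location; point this at wherever the batch0 audio ended up.
for path in glob.glob("working_dir/audio/batch0/*.wav"):
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(min(src.getnframes(), TARGET_SAMPLES))
    with wave.open(path.replace(".wav", "_trimmed.wav"), "wb") as dst:
        dst.setparams(params)  # nframes is corrected automatically on close
        dst.writeframes(frames)
```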

jesseengel commented 4 years ago

Thanks for following up on this @JCBrouwer. If I recall correctly the embeddings are computed every 32 ms, so 156 sounds about right, I think.
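
A quick back-of-the-envelope check (the 512-sample hop is just 32 ms at the 16 kHz NSynth sample rate):

```python
SAMPLE_RATE = 16000             # NSynth audio sample rate
HOP = SAMPLE_RATE * 32 // 1000  # 32 ms per embedding frame -> 512 samples

sample_length = 80000           # 5 s, as passed to nsynth_generate above
print(sample_length / HOP)      # 156.25 -> an embedding length of 156 checks out
```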