rncm-prism / prism-samplernn

Neural sound synthesis with TensorFlow 2
MIT License

python interpreter state is not initialized #18

Closed: abstractMetaContext closed this issue 2 years ago

abstractMetaContext commented 3 years ago

I am getting this error after generating the sound example for the current epoch:

W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]

I am having difficulty troubleshooting this myself.

relativeflux commented 3 years ago

Hi there, what branch was this on? That looks like it might be caused by a bug in the fix/generated branch - I have a fix for that which I haven't published yet but I can do so today.

Although quickly googling it suggests it's TensorFlow-related...

abstractMetaContext commented 3 years ago

I believe it is on the master branch.

abstractMetaContext commented 3 years ago

Yes, I am led to believe it is indeed TensorFlow/Keras related. However, some fixes were mentioned that are beyond my understanding of the codebase.

relativeflux commented 3 years ago

OK, are you running this on Colab? If not, what is your TF version, and what GPU?

relativeflux commented 3 years ago

Might be GPU RAM related.

abstractMetaContext commented 3 years ago

No, I am running this on my own rig. The GPU is a GeForce RTX 2060, and TF is the one your script installs into Conda.

abstractMetaContext commented 3 years ago

Not 100% sure about TF, since I was fiddling about to get CUDA working.

relativeflux commented 3 years ago

OK, it should just install the latest TF, which is 2.5 I believe. What is your batch size, and also the setting for dim in the config JSON file? And also the number of RNN layers?

abstractMetaContext commented 3 years ago

Batch size is 64. dim is 1064. RNN layers is 4.

abstractMetaContext commented 3 years ago

Also, just to be 100% clear: the error occurs when generating the sound example during a run of the train.py script.

relativeflux commented 3 years ago

Did you mean 1024 or 1064 for dimensionality? I'd certainly stick with 1024.

abstractMetaContext commented 3 years ago

1024 - typo, my bad.

abstractMetaContext commented 3 years ago

I have it generating a 5-second sound after every epoch. That is the only param I have changed from your defaults. I don't know if that is relevant?

relativeflux commented 3 years ago

OK, thanks for clarifying. What is the sample rate of the dataset, and what do you have as the sample rate for generated output? I doubt sample rate setting would be the cause, though.

relativeflux commented 3 years ago

You could try running it again with the following lines at the top of the train.py script commented out:

os.environ["KMP_AFFINITY"] = "noverbose" <==
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' <==

import tensorflow as tf
tf.get_logger().setLevel('ERROR') <==
tf.autograph.set_verbosity(3) <==

Comment out the lines marked with the arrows. That code switches off TensorFlow logging and errors (although not more general Python errors - the scripts need better logging), so commenting it out should let more of the output through. Perhaps that will tell us something.
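
For clarity, this is roughly how the top of the script would look afterwards - a minimal sketch only, assuming nothing else at the top of train.py changes:

import os

# os.environ["KMP_AFFINITY"] = "noverbose"
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
# tf.get_logger().setLevel('ERROR')
# tf.autograph.set_verbosity(3)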

abstractMetaContext commented 3 years ago

I already did this:

Epoch: 65/100, Total Steps: 375, Loss: 3.075, Accuracy: 23.603, Val Loss: 6.028, Val Accuracy: 7.534 (5 min 45.650 sec)
2021-06-16 16:23:33.807538: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]
2021-06-16 16:23:34.382642: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]

That is the final traceback.

2021-06-16 15:33:12.648294: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-06-16 15:33:15.999735: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-06-16 15:33:16.028416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:02:00.0 name: NVIDIA GeForce RTX 2060 computeCapability: 7.5 coreClock: 1.755GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
2021-06-16 15:33:16.028573: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-06-16 15:33:16.039857: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-06-16 15:33:16.039984: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-06-16 15:33:16.043212: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-06-16 15:33:16.044589: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-06-16 15:33:16.047156: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021-06-16 15:33:16.050340: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021-06-16 15:33:16.051338: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-06-16 15:33:16.052552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-16 15:33:16.053117: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-16 15:33:16.053513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:02:00.0 name: NVIDIA GeForce RTX 2060 computeCapability: 7.5 coreClock: 1.755GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
2021-06-16 15:33:16.053585: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-16 15:33:16.514842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-16 15:33:16.514977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-16 15:33:16.515735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-06-16 15:33:16.515951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3961 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:02:00.0, compute capability: 7.5)
2021-06-16 15:33:16.813201: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-06-16 15:33:16.813330: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-06-16 15:33:16.814033: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1611] Profiler found 1 GPUs
2021-06-16 15:33:16.814927: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti64_112.dll'; dlerror: cupti64_112.dll not found
2021-06-16 15:33:16.815736: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti.dll'; dlerror: cupti.dll not found
2021-06-16 15:33:16.815851: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1661] function cuptiinterface->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-16 15:33:16.815971: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-06-16 15:33:16.816054: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1752] function cuptiinterface->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-16 15:33:16.825143: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-06-16 15:33:17.384904: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-06-16 15:33:18.237943: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-06-16 15:33:18.737955: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8201
2021-06-16 15:33:27.860653: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Corpus length: 192 files.

Those are the library loader messages from the beginning of the run.

abstractMetaContext commented 3 years ago

Some more logs:

2021-06-16 15:33:43.059888: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-06-16 15:33:43.060008: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-06-16 15:33:43.060700: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1661] function cuptiinterface->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
Epoch: 57/100, Step: 2/375, Loss: 2.472, Accuracy: 35.133, (0.748 sec/step)
2021-06-16 15:33:43.810836: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2021-06-16 15:33:43.811027: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1752] function cuptiinterface->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-16 15:33:43.818109: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673] GpuTracer has collected 0 callback api events and 0 activity events.
2021-06-16 15:33:43.820746: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-06-16 15:33:43.825934: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43
2021-06-16 15:33:43.828514: I tensorflow/core/profiler/rpc/client/save_profile.cc:143] Dumped gzipped tool data for trace.json.gz to logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43\JORMUNDGANDR.trace.json.gz
2021-06-16 15:33:43.864496: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43
2021-06-16 15:33:43.871302: I tensorflow/core/profiler/rpc/client/save_profile.cc:143] Dumped gzipped tool data for memory_profile.json.gz to logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43\JORMUNDGANDR.memory_profile.json.gz
2021-06-16 15:33:43.881919: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43
Dumped tool data for xplane.pb to logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43\JORMUNDGANDR.xplane.pb
Dumped tool data for overview_page.pb to logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43\JORMUNDGANDR.overview_page.pb
Dumped tool data for input_pipeline.pb to logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43\JORMUNDGANDR.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43\JORMUNDGANDR.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logdir\t5/16.06.2021_15.33.15\train\plugins\profile\2021_06_16_13_33_43\JORMUNDGANDR.kernel_stats.pb

abstractMetaContext commented 3 years ago

"OK, thanks for clarifying. What is the sample rate of the dataset, and what do you have as the sample rate for generated output? I doubt sample rate setting would be the cause, though."

16000 and 16000.

relativeflux commented 3 years ago

Thanks. OK, those logs look like what I'd expect.

abstractMetaContext commented 3 years ago

My understanding is that it's a TensorFlow or Keras memory leak, at least from reading about the generator_dataset_op.cc:103 error. However, we are getting a generator_dataset_op.cc:107 error, so I am not completely sure. Some fixes mentioned elsewhere on the net involve setting "workers" to 1 and downgrading TF to an earlier version - not sure if you can make sense of that.

One last thing: the audio is generated and plays, so that is nice. It's just annoying to have to keep manually retriggering the training, especially with an RNN that takes so long to condition.

abstractMetaContext commented 3 years ago

Oh, I just got the error when just training, NOT when generating.

Epoch: 65/100, Total Steps: 375, Loss: 3.075, Accuracy: 23.603, Val Loss: 6.028, Val Accuracy: 7.534 (5 min 45.650 sec)
2021-06-16 16:23:33.807538: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]
2021-06-16 16:23:34.382642: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]

The error appeared after 9 epochs.

abstractMetaContext commented 3 years ago

The error happens reliably at step 375/375, so it is when finalizing the epoch that it gets into trouble.

relativeflux commented 3 years ago

My understanding is that it's a TensorFlow or Keras memory leak, at least from reading about the generator_dataset_op.cc:103 error. However, we are getting a generator_dataset_op.cc:107 error, so I am not completely sure. Some fixes mentioned elsewhere on the net involve setting "workers" to 1 and downgrading TF to an earlier version - not sure if you can make sense of that.

Yes, it looks like that's the case. I'll look into it, and I'll run it with the latest TF and the nightly build later.
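
For reference, the "workers" fix people mention corresponds to the multiprocessing arguments of Keras Model.fit. The sketch below is purely illustrative - I'm not saying train.py currently passes these, and the variable names are placeholders:

model.fit(
    training_data,              # placeholder for whatever the script feeds to fit()
    epochs=num_epochs,          # placeholder
    workers=1,                  # single data-loading worker, as suggested
    use_multiprocessing=False   # keep the generator in the main interpreter
)

Those two arguments only affect generator/Sequence inputs, so whether they apply here depends on how the input pipeline is wired up.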

One last thing: the audio is generated and plays, so that is nice. It's just annoying to have to keep manually retriggering the training, especially with an RNN that takes so long to condition.

What I tend to do is actually switch in-training generation off, and watch TensorBoard. Then, when I see a nice loss/val_loss I'll stop training and generate some material manually, using the generate.py script. You can always resume training from the same place, perhaps with an adjusted learning rate.

By the way, have you tried the generate.py script to see if you get the error with that?

relativeflux commented 3 years ago

Also, your val loss and val accuracy are not great; I generally get better results with the following params:

{
    "seq_len": 512,
    "frame_sizes": [2,8],
    "dim": 1024,
    "rnn_type": "gru",
    "num_rnn_layers":  1,
    "q_type": "mu-law",
    "q_levels": 256,
    "emb_size": 256
}

The next release will have those params as defaults, along with a temperature of 0.95 (the current default is 0.75, which I think is too low).

abstractMetaContext commented 3 years ago

What I tend to do is actually switch in-training generation off, and watch TensorBoard

Could I tempt you to write up a small guide on how to use TensorBoard with your repository? That would be an immense help!

By the way, have you tried the generate.py script to see if you get the error with that?

I tried starting a training session earlier with checkpoints every 10 epochs, and I got the error at step 375/375 in the 9th epoch. So I fear the error is in the training, not in the generation.

Also, your val loss and val accuracy are not great; I generally get better results with the following params

Will try this, thanks. And thanks for maintaining the repository and being helpful. Much appreciated!

abstractMetaContext commented 3 years ago

Started from scratch with your new model JSON params. It's training faster now; I will let you know if/when problems arise.

relativeflux commented 3 years ago

Could I tempt you to write up a small guide on how to use TensorBoard with your repository? That would be an immense help!

Sure, no problem, although it's very easy - just run TensorBoard with:

tensorboard --logdir path/to/training/logdir/timestamp_dir

where path/to/training/logdir/timestamp_dir points to the timestamped directory containing the checkpoints and the train and validation directories (these latter directories are created by tensorboard). The name of the directory will be a timestamp of the form:

26.07.2020_20.48.51

You can also just point tensorboard to the supplied logdir directory at the top level of the repo. You'll then see all your log directories (from all trainings, for all projects) listed in tensorboard, which can be helpful.
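
For example, launched from the root of the repo (assuming the default logdir location hasn't been changed):

tensorboard --logdir ./logdir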

I tried starting a training session earlier with checkpoints every 10 epochs, and I got the error at step 375/375 in the 9th epoch. So I fear the error is in the training, not in the generation.

OK, it looks like that's the case - I'll investigate more tomorrow. Will update here.

And thanks for maintaining the repository and being helpful. Much appreciated!

Thanks, not a problem at all.

abstractMetaContext commented 3 years ago

I have been running the training with checkpoints every 5 epochs and it reached 30. Now it keeps crashing before reaching 35, so changing the model params did not help with the error. And it really does smell like a memory leak.

relativeflux commented 3 years ago

Thanks for the update. A few additional questions, if that's OK. What platform are you on (Windows, Linux)? Also, could you tell me the version of TensorFlow you have? You can find out by starting Python, then executing:

import tensorflow as tf
tf.__version__

That should print the version you have. I assume you are using Anaconda, or an equivalent virtual environment?
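
Alternatively, a one-liner from the Anaconda prompt (with the environment activated) does the same thing:

python -c "import tensorflow as tf; print(tf.__version__)"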

abstractMetaContext commented 3 years ago

Yeah, sure, no problem. I am on Windows and Anaconda.

Running tf.version gets me this:

2021-06-18 21:46:56.011229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll

And nothing else

abstractMetaContext commented 3 years ago

Oh, my bad, I just figured it out. print(tf.__version__) nets me:

2.5.0

relativeflux commented 3 years ago

Sorry, I haven't had a chance to test this with 2.5.0 yet; both my machines are running production jobs that I can't interrupt. I'll have a chance early next week.

I'd suggest downgrading to TensorFlow 2.4, but I don't think that's available through Anaconda any more. BTW, did you try running on the Colab notebook? That's also TF 2.5.0 (I don't think you can downgrade to a minor version there). But if your training setup works there, that would give us a clue. It's working fine for me, so that would seem to indicate it's not a TF 2.5 issue.
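
If you did want to try the downgrade inside the conda environment, pip should still be able to fetch it - something like the line below, with 2.4.1 just as an example pin. Bear in mind that TF 2.4 expects the CUDA 11.0/cuDNN 8.0 stack rather than the 11.2 stack used by 2.5, so your CUDA libraries would need to match too.

pip install tensorflow==2.4.1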

Just a thought: did you use the chunking script to create your dataset? I know that when someone used Audacity to create the chunks, some were not of the same duration as the others and the training would randomly choke, although I don't think it was that message.