magenta / magenta

Magenta: Music and Art Generation with Machine Intelligence
Apache License 2.0

Cannot go through the guide #1612

Open eviluess opened 5 years ago

eviluess commented 5 years ago

Hello,

I followed the instructions up to the step "Training my own" and hit the error below. Please help me solve it. Thanks!

https://github.com/tensorflow/magenta/tree/master/magenta/models/melody_rnn

melody_rnn_train \
--config=attention_rnn \
--run_dir=/tmp/melody_rnn/logdir/run1 \
--sequence_example_file=/tmp/melody_rnn/sequence_examples/training_melodies.tfrecord \
--hparams="batch_size=64,rnn_layer_sizes=[64,64]" \
--num_training_steps=20000

ERROR:

Traceback (most recent call last):
  File "/root/anaconda3/bin/melody_rnn_train", line 11, in <module>
    load_entry_point('magenta', 'console_scripts', 'melody_rnn_train')()
  File "/eviluess/magenta/magenta/models/melody_rnn/melody_rnn_train.py", line 108, in console_entry_point
    tf.app.run(main)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/root/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/root/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/eviluess/magenta/magenta/models/melody_rnn/melody_rnn_train.py", line 104, in main
    checkpoints_to_keep=FLAGS.num_checkpoints)
  File "/eviluess/magenta/magenta/models/shared/events_rnn_train.py", line 84, in run_training
    is_chief=task == 0)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/contrib/training/python/training/training.py", line 549, in train
    loss = session.run(train_op, run_metadata=run_metadata)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 861, in __exit__
    self._close_internal(exception_type)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 899, in _close_internal
    self._sess.close()
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1166, in close
    self._sess.close()
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1334, in close
    ignore_live_threads=True)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/root/anaconda3/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257, in _run
    enqueue_callable()
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Name: , Key: inputs, Index: 0. Number of float values != expected. values size: 38 but output shape: [74]
  [[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
(base) ➜ ~

falaktheoptimist commented 5 years ago

It looks like you did not set the config correctly when generating the SequenceExamples. That could be causing the shape mismatch it is complaining about. Could you paste the command you used to generate the SequenceExamples?
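To make the shape complaint concrete, here is a minimal sketch that pulls the two sizes out of the error text posted above. The helper name `parse_size_mismatch` is hypothetical and not part of Magenta or TensorFlow; it only reads the message string.

```python
import re

# Error text copied from the traceback earlier in this thread.
ERROR_TEXT = (
    "Name: , Key: inputs, Index: 0. Number of float values != expected. "
    "values size: 38 but output shape: [74]"
)


def parse_size_mismatch(message):
    """Return (key, found, expected) from a feature-size error, or None.

    Hypothetical helper for illustration; it only parses the message text.
    """
    pattern = re.compile(
        r"Key:\s*(?P<key>\w+).*?"
        r"values size:\s*(?P<found>\d+)\s+but output shape:\s*\[(?P<expected>\d+)\]",
        re.DOTALL,
    )
    match = pattern.search(message)
    if match is None:
        return None
    return (
        match.group("key"),
        int(match.group("found")),
        int(match.group("expected")),
    )


result = parse_size_mismatch(ERROR_TEXT)
if result is not None:
    key, found, expected = result
    print(f"Feature '{key}': record holds {found} values, model expects {expected}.")
```

The first number is what the stored SequenceExamples contain and the second is what the training graph was built to read, so the two were produced under different settings.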

eviluess commented 5 years ago

Thanks for your quick reply.

The config is lookback_rnn.

To show exactly what I did, here are all my steps (from the very beginning to the error):

  1. I tested the environment using "Generate a melody":

BUNDLE_PATH=/eviluess/res/lookback_rnn.mag
CONFIG=lookback_rnn

melody_rnn_generate \
--config=${CONFIG} \
--bundle_file=${BUNDLE_PATH} \
--output_dir=/tmp/melody_rnn/generated \
--num_outputs=10 \
--num_steps=128 \
--primer_melody="[60]"

And it's fine and shows: Wrote 10 MIDI files to /tmp/melody_rnn/generated

  2. To build a dataset, I followed "Building your Dataset" and did the following:

2.1. I downloaded the dataset from "Building your Dataset" and unpacked it with tar -xf; no errors occurred.

2.2. Convert a subset of it to NoteSequences:

INPUT_DIRECTORY=/eviluess/res/lmd_matched/A/A
SEQUENCES_TFRECORD=/tmp/notesequences.tfrecord

convert_dir_to_note_sequences \
--input_dir=$INPUT_DIRECTORY \
--output_file=$SEQUENCES_TFRECORD \
--recursive

And it ran fine. The last lines of the output are:

INFO:tensorflow:Converted MIDI file /eviluess/res/lmd_matched/A/A/S/TRAASKZ128F9308820/886270e42b69e0983dfd9a591e66f214.mid.
I1102 17:43:05.389676 140640958502656 convert_dir_to_note_sequences.py:133] Converted MIDI file /eviluess/res/lmd_matched/A/A/S/TRAASKZ128F9308820/886270e42b69e0983dfd9a591e66f214.mid.

2.3. Create SequenceExamples (the step you asked about):

melody_rnn_create_dataset \
--config=lookback_rnn \
--input=/tmp/notesequences.tfrecord \
--output_dir=/tmp/melody_rnn/sequence_examples \
--eval_ratio=0.10

It succeeded with the following results:

INFO:tensorflow:DAGPipeline_TranspositionPipeline_training_transpositions_generated: 780
I1102 17:48:18.414772 139742925997824 statistics.py:141] DAGPipeline_TranspositionPipeline_training_transpositions_generated: 780

I checked the directory, and the files are indeed there.

2.4. Train and Evaluate the Model. I ran the command directly:

melody_rnn_train \
--config=attention_rnn \
--run_dir=/tmp/melody_rnn/logdir/run1 \
--sequence_example_file=/tmp/melody_rnn/sequence_examples/training_melodies.tfrecord \
--hparams="batch_size=64,rnn_layer_sizes=[64,64]" \
--num_training_steps=20000

And I got the error shown in my first post.

falaktheoptimist commented 5 years ago

You cannot use lookback_rnn for SequenceExample generation and attention_rnn for training. Both steps need the same config, because the input sizes of the two configurations differ. Right now, you're generating input for lookback_rnn and trying to pass it into attention_rnn. Change one of them to match the other and it should work.

For example, the train command could be changed to:

melody_rnn_train \
--config=lookback_rnn \
--run_dir=/tmp/melody_rnn/logdir/run1 \
--sequence_example_file=/tmp/melody_rnn/sequence_examples/training_melodies.tfrecord \
--hparams="batch_size=64,rnn_layer_sizes=[64,64]" \
--num_training_steps=20000
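As a rough sketch of the rule above, the two command lines can be compared before launching a long training run. The helpers `config_flag` and `configs_match` are hypothetical names for this illustration and are not Magenta APIs; the commands are the ones posted in this thread, joined onto one line.

```python
import shlex

# Commands from this thread, written on one line each.
CREATE_CMD = (
    "melody_rnn_create_dataset --config=lookback_rnn "
    "--input=/tmp/notesequences.tfrecord "
    "--output_dir=/tmp/melody_rnn/sequence_examples --eval_ratio=0.10"
)
TRAIN_CMD = (
    "melody_rnn_train --config=attention_rnn "
    "--run_dir=/tmp/melody_rnn/logdir/run1 "
    "--sequence_example_file=/tmp/melody_rnn/sequence_examples/training_melodies.tfrecord "
    '--hparams="batch_size=64,rnn_layer_sizes=[64,64]" '
    "--num_training_steps=20000"
)


def config_flag(command):
    """Return the value of --config=... in a command line, or None if absent."""
    for token in shlex.split(command):
        if token.startswith("--config="):
            return token.split("=", 1)[1]
    return None


def configs_match(create_cmd, train_cmd):
    """True only when both commands name the same, non-missing config."""
    create_config = config_flag(create_cmd)
    return create_config is not None and create_config == config_flag(train_cmd)


if not configs_match(CREATE_CMD, TRAIN_CMD):
    print(
        "Config mismatch: dataset built with "
        f"{config_flag(CREATE_CMD)!r}, training uses {config_flag(TRAIN_CMD)!r}."
    )
```

With the original commands this reports the lookback_rnn / attention_rnn mismatch; after switching the train command to lookback_rnn the check passes.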

falaktheoptimist commented 5 years ago

This blog post explains why the input sizes of the two configs differ: https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn

eviluess commented 5 years ago

Hey, sorry for my late response. I don't currently have access to the machine with the environment needed to run the code.

I'll try your solution asap.

Thank you very much.

eviluess commented 5 years ago

Now I have a new issue that seems to be related to the hardware configuration. Could you please give me some advice on how to fix it? Thanks!

  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (3). Valid range is [0, 2].
  while setting up XLA_GPU_JIT device number 3

eviluess commented 5 years ago

More logs before the previous post:

2019-11-05 20:25:52.957209: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/anaconda3/lib/:
2019-11-05 20:25:52.960330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-05 20:25:52.960357: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2019-11-05 20:25:52.960723: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-11-05 20:25:52.969081: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2019-11-05 20:25:52.971085: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c1e97be5f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-11-05 20:25:52.971110: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2019-11-05 20:25:52.974750: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2019-11-05 20:25:52.975185: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2019-11-05 20:25:53.142180: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:7: failed initializing StreamExecutor for CUDA device ordinal 7: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2019-11-05 20:25:53.154143: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2019-11-05 20:25:53.163928: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:6: failed initializing StreamExecutor for CUDA device ordinal 6: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2019-11-05 20:25:54.909159: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c1e93af0c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2019-11-05 20:25:54.909198: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-11-05 20:25:54.909207: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-11-05 20:25:54.909215: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla V100-PCIE-16GB, Compute Capability 7.0
Traceback (most recent call last):
  File "/root/anaconda3/bin/melody_rnn_train", line 11, in <module>
    load_entry_point('magenta', 'console_scripts', 'melody_rnn_train')()
  File "/eviluess/magenta/magenta/models/melody_rnn/melody_rnn_train.py", line 108, in console_entry_point
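A log like this is easier to read once it is summarized. The sketch below is only an illustration: `summarize_log` is a hypothetical helper, and the excerpt is the log above with timestamps and source paths trimmed. It collects the libraries that failed to load, the CUDA ordinals that could not be initialized, and the GPUs that did come up.

```python
import re

# Lines from the log above, with timestamps and source paths trimmed.
LOG_EXCERPT = """\
Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/anaconda3/lib/:
Successfully opened dynamic library libcudnn.so.7
StreamExecutor device (0): Host, Default Version
unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3
unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4
unable to create StreamExecutor for CUDA:7: failed initializing StreamExecutor for CUDA device ordinal 7
unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5
unable to create StreamExecutor for CUDA:6: failed initializing StreamExecutor for CUDA device ordinal 6
StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
StreamExecutor device (2): Tesla V100-PCIE-16GB, Compute Capability 7.0
"""


def summarize_log(text):
    """Collect missing libraries, failed CUDA ordinals, and initialized GPUs."""
    missing = re.findall(r"Could not load dynamic library '([^']+)'", text)
    failed = sorted(
        {int(n) for n in re.findall(r"unable to create StreamExecutor for CUDA:(\d+)", text)}
    )
    # Only GPU entries report a compute capability, so the Host entry is skipped.
    gpus = [
        (int(index), name)
        for index, name in re.findall(
            r"StreamExecutor device \((\d+)\): (.+?), Compute Capability", text
        )
    ]
    return {"missing_libraries": missing, "failed_ordinals": failed, "gpu_devices": gpus}


summary = summarize_log(LOG_EXCERPT)
for field, value in summary.items():
    print(f"{field}: {value}")
```

Three devices initialize and five ordinals fail, which lines up with the "Valid range is [0, 2]" part of the exception. One thing that might be worth trying, although nobody in this thread confirms it, is restricting the process to the working devices with the CUDA_VISIBLE_DEVICES environment variable.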

falaktheoptimist commented 5 years ago

This looks like a possible CUDA version mismatch, or CUDA not being added to the path correctly.

Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...

Certainly points to that. What's the output of tf.test.is_gpu_available()?
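For anyone following along, here is a minimal sketch of how those two checks could be run. It assumes a Linux-style LD_LIBRARY_PATH; `find_library` and `gpu_available` are hypothetical helper names, and only `tf.test.is_gpu_available()` comes from the thread. The TensorFlow call is guarded so the script still runs where TensorFlow is not installed.

```python
import os
from pathlib import Path


def find_library(name, search_path=None):
    """Return the directories in an LD_LIBRARY_PATH-style string that contain `name`."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        # Empty entries appear when the variable has a trailing separator.
        if entry and (Path(entry) / name).exists():
            hits.append(entry)
    return hits


def gpu_available():
    """Return tf.test.is_gpu_available(), or None if the check cannot be run."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    check = getattr(tf.test, "is_gpu_available", None)
    return check() if check is not None else None


if __name__ == "__main__":
    # Library name taken from the "Could not load dynamic library" warning above.
    print("libcusparse.so.10.0 found in:", find_library("libcusparse.so.10.0"))
    print("GPU available:", gpu_available())
```

An empty list from the first check would agree with the warning in the log, since the only directory listed there is /root/anaconda3/lib/.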