Open eviluess opened 5 years ago
It looks like you did not set the config correctly when following the steps for SequenceExample generation. That could be causing the shape mismatch it is complaining about. Could you paste the command you used to generate the SequenceExamples?
Thanks for your quick reply.
The config is lookback_rnn.
For the detail of what I have done, I'll show you all my steps (from the very beginning to error):
BUNDLE_PATH=/eviluess/res/lookback_rnn.mag
CONFIG=lookback_rnn
melody_rnn_generate \
  --config=${CONFIG} \
  --bundle_file=${BUNDLE_PATH} \
  --output_dir=/tmp/melody_rnn/generated \
  --num_outputs=10 \
  --num_steps=128 \
  --primer_melody="[60]"
And it's fine and shows: Wrote 10 MIDI files to /tmp/melody_rnn/generated
convert_dir_to_note_sequences \
  --input_dir=$INPUT_DIRECTORY \
  --output_file=$SEQUENCES_TFRECORD \
  --recursive
And it ran fine. The last line of the result is:
INFO:tensorflow:Converted MIDI file /eviluess/res/lmd_matched/A/A/S/TRAASKZ128F9308820/886270e42b69e0983dfd9a591e66f214.mid.
I1102 17:43:05.389676 140640958502656 convert_dir_to_note_sequences.py:133] Converted MIDI file /eviluess/res/lmd_matched/A/A/S/TRAASKZ128F9308820/886270e42b69e0983dfd9a591e66f214.mid.
2.3. Create SequenceExamples (the step you are concerned about):
melody_rnn_create_dataset \
  --config=lookback_rnn \
  --input=/tmp/notesequences.tfrecord \
  --output_dir=/tmp/melody_rnn/sequence_examples \
  --eval_ratio=0.10
It succeeded with the following results:
INFO:tensorflow:DAGPipeline_TranspositionPipeline_training_transpositions_generated: 780
I1102 17:48:18.414772 139742925997824 statistics.py:141] DAGPipeline_TranspositionPipeline_training_transpositions_generated: 780
I checked the directory, and the files are indeed there.
2.4. Train and Evaluate the Model
I ran the command directly:
melody_rnn_train \
  --config=attention_rnn \
  --run_dir=/tmp/melody_rnn/logdir/run1 \
  --sequence_example_file=/tmp/melody_rnn/sequence_examples/training_melodies.tfrecord \
  --hparams="batch_size=64,rnn_layer_sizes=[64,64]" \
  --num_training_steps=20000
And I got the same errors as in the original post.
You cannot use lookback_rnn for SequenceExample generation and attention_rnn for training. Both steps need the same config, because the input sizes of the two configurations are different. Right now, you are generating input for lookback_rnn and trying to feed it into attention_rnn. Change one of them, whichever you prefer, and it should work.
For example, the train command could be changed to:
melody_rnn_train \
  --config=lookback_rnn \
  --run_dir=/tmp/melody_rnn/logdir/run1 \
  --sequence_example_file=/tmp/melody_rnn/sequence_examples/training_melodies.tfrecord \
  --hparams="batch_size=64,rnn_layer_sizes=[64,64]" \
  --num_training_steps=20000
This blog has details on why sizes of both would differ: https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn
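The mismatch described above can be caught before training starts by comparing the `--config` flag of the two commands. Here is a minimal sketch, assuming both commands are available as strings; `config_of` and `check_same_config` are hypothetical helper names for illustration and are not part of Magenta.

```python
import re


def config_of(command):
    """Return the value passed to --config in a command string, or None."""
    match = re.search(r"--config=(\S+)", command)
    return match.group(1) if match else None


def check_same_config(create_cmd, train_cmd):
    """Return (ok, create_config, train_config) for the two commands."""
    create_config = config_of(create_cmd)
    train_config = config_of(train_cmd)
    ok = create_config is not None and create_config == train_config
    return ok, create_config, train_config


# Commands from this thread, shortened to the flags that matter here.
create_cmd = "melody_rnn_create_dataset --config=lookback_rnn --eval_ratio=0.10"
train_cmd = "melody_rnn_train --config=attention_rnn --num_training_steps=20000"

ok, created_with, trained_with = check_same_config(create_cmd, train_cmd)
if not ok:
    print("Config mismatch: dataset=%s, training=%s" % (created_with, trained_with))
```

With the commands from this thread the check reports a mismatch; switching the training command to `--config=lookback_rnn` makes it pass.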
Hey, sorry for my late response. I don't have access to the machine that has the environment to run the code.
I'll try your solution asap.
Thank you very much.
Now I have a new issue that seems to be related to the hardware configuration. Could you please give me some advice on how to fix it? Thanks!
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
return self._sess_creator.create_session()
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
init_fn=self._scaffold.init_fn)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
config=config)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
sess = session.Session(self._target, graph=self._graph, config=config)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 699, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (3). Valid range is [0, 2].
while setting up XLA_GPU_JIT device number 3
More logs before the previous post:
2019-11-05 20:25:52.957209: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/anaconda3/lib/:
2019-11-05 20:25:52.960330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-05 20:25:52.960357: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2019-11-05 20:25:52.960723: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-11-05 20:25:52.969081: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2019-11-05 20:25:52.971085: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c1e97be5f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-11-05 20:25:52.971110: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2019-11-05 20:25:52.974750: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2019-11-05 20:25:52.975185: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2019-11-05 20:25:53.142180: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:7: failed initializing StreamExecutor for CUDA device ordinal 7: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2019-11-05 20:25:53.154143: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2019-11-05 20:25:53.163928: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:6: failed initializing StreamExecutor for CUDA device ordinal 6: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2019-11-05 20:25:54.909159: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c1e93af0c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2019-11-05 20:25:54.909198: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-11-05 20:25:54.909207: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-11-05 20:25:54.909215: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla V100-PCIE-16GB, Compute Capability 7.0
Traceback (most recent call last):
File "/root/anaconda3/bin/melody_rnn_train", line 11, in
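The warnings above show that CUDA ordinals 3 to 7 could not be initialized, while devices 0 to 2 came up as Tesla V100s. A small sketch for pulling the failing ordinals out of such a log is below. The helper names are hypothetical, and using the result to restrict the visible GPUs through the `CUDA_VISIBLE_DEVICES` environment variable is an assumption about a possible workaround, not something confirmed in this thread.

```python
import re

# Matches the platform_util.cc warnings quoted in the log above.
FAILED_RE = re.compile(r"unable to create StreamExecutor for CUDA:(\d+)")


def failed_cuda_ordinals(log_text):
    """Return the sorted, de-duplicated CUDA ordinals reported as failing."""
    return sorted({int(n) for n in FAILED_RE.findall(log_text)})


def visible_devices_value(ordinals):
    """Format device ordinals as a CUDA_VISIBLE_DEVICES value such as "0,1,2"."""
    return ",".join(str(n) for n in ordinals)


# Shortened copies of the warning lines from the log above, in log order.
sample_log = "\n".join([
    "W platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed",
    "W platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed",
    "W platform_util.cc:256] unable to create StreamExecutor for CUDA:7: failed",
    "W platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed",
    "W platform_util.cc:256] unable to create StreamExecutor for CUDA:6: failed",
])

print(failed_cuda_ordinals(sample_log))  # [3, 4, 5, 6, 7]

# Assumption: the valid range [0, 2] in the error message maps to these GPUs.
print(visible_devices_value(range(3)))  # 0,1,2
```

One could then try launching training with `CUDA_VISIBLE_DEVICES=0,1,2` set in the environment, though whether that resolves the `Invalid device ordinal` error on this machine is untested here.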
This looks like it could be a CUDA version mismatch, or CUDA not being added to the path correctly.
Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
Certainly points to that.
What's the output of tf.test.is_gpu_available()?
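To answer this, the call can be run from a Python shell in the same environment that runs `melody_rnn_train`. A minimal sketch, assuming TensorFlow is importable there; the wrapper `gpu_status` is a hypothetical name added for illustration, and it returns None when TensorFlow is not installed.

```python
def gpu_status():
    """Return tf.test.is_gpu_available(), or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.test.is_gpu_available()


print(gpu_status())
```

A result of False alongside the "Skipping registering GPU devices..." warning would be consistent with the missing CUDA libraries mentioned in the log.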
Hello,
I followed the instructions up to the step "Training my own" and hit the error below. Please help me solve it. Thanks!
https://github.com/tensorflow/magenta/tree/master/magenta/models/melody_rnn
melody_rnn_train \
  --config=attention_rnn \
  --run_dir=/tmp/melody_rnn/logdir/run1 \
  --sequence_example_file=/tmp/melody_rnn/sequence_examples/training_melodies.tfrecord \
  --hparams="batch_size=64,rnn_layer_sizes=[64,64]" \
  --num_training_steps=20000
ERROR:
Traceback (most recent call last):
File "/root/anaconda3/bin/melody_rnn_train", line 11, in
load_entry_point('magenta', 'console_scripts', 'melody_rnn_train')()
File "/eviluess/magenta/magenta/models/melody_rnn/melody_rnn_train.py", line 108, in console_entry_point
tf.app.run(main)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/root/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/root/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/eviluess/magenta/magenta/models/melody_rnn/melody_rnn_train.py", line 104, in main
checkpoints_to_keep=FLAGS.num_checkpoints)
File "/eviluess/magenta/magenta/models/shared/events_rnn_train.py", line 84, in run_training
is_chief=task == 0)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/contrib/training/python/training/training.py", line 549, in train
loss = session.run(train_op, run_metadata=run_metadata)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 861, in __exit__
self._close_internal(exception_type)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 899, in _close_internal
self._sess.close()
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1166, in close
self._sess.close()
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1334, in close
ignore_live_threads=True)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/root/anaconda3/lib/python3.7/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257, in _run
enqueue_callable()
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/root/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Name: , Key: inputs, Index: 0. Number of float values != expected. values size: 38 but output shape: [74]
[[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
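The last line of the traceback carries the two numbers that matter. On a plain reading of the message, 38 is the size of each input vector stored in the SequenceExample file and 74 is the input size the training graph expects. A minimal sketch for extracting them from such a message follows; `parse_size_mismatch` is a hypothetical helper, not a TensorFlow or Magenta function.

```python
import re

# Pattern for the size report at the end of the InvalidArgumentError above.
SIZE_RE = re.compile(r"values size: (\d+) but output shape: \[(\d+)\]")


def parse_size_mismatch(message):
    """Return (found_size, expected_size) from the error text, or None."""
    match = SIZE_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


error = ("InvalidArgumentError: Name: , Key: inputs, Index: 0. "
         "Number of float values != expected. "
         "values size: 38 but output shape: [74]")

print(parse_size_mismatch(error))  # (38, 74)
```

Two different numbers here suggest that the dataset and the training run were built with different configs.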