Hm, this is strange.
First, can you check that it works on cpu:
$ bazel clean
$ bazel build -c opt lingvo:trainer (note: no --config=cuda)
$ CUDA_VISIBLE_DEVICES= bazel-bin/lingvo/trainer --run_locally=cpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr
Then, if that works, try with gpu:
$ bazel clean
$ bazel build -c opt --config=cuda lingvo:trainer
$ bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr
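If the GPU build also fails, a quick sanity check (just a suggestion, not something the trainer needs) is to confirm that TensorFlow can actually see the GPU in your environment before digging further:
$ python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
It should print True; if it prints False, the problem is in the CUDA setup rather than in lingvo.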
Still not working. One important question: is the output correct up to the point where it unexpectedly ends? I'd like to know whether I'm completely off track or whether at least part of it behaves as it should (let me know if you need the full, unsummarized output file). Here are the details about the CPU and GPU of the server I'm working on:
When I ran the model on CPU, it ended with a segmentation fault. Here is the full error output: ASRtrain_cpu.pdf. Do you know what went wrong? Thanks, Alessia
Could the problem be that the dataset wasn't downloaded and parameterized correctly?
The CPU segfault is weird. Can you see if this test passes:
bazel test -c opt //lingvo/core/ops:beam_search_step_op_test
Here's the output:
Ok, I think this might be the compiler mismatch problem with the latest tf-nightly build that we ran into last week.
Can you try pulling the latest version of lingvo, installing the latest version of tf-nightly (pip install tf-nightly --force-reinstall), and making sure you have g++-7 as your default c++ compiler.
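If you're not sure which compiler is the default, a quick check (and, assuming g++-7 is installed and you can use sudo, one way to switch the default via update-alternatives) would be:
$ g++ --version
$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 70
$ sudo update-alternatives --set g++ /usr/bin/g++-7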
I pulled the latest version of Lingvo and installed tf-nightly. The compiler on the server I'm working on isn't the right one; the available ones are:
libquadmath0 - GCC Quad-Precision Math Library
libgomp1 - GCC OpenMP (GOMP) support library
cpp - GNU C preprocessor (cpp)
g++-4.8 - GNU C++ compiler
gcc-4.8-base - GCC, the GNU Compiler Collection (base package)
libmpx0 - Intel memory protection extensions (runtime)
gcc-6-base - GCC, the GNU Compiler Collection (base package)
gcc - GNU C compiler
gcc-4.8 - GNU C compiler
pkg-config - manage compile and link flags for libraries
dpkg-dev - Debian package development tools
cpp-4.8 - GNU C preprocessor
g++ - GNU C++ compiler
libstdc++6 - GNU Standard C++ Library v3
binutils - GNU assembler, linker and binary utilities
gcc-5 - GNU C compiler
g++-5 - GNU C++ compiler
cpp-5 - GNU C preprocessor
libsepol1 - SELinux library for manipulating binary security policies
gcc-5-base - GCC, the GNU Compiler Collection (base package)
So I'm trying to find a way to install the right one, but I don't have root access.
The packages on your server don't matter; the packages inside the Docker container are what matters.
The docker file has been updated to g++7 here: https://github.com/tensorflow/lingvo/blob/29099ef71c9d9eac66e35ff27371479f284c0c7a/docker/dev.dockerfile#L27
So hopefully if you just re-build docker things should work
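As an optional sanity check after rebuilding, you can confirm the image actually picked up the new compiler before running bazel:
$ docker run --rm -it tensorflow:lingvo g++ --version
which should report g++ 7.x.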
By "rebuild docker" you mean like update it or just run this:
docker build --tag tensorflow:lingvo $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") - < lingvo/docker/dev.dockerfile
?
Yes, docker build again. Make sure to set --no-cache so it doesn't try to use cached packages.
Here's what I've done:
docker build --tag tensorflow:lingvo $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") - < lingvo/docker/dev.dockerfile --no-cache
docker run --rm $(test gpu = "gpu") -it -v lingvo:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8885:8885 --name lingvo tensorflow:lingvo bash
(inside the container) $ bazel clean
$ bazel build -c opt --config=cuda lingvo:trainer
$ bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr
I get the usual segmentation fault error. Maybe I messed up the dataset download and parameterization, could that be the problem?
Can you try the bazel test -c opt //lingvo/core/ops:beam_search_step_op_test again? If it still fails something is going wrong. In that case please run https://github.com/tensorflow/lingvo/blob/master/tf_env_collect.sh and paste the outputs.
If that test passes, then it's a different segfault in the trainer and we will need to try to track that down.
bazel test -c opt //lingvo/core/ops:beam_search_step_op_test failed again; the output of tf_env_collect.sh is the following:
root@f76131155897:/tmp# lingvo/tf_env_collect.sh
Collecting system information...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
WARNING: Logging before flag parsing goes to stderr.
W0814 15:06:51.856008 140408205829888 module_wrapper.py:136] From /usr/local/lib/python2.7/dist-packages/tensorflow_core/python/util/module_wrapper.py:163: The name tf.VERSION is deprecated. Please use tf.version.VERSION instead.
2019-08-14 15:06:51.856860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-08-14 15:06:51.876890: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz
2019-08-14 15:06:51.878904: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x51086a0 executing computations on platform Host. Devices:
2019-08-14 15:06:51.878947: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
Wrote environment to tf_env.txt. You can review the contents of that file.
and use it to populate the fields in the github issue template.
cat tf_env.txt
Sorry for not being clear, you need to actually post the contents of tf_env.txt.
But yes, it seems that for some reason the environment is still not set up correctly...
Please try these exact commands, they should work (I just tried them):
LINGVO_DIR="/tmp/lingvo"
rm -rf "$LINGVO_DIR"
git clone https://github.com/tensorflow/lingvo.git "$LINGVO_DIR"
cd "$LINGVO_DIR"
docker build --tag tensorflow:lingvo - < docker/dev.dockerfile --no-cache
docker run --rm -it -v ${LINGVO_DIR}:/tmp/lingvo --name lingvo tensorflow:lingvo bash
bazel test -c opt //lingvo/core/ops:beam_search_step_op_test
Ok, now the test passed! This means that I have the environment set up correctly, doesn't it? So the next step is to copy the downloaded dataset and run librispeech.03.parameterize_train.sh and librispeech.04.parameterize_devtest.sh? Thanks to the -v option, the data in this container will be persistent, am I right?
That should be the case.
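To be precise, only what lives under the bind-mounted path persists: -v ${LINGVO_DIR}:/tmp/lingvo maps the host directory into the container, so files written under /tmp/lingvo survive the container being removed, while anything written elsewhere inside the container (e.g. a logdir outside the mount) is lost when it exits. A quick way to check (illustrative only, the file name is arbitrary):
(inside the container) $ touch /tmp/lingvo/persist_check
(on the host) $ ls /tmp/lingvo/persist_check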
It's going on (slowly) like this:
I0816 16:52:06.508646 139673035323136 trainer.py:526] step: 4 fraction_of_correct_next_step_preds:0.0068717278 fraction_of_correct_next_step_preds/logits:0.0068717278 grad_norm/all/loss:143.61938 grad_scale_all/loss:0 log_pplx:4.9051819 log_pplx/logits:4.9051819 loss:4.9051819 loss/logits:4.9051819 num_samples_in_batch:48 token_normed_prob:0.0074080951 token_normed_prob/logits:0.0074080951 var_norm/all/loss:705.65021
I0816 16:52:12.307877 139673043715840 trainer.py:377] Steps/second: 0.010919, Examples/second: 0.524132
I0816 16:52:22.327208 139673043715840 trainer.py:377] Steps/second: 0.010629, Examples/second: 0.510178
I0816 16:52:32.326913 139673043715840 trainer.py:377] Steps/second: 0.010354, Examples/second: 0.496973
I0816 16:52:42.339401 139673043715840 trainer.py:377] Steps/second: 0.010092, Examples/second: 0.484419
I0816 16:52:52.338437 139673043715840 trainer.py:377] Steps/second: 0.009844, Examples/second: 0.472499
I0816 16:53:02.349607 139673043715840 trainer.py:377] Steps/second: 0.009607, Examples/second: 0.461138
I0816 16:53:12.357919 139673043715840 trainer.py:377] Steps/second: 0.009382, Examples/second: 0.450313
I0816 16:53:22.371999 139673043715840 trainer.py:377] Steps/second: 0.009166, Examples/second: 0.439980
I0816 16:53:29.629813 139673035323136 trainer.py:526] step: 5 fraction_of_correct_next_step_preds:0.0074134138 fraction_of_correct_next_step_preds/logits:0.0074134138 grad_norm/all/loss:171.63779 grad_scale_all/loss:0 log_pplx:4.8955145 log_pplx/logits:4.8955145 loss:4.8955145 loss/logits:4.8955145 num_samples_in_batch:48 token_normed_prob:0.0074800598 token_normed_prob/logits:0.0074800598 var_norm/all/loss:705.65021
2019-08-16 16:53:29.635178: I lingvo/core/ops/record_batcher.cc:356] 543 total seconds passed. Total records yielded: 711. Total records skipped: 1
I0816 16:53:32.371682 139673043715840 trainer.py:377] Steps/second: 0.011201, Examples/second: 0.537654
I0816 16:53:42.380145 139673043715840 trainer.py:377] Steps/second: 0.010955, Examples/second: 0.525864
I0816 16:53:52.394942 139673043715840 trainer.py:377] Steps/second: 0.010720, Examples/second: 0.514572
I0816 16:54:02.397754 139673043715840 trainer.py:377] Steps/second: 0.010495, Examples/second: 0.503768
I think this looks right, doesn't it?
That looks right. Now you can try building the gpu docker and running on gpu for faster training.
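Concretely, that means rebuilding the same dev.dockerfile on top of the CUDA base image and giving the container GPU access, roughly like this (a sketch based on the commands you already used; depending on your docker/nvidia setup you may need --gpus all instead of --runtime=nvidia):
$ docker build --tag tensorflow:lingvo --build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04 - < docker/dev.dockerfile --no-cache
$ docker run --rm --runtime=nvidia -it -v ${LINGVO_DIR}:/tmp/lingvo --name lingvo tensorflow:lingvo bash
and then running the --config=cuda build of the trainer inside that container.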
Thanks a lot for all your help! How much time should training take on GPU?
Also: I still get those warnings:
W0816 20:05:53.473803 139673043715840 meta_graph.py:448] Issue encountered when serializing __batch_norm_update_dict. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. 'dict' object has no attribute 'name'
W0816 20:05:53.475234 139673043715840 meta_graph.py:448] Issue encountered when serializing __model_split_id_stack. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. 'list' object has no attribute 'name'
and
2019-08-16 19:55:35.868151: W ./lingvo/core/ops/tokenizer_op_headers.h:64] Too long target 304 all this was a matter of notoriety in the city and [....]
Are they worth worrying about, or can I ignore them since training seems to keep going?
Should take a couple of days total. The warnings are fine.
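If you want an easier way to monitor progress than the raw trainer logs, you can point TensorBoard at the logdir (assuming the container was started with -p 6006:6006 as in your earlier docker run command):
$ tensorboard --logdir=/tmp/asr/log --port=6006
and open http://localhost:6006 in a browser.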
That looks right. Now you can try building the gpu docker and running on gpu for faster training.
By "building the gpu docker" do you mean just running:
$ bazel clean
$ bazel build -c opt --config=cuda lingvo:trainer
$ bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr
instead of
$ bazel clean
$ bazel build -c opt lingvo:trainer (note: no --config=cuda)
$ CUDA_VISIBLE_DEVICES= bazel-bin/lingvo/trainer --run_locally=cpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr
inside the container created with:
$ LINGVO_DIR="/tmp/lingvo"
$ rm -rf "$LINGVO_DIR"
$ git clone https://github.com/tensorflow/lingvo.git "$LINGVO_DIR"
$ cd "$LINGVO_DIR"
$ docker build --tag tensorflow:lingvo - < docker/dev.dockerfile --no-cache
$ docker run --rm -it -v ${LINGVO_DIR}:/tmp/lingvo --name lingvo tensorflow:lingvo bash
?
Hi everyone, I tried to run the ASR task with both the Librispeech960Base model and the Librispeech960Grapheme one. I don't know exactly what output I should expect in order to be sure everything finished correctly, but I'm fairly sure mine wasn't right because it stopped at the first checkpoint. I attached a version of the output that I summarized; tell me if you need the original one. Thanks a lot. gpu_assertfalse_cutted.pdf
(I also tried with different run commands:
bazel run -c opt --config=cuda //lingvo:trainer -- --logtostderr \
bazel-bin/lingvo/trainer --run_locally=cpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr
but the output was nearly the same.)