tensorflow / lingvo


About build output in ASR #136

Open alessiaatunimi opened 5 years ago

alessiaatunimi commented 5 years ago

Hi everyone, I tried to run the ASR task with both the Librispeech960Base model and the Librispeech960Grapheme one. I don't know exactly what output I should get to be sure that everything finished correctly, but I'm fairly sure mine wasn't right, because it stopped at the first checkpoint. I attached a summarized version of the output; tell me if you need the original one. Thanks a lot gpu_assertfalse_cutted.pdf

(I also tried different run commands:

  1. `bazel run -c opt --config=cuda //lingvo:trainer -- --logtostderr \
      --model=asr.librispeech.Librispeech960Grapheme --mode=sync \
      --logdir=/tmp/ebs/lingvo/librispeech --saver_max_to_keep=2 \
      --run_locally=gpu 2>&1 |& tee run.log`
  2. `bazel-bin/lingvo/trainer --run_locally=cpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr`

but the output was nearly the same.)

jonathanasdf commented 5 years ago

Hm, this is strange.

First, can you check that it works on cpu:

$ bazel clean
$ bazel build -c opt lingvo:trainer   (note: no --config=cuda)
$ CUDA_VISIBLE_DEVICES= bazel-bin/lingvo/trainer --run_locally=cpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr

Then, if that works, try with gpu:

$ bazel clean
$ bazel build -c opt --config=cuda lingvo:trainer
$ bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr
alessiaatunimi commented 5 years ago

Still not working. Important question: is the output correct up until the unexpected end? Just so I know whether I'm completely wrong or whether something is as it should be. (Let me know if you need the non-summarized output file.) Here are the details about the CPU and GPU of the server I'm working on:

Screen Shot 2019-08-08 at 2 05 11 PM

When I ran the model on CPU, it ended in a segmentation fault. Here is the full error output: ASRtrain_cpu.pdf Do you know what went wrong? Thanks, Alessia

alessiaatunimi commented 5 years ago

Could it be a problem with the dataset, e.g. that it wasn't downloaded and parameterized correctly?

jonathanasdf commented 5 years ago

The CPU segfault is weird. Can you check whether this test passes:

bazel test -c opt //lingvo/core/ops:beam_search_step_op_test
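
If it fails, the details may not show up in the console by default; a couple of ways to surface them (these are standard Bazel options, not Lingvo-specific, and the log path may vary by setup):

bazel test -c opt //lingvo/core/ops:beam_search_step_op_test --test_output=errors   # print failing test output inline
cat bazel-testlogs/lingvo/core/ops/beam_search_step_op_test/test.log                 # full test log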

alessiaatunimi commented 5 years ago

Here's the output:

Screen Shot 2019-08-09 at 4 14 47 PM
jonathanasdf commented 5 years ago

Ok, I think this might be the compiler mismatch problem with the latest tf-nightly build that we ran into last week.

Can you try pulling the latest version of Lingvo, installing the latest version of tf-nightly (`pip install tf-nightly --force-reinstall`), and making sure you have g++-7 as your default C++ compiler?
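
For reference, one common way to make g++-7 the default on Ubuntu is via update-alternatives; this is just a sketch and assumes g++-7 is already installable and that you can use sudo on that machine:

sudo apt-get install g++-7                                             # install the compiler if missing
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 60  # register g++-7 as an alternative for g++
sudo update-alternatives --config g++                                  # select g++-7
g++ --version                                                          # confirm it now reports 7.x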

alessiaatunimi commented 5 years ago

I pulled the latest version of Lingvo and installed tf-nightly. The server I'm working on doesn't have the right compiler; the available packages are:

libquadmath0 - GCC Quad-Precision Math Library
libgomp1 - GCC OpenMP (GOMP) support library
cpp - GNU C preprocessor (cpp)
g++-4.8 - GNU C++ compiler
gcc-4.8-base - GCC, the GNU Compiler Collection (base package)
libmpx0 - Intel memory protection extensions (runtime)
gcc-6-base - GCC, the GNU Compiler Collection (base package)
gcc - GNU C compiler
gcc-4.8 - GNU C compiler
pkg-config - manage compile and link flags for libraries
dpkg-dev - Debian package development tools
cpp-4.8 - GNU C preprocessor
g++ - GNU C++ compiler
libstdc++6 - GNU Standard C++ Library v3
binutils - GNU assembler, linker and binary utilities
gcc-5 - GNU C compiler
g++-5 - GNU C++ compiler
cpp-5 - GNU C preprocessor
libsepol1 - SELinux library for manipulating binary security policies
gcc-5-base - GCC, the GNU Compiler Collection (base package)

So I'm trying to find a way to install the right one, but I don't have root access.

jonathanasdf commented 5 years ago

The packages on your server are not important; what matters is the packages inside Docker.

The Dockerfile has been updated to g++-7 here: https://github.com/tensorflow/lingvo/blob/29099ef71c9d9eac66e35ff27371479f284c0c7a/docker/dev.dockerfile#L27

So hopefully things will work if you just rebuild the Docker image.

alessiaatunimi commented 5 years ago

By "rebuild docker" you mean like update it or just run this: docker build --tag tensorflow:lingvo $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") - < lingvo/docker/dev.dockerfile?

jonathanasdf commented 5 years ago

Yes, run docker build again. Make sure to pass --no-cache so it doesn't reuse cached packages.
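
Concretely, something like this (just the command you quoted plus --no-cache; the base_image argument only kicks in when LINGVO_DEVICE is set to gpu):

docker build --no-cache --tag tensorflow:lingvo \
  $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") \
  - < lingvo/docker/dev.dockerfile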

alessiaatunimi commented 5 years ago

Here's what I've done:

  1. In the folder where I cloned the repository: docker build --tag tensorflow:lingvo $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") - < lingvo/docker/dev.dockerfile --no-cache
  2. docker run --rm $(test gpu = "gpu") -it -v lingvo:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8885:8885 --name lingvo tensorflow:lingvo bash
  3. Inside the container:
     $ bazel clean
     $ bazel build -c opt --config=cuda lingvo:trainer
     $ bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr

I get the usual segfault error. Maybe I messed up the dataset download and parameterization; could that be the problem?

jonathanasdf commented 5 years ago

Can you try `bazel test -c opt //lingvo/core/ops:beam_search_step_op_test` again? If it still fails, something is going wrong with the environment. In that case please run https://github.com/tensorflow/lingvo/blob/master/tf_env_collect.sh and paste the output.

If that test passes, then it's a different segfault in the trainer and we will need to track that down.
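
If it does turn out to be a trainer-only segfault, one generic way to get a native backtrace is to attach gdb to the running trainer process (a sketch; it assumes gdb is installed in the container):

pgrep -f lingvo/trainer        # find the trainer's pid after starting it as before
gdb -p <pid>                   # attach to that pid
(gdb) continue                 # let it run until it crashes
(gdb) bt                       # after the segfault, print the backtrace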

alessiaatunimi commented 5 years ago

`bazel test -c opt //lingvo/core/ops:beam_search_step_op_test` failed again; the output of `tf_env_collect.sh` is the following:

root@f76131155897:/tmp# lingvo/tf_env_collect.sh
Collecting system information...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
WARNING: Logging before flag parsing goes to stderr.
W0814 15:06:51.856008 140408205829888 module_wrapper.py:136] From /usr/local/lib/python2.7/dist-packages/tensorflow_core/python/util/module_wrapper.py:163: The name tf.VERSION is deprecated. Please use tf.version.VERSION instead.

2019-08-14 15:06:51.856860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-08-14 15:06:51.876890: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz
2019-08-14 15:06:51.878904: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x51086a0 executing computations on platform Host. Devices:
2019-08-14 15:06:51.878947: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
Wrote environment to tf_env.txt. You can review the contents of that file.
and use it to populate the fields in the github issue template.

cat tf_env.txt
jonathanasdf commented 5 years ago

Sorry for not being clear; you need to actually post the contents of tf_env.txt.

But yes, it seems that for some reason the environment is still not set up correctly...

Please try these exact commands, they should work (I just tried them):

LINGVO_DIR="/tmp/lingvo"
rm -rf "$LINGVO_DIR"
git clone https://github.com/tensorflow/lingvo.git "$LINGVO_DIR"
cd "$LINGVO_DIR"
docker build --tag tensorflow:lingvo - < docker/dev.dockerfile --no-cache
docker run --rm -it -v ${LINGVO_DIR}:/tmp/lingvo --name lingvo tensorflow:lingvo bash
bazel test -c opt //lingvo/core/ops:beam_search_step_op_test
alessiaatunimi commented 5 years ago

Ok, now the test passed! This means that I have the environment set up correctly, doesn't it? So the next step is to copy the downloaded dataset and run `librispeech.03.parameterize_train.sh` and `librispeech.04.parameterize_devtest.sh`? Thanks to the -v option, the data in this container should be persistent, am I right?

jonathanasdf commented 5 years ago

that should be the case.
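
For what it's worth, a quick generic way to sanity-check the persistence (standard Docker bind-mount behavior, nothing Lingvo-specific): anything written under the mounted path survives the container, while paths outside it (e.g. /tmp/asr/log) live only inside the container, which is removed on exit because of --rm.

touch /tmp/lingvo/persistence_check    # inside the container
exit
ls "$LINGVO_DIR"/persistence_check     # back on the host; the file should still be there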

alessiaatunimi commented 5 years ago

It's going on (slowly) like this:

I0816 16:52:06.508646 139673035323136 trainer.py:526] step: 4 fraction_of_correct_next_step_preds:0.0068717278 fraction_of_correct_next_step_preds/logits:0.0068717278 grad_norm/all/loss:143.61938 grad_scale_all/loss:0 log_pplx:4.9051819 log_pplx/logits:4.9051819 loss:4.9051819 loss/logits:4.9051819 num_samples_in_batch:48 token_normed_prob:0.0074080951 token_normed_prob/logits:0.0074080951 var_norm/all/loss:705.65021
I0816 16:52:12.307877 139673043715840 trainer.py:377] Steps/second: 0.010919, Examples/second: 0.524132
I0816 16:52:22.327208 139673043715840 trainer.py:377] Steps/second: 0.010629, Examples/second: 0.510178
I0816 16:52:32.326913 139673043715840 trainer.py:377] Steps/second: 0.010354, Examples/second: 0.496973
I0816 16:52:42.339401 139673043715840 trainer.py:377] Steps/second: 0.010092, Examples/second: 0.484419
I0816 16:52:52.338437 139673043715840 trainer.py:377] Steps/second: 0.009844, Examples/second: 0.472499
I0816 16:53:02.349607 139673043715840 trainer.py:377] Steps/second: 0.009607, Examples/second: 0.461138
I0816 16:53:12.357919 139673043715840 trainer.py:377] Steps/second: 0.009382, Examples/second: 0.450313
I0816 16:53:22.371999 139673043715840 trainer.py:377] Steps/second: 0.009166, Examples/second: 0.439980
I0816 16:53:29.629813 139673035323136 trainer.py:526] step: 5 fraction_of_correct_next_step_preds:0.0074134138 fraction_of_correct_next_step_preds/logits:0.0074134138 grad_norm/all/loss:171.63779 grad_scale_all/loss:0 log_pplx:4.8955145 log_pplx/logits:4.8955145 loss:4.8955145 loss/logits:4.8955145 num_samples_in_batch:48 token_normed_prob:0.0074800598 token_normed_prob/logits:0.0074800598 var_norm/all/loss:705.65021
2019-08-16 16:53:29.635178: I lingvo/core/ops/record_batcher.cc:356] 543 total seconds passed. Total records yielded: 711. Total records skipped: 1
I0816 16:53:32.371682 139673043715840 trainer.py:377] Steps/second: 0.011201, Examples/second: 0.537654
I0816 16:53:42.380145 139673043715840 trainer.py:377] Steps/second: 0.010955, Examples/second: 0.525864
I0816 16:53:52.394942 139673043715840 trainer.py:377] Steps/second: 0.010720, Examples/second: 0.514572
I0816 16:54:02.397754 139673043715840 trainer.py:377] Steps/second: 0.010495, Examples/second: 0.503768

I think this looks right, doesn't it?

jonathanasdf commented 5 years ago

That looks right. Now you can try building the gpu docker and running on gpu for faster training.
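
Roughly, that means rebuilding the image with the CUDA base image and then building the trainer with --config=cuda inside the container. A sketch based on the commands already in this thread; depending on your Docker version you may also need nvidia-docker, --runtime=nvidia, or --gpus all to expose the GPU to the container:

LINGVO_DIR="/tmp/lingvo"
cd "$LINGVO_DIR"
docker build --no-cache --tag tensorflow:lingvo \
  --build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04 \
  - < docker/dev.dockerfile
docker run --rm -it --runtime=nvidia -v ${LINGVO_DIR}:/tmp/lingvo -p 6006:6006 --name lingvo tensorflow:lingvo bash
# inside the container:
bazel build -c opt --config=cuda lingvo:trainer
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr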

alessiaatunimi commented 5 years ago

Thanks so much for all your help! However, roughly how much time should it take on GPU?

alessiaatunimi commented 5 years ago

Also: I still get these warnings:

W0816 20:05:53.473803 139673043715840 meta_graph.py:448] Issue encountered when serializing __batch_norm_update_dict. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. 'dict' object has no attribute 'name'
W0816 20:05:53.475234 139673043715840 meta_graph.py:448] Issue encountered when serializing __model_split_id_stack. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. 'list' object has no attribute 'name'

and

2019-08-16 19:55:35.868151: W ./lingvo/core/ops/tokenizer_op_headers.h:64] Too long target 304 all this was a matter of notoriety in the city and [....]

Are they worth worrying about, or can I ignore them since training seems to be progressing?

jonathanasdf commented 5 years ago

Should take a couple of days total. The warnings are fine.

alessiaatunimi commented 5 years ago

> That looks right. Now you can try building the gpu docker and running on gpu for faster training.

By "building the gpu docker" do you mean just running:

$ bazel clean
$ bazel build -c opt --config=cuda lingvo:trainer
$ bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr

instead of

$ bazel clean
$ bazel build -c opt lingvo:trainer   (note: no --config=cuda)
$ CUDA_VISIBLE_DEVICES= bazel-bin/lingvo/trainer --run_locally=cpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/asr/log --logtostderr

inside the container created with:

$ LINGVO_DIR="/tmp/lingvo"
$ rm -rf "$LINGVO_DIR"
$ git clone https://github.com/tensorflow/lingvo.git "$LINGVO_DIR"
$ cd "$LINGVO_DIR"
$ docker build --tag tensorflow:lingvo - < docker/dev.dockerfile --no-cache
$ docker run --rm -it -v ${LINGVO_DIR}:/tmp/lingvo --name lingvo tensorflow:lingvo bash

?