rwth-i6 / returnn-experiments

experiments with RETURNN

Question about 2020-rnn-transducer #49

Closed yanghongjiazheng closed 2 years ago

yanghongjiazheng commented 4 years ago

Dear author, I see that you have provided the config files for the <2020-rnn-transducer> experiment, which is based on the SWB corpus. I tried using this corpus to reproduce your results, but I'm not familiar with the feature extraction process, and I have not built and installed the RASR toolkit successfully yet. I also do not understand why you used Gammatone features rather than MFCC features. May I use the Librispeech corpus and MFCC features to reproduce your result? Or could you share some config files for transducer experiments using the Librispeech corpus and MFCC features? Thanks a lot.

albertz commented 4 years ago

Thank you for the interest in our experiments.

We compared Gammatone (GT) vs MFCC at some point (for hybrid BLSTM-HMMs) and GT was consistently a bit better (on Switchboard). I also compared GT (in RASR) to MFCC (via librosa) on Switchboard for attention models, and GT seems to be much better in my initial experiments. The problem seems to be that the model overfits more easily with MFCC for some reason, and we can probably compensate for that with more regularization, but this needs more investigation. I also compared log-mel features vs MFCC (both via librosa) on Librispeech (with attention models), and MFCC seems to be better. But maybe this is a matter of more tuning.

So, certainly, you can do MFCC, or log-mel features. You can use our simple librosa-based feature extraction pipeline for that (e.g. via the OggZipDataset), but you probably need to retune then, and will not get as good results with the same config (without tuning). This would need some work (for the tuning, and also for setting up an OggZipDataset).
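For illustration, here is a rough sketch of how such a librosa-based dataset could be set up in the RETURNN config. The paths, feature dimension and BPE files are placeholders, not from our setups; please check the exact option names against the RETURNN documentation:

```python
# Hypothetical example: OggZipDataset with librosa-based MFCC extraction.
train = {
    "class": "OggZipDataset",
    "path": "/path/to/train.ogg.zip",  # zip with ogg audio + transcription metadata
    "audio": {"features": "mfcc", "num_feature_filters": 40},  # librosa MFCC, 40 dims
    "targets": {
        "class": "BytePairEncoding",
        "bpe_file": "/path/to/bpe.codes",
        "vocab_file": "/path/to/bpe.vocab",
        "unknown_label": None,
    },
    "partition_epoch": 20,
    "seq_ordering": "laplace:.1000",
}
```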

Setting up the RASR pipeline is certainly more complicated, but if you invest some time, this is also doable. Here you find RASR config files (in config and flow), which should mostly work. I think you should find some scripts with RASR which will do the feature extraction offline and store it in RASR cache files.

Our Librispeech pipeline is certainly much simpler. You can mostly take the existing attention setup (use one which uses ogg files, for faster speed and less disk space) for the dataset, i.e. for train and dev in the RETURNN config (and then also fix extern_data or num_outputs), and otherwise use the transducer config. We are currently trying this ourselves, but cannot share experience yet.
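Just as an illustration of the extern_data part (the dimensions are placeholders for your feature and vocabulary sizes, not values from our configs):

```python
# Hypothetical dims: 40 MFCC features per frame, 10025 BPE target labels.
extern_data = {
    "data": {"dim": 40},                        # dense input features
    "classes": {"dim": 10025, "sparse": True},  # sparse target label indices
}
```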

Also, we are currently setting up a RETURNN config which simplifies the transducer training pipeline. To get to our final result from the paper, you first start with full-sum training, then extract an alignment, and then do frame-wise cross-entropy training. We published all necessary scripts and configs for that, but it's currently a somewhat manual process. We have prepared a config which will do the whole pipeline in one go. This is mostly ready (we are still testing it), and we can publish it soon. Maybe we can share it already with you, if you are interested. Note that this takes a long time to train. It might make sense to not redo the first part of the training every time, and to use an existing alignment instead, so you save the first part of the training. We are also experimenting with how to speed this up.

Maybe @Spotlight0xff wants to add sth regarding transducer, or the new simplified config?

Regarding RASR, maybe @curufinwe wants to add sth, regarding which RASR to take, and where you can find scripts and configs for GT feature extraction?

yanghongjiazheng commented 4 years ago

Of course, it would be appreciated if you could offer the config. I also have a question while building RASR. I ran ./script/requirement.sh and it works:

g++ -DSPRINT_STANDARD_BUILD -DDBG_LEVEL=-1 -DPROC_x86_64 -DOS_linux -DARCH_linux_x86_64 -D_GNU_SOURCE -DOPENFST_1_6_7 -DENABLE_SSE2 -I. -I/home/research/jiayang/workspace/returnn-experiments/2020-transducer/returnn-experiments/2020-rnn-transducer/rasr/src -I/usr/include/ffmpeg -isystem /home/research/linqq/kaldi/tools/openfst/include -I/usr/include/libxml2 -I/usr/include/python2.7 -I/usr/include/python2.7 -pipe -funsigned-char -fno-exceptions -Wall -Wno-long-long -ffast-math -msse3 -O2 -std=gnu++0x -Wno-unknown-pragmas -Wno-deprecated -fno-strict-aliasing -DBISON_VERSION_2 -rdynamic -L/home/research/linqq/kaldi/tools/openfst/lib -Wl,-rpath=/home/research/linqq/kaldi/tools/openfst/lib -o /tmp/N2DyZYFH /tmp/bPWf5cg3.cc -lz

library libz: OK

But when I run make, there is a #error "SSSE3 instruction set not enabled". Have you encountered the same problem?

yanghongjiazheng commented 4 years ago

I'm trying to use this config to train a full-sum CTC network using the Librispeech corpus. However, it went wrong. Can you offer some help?

albertz commented 4 years ago

About the RASR compile error, I don't know. You might want to report it here. You should provide some more information as well.

About the full-sum training: you forgot to say what went wrong, so I have no idea. Btw, there is a small problem in the config: you should not use 'seq_postfix': [0] for your dataset (EOS is not needed for the transducer).
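Concretely, if the targets options of your dataset look something like this (hypothetical snippet, paths are placeholders), just drop the seq_postfix entry:

```python
"targets": {
    "class": "BytePairEncoding",
    "bpe_file": "/path/to/bpe.codes",
    "vocab_file": "/path/to/bpe.vocab",
    # "seq_postfix": [0],  # remove this; no EOS label needed for the transducer
},
```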

yanghongjiazheng commented 4 years ago

I have figured out the RASR compile error: I changed the instruction set to SSE4.2. I have also attached my exception log here, just below the config file. I'm sure my config file has many mistakes, but due to the big changes in RETURNN, I don't know how to fix them. I hope you can help me with it.

albertz commented 4 years ago

Ah sorry, I didn't see that. But still, it would be useful to get the full stdout (from the very beginning).

In the error, you can ignore the part with FetchHelper. That doesn't matter.

The real problem/exception seems to be:

InvalidArgumentError (see above for traceback): slice index 2 of dimension 0 out of bounds.
     [[Node: optimize/gradients/output/rec/output_prob/while/cond/strided_slice_1_grad/StridedSliceGrad = StridedSliceGrad[Index=DT_INT32, T=DT_FLOAT, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/device:GPU:0"](optimize/gradients/output/rec/output_prob/while/cond/strided_slice_1_grad/Shape, optimize/gradients/output/rec/output_prob/while/cond/strided_slice_1_grad/StridedSliceGrad/StackPopV2/_775, optimize/gradients/output/rec/output_prob/while/cond/strided_slice_1_grad/StridedSliceGrad/StackPopV2_1/_777, optimize/gradients/output/rec/output_prob/while/cond/strided_slice_1_grad/StridedSliceGrad/Const_2, optimize/gradients/output/rec/output_prob/while/cond/Merge_grad/tuple/control_dependency)]]
     [[Node: optimize/gradients/b_count_2/_1101 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_8019_optimize/gradients/b_count_2", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopoptimize/gradients/output/rec/output_prob/while/cond/strided_slice_1_grad/StridedSliceGrad/Switch_1/_265)]]

Currently I don't have a good idea why that happens. @Spotlight0xff maybe you have seen this before?

What TF version do you use? I would recommend TF 2.3.0. Also, I would recommend not using Anaconda; they sometimes do strange things in their environment, and I have seen all sorts of really strange problems caused by that.

yanghongjiazheng commented 4 years ago

I use TF 1.8.0.

albertz commented 4 years ago

I use TF 1.8.0.

Can you try TF 2.3.0? I think TF 1.8.0 is too old.

yanghongjiazheng commented 4 years ago

Can I just use the LibrispeechDataset to train the transducer?

yanghongjiazheng commented 4 years ago

When I changed from TF 1.8 to TF 2.3, I got a problem:

returnn/returnn/tf/engine.py", line 851, in _check_devices
line: assert is_gpu_available(), "no GPU available"
locals:
  is_gpu_available = <local> <function is_gpu_available at 0x7fd3e5c1d400>

albertz commented 4 years ago

This looks like your TF 2.3 setup does not correctly find CUDA (did you install tensorflow-gpu?). This is not really related to RETURNN. Please check that you correctly installed TF 2.3 with GPU support, and that it finds the correct CUDA version.
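A quick way to check this, independent of RETURNN (standard TF 2.x API):

```python
import tensorflow as tf

print(tf.__version__)                          # should print 2.3.0
print(tf.config.list_physical_devices("GPU"))  # should list at least one GPU device
```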

yanghongjiazheng commented 4 years ago

Hi, I'm so excited. Finally, I successfully launched a training run using the Librispeech corpus and MFCC features. The CUDA version is 10.0 and the TensorFlow version is 1.15.0. I'm training the "Transducer model, CTC label topology, full-sum training". Now I need a specific pipeline to follow. Can you offer some help? For example, can you describe the procedure for how I can do this, and how to run inference to test the performance of the models?

pretrain epoch 1, step 18, cost:output/output_prob 12.80347536937336, loss 116.378136, max_size:classes 5, max_size:data 417, mem_usage:GPU:0 1.9GB, num_seqs 9, 0.246 sec/step, elapsed 0:00:14, exp. remaining 0:00:27, complete 34.60%
pretrain epoch 1, step 19, cost:output/output_prob 16.785553228639856, loss 169.00238, max_size:classes 5, max_size:data 385, mem_usage:GPU:0 1.9GB, num_seqs 10, 0.304 sec/step, elapsed 0:00:14, exp. remaining 0:00:26, complete 35.83%
pretrain epoch 1, step 20, cost:output/output_prob 17.914719161829908, loss 198.20876, max_size:classes 6, max_size:data 353, mem_usage:GPU:0 1.9GB, num_seqs 11, 0.218 sec/step, elapsed 0:00:15, exp. remaining 0:00:25, complete 37.06%
pretrain epoch 1, step 21, cost:output/output_prob 17.11497132416389, loss 189.40993, max_size:classes 6, max_size:data 346, mem_usage:GPU:0 1.9GB, num_seqs 11, 0.277 sec/step, elapsed 0:00:15, exp. remaining 0:00:24, complete 38.28%
pretrain epoch 1, step 22, cost:output/output_prob 17.67794418334961, loss 142.5688, max_size:classes 6, max_size:data 461, mem_usage:GPU:0 1.9GB, num_seqs 8, 0.261 sec/step, elapsed 0:00:15, exp. remaining 0:00:24, complete 39.39%
pretrain epoch 1, step 23, cost:output/output_prob 16.994180551084582, loss 171.08705, max_size:classes 6, max_size:data 372, mem_usage:GPU:0 1.9GB, num_seqs 10, 0.232 sec/step, elapsed 0:00:15, exp. remaining 0:00:23, complete 40.74%
pretrain epoch 1, step 24, cost:output/output_prob 13.888980309500994, loss 140.03346, max_size:classes 6, max_size:data 372, mem_usage:GPU:0 1.9GB, num_seqs 10, 0.240 sec/step, elapsed 0:00:16, exp. remaining 0:00:22, complete 41.84%
pretrain epoch 1, step 25, cost:output/output_prob 13.098986625671387, loss 105.935555, max_size:classes 5, max_size:data 449, mem_usage:GPU:0 1.9GB, num_seqs 8, 0.284 sec/step, elapsed 0:00:16, exp. remaining 0:00:21, complete 43.07%
pretrain epoch 1, step 26, cost:output/output_prob 13.196998694516708, loss 119.91665, max_size:classes 6, max_size:data 442, mem_usage:GPU:0 1.9GB, num_seqs 9, 0.321 sec/step, elapsed 0:00:16, exp. remaining 0:00:21, complete 44.42%
albertz commented 4 years ago

The best pipeline we came up with in the paper is the one described above: full-sum training first, then extracting an alignment with that model, and then frame-wise cross-entropy training using the alignment.

So, you basically have 3 training stages here, but you throw away the first model after extracting the alignments. (We are experimenting with variations of this training pipeline but have nothing to share at this moment.)

With respect to evaluating (any of the models at any stage), you can use the scripts/setup from here. See the recog there.

yanghongjiazheng commented 4 years ago

OK, thanks a lot. Your work is great and interesting.

albertz commented 4 years ago

Thanks for the kind words. I'm very curious what results you get on Librispeech. We are also doing similar experiments now (but nothing to share yet).

yanghongjiazheng commented 4 years ago

I'll share the results with you when I finish this pipeline.