mravanelli / pytorch-kaldi

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

No Decoding Output #250

Closed kevinmchu closed 3 years ago

kevinmchu commented 3 years ago

I'm running the TIMIT LSTM on custom features, and I obtained the following error in my log.log file:

[screenshot: error message from log.log]

I checked my best path file, but did not see any error messages or warnings.

[screenshot: best path log]

I've also double-checked my cfg file, and all of the directories exist. I'm running Ubuntu 16.04, CUDA 10.2, and PyTorch 1.7.1. What am I doing wrong?

TParcollet commented 3 years ago

Hi, as we can see from the log, it reports "Done 0 lattices", so something likely went wrong during the forward phase. I would recommend removing all the directories related to decoding as well as the forward files generated by pytorch-kaldi (the ones created when forwarding the test set). Then start again and check that the forward process goes smoothly.
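A minimal cleanup sketch along those lines; the experiment folder, file patterns, and cfg path below are placeholders, so adapt them to the out_folder and output names set in your own cfg file:

```bash
# Hedged cleanup sketch: all paths are placeholders, adapt to your cfg.
exp_dir=exp/TIMIT_LSTM_custom            # placeholder: your out_folder

# Remove the decoding directories left over from the failed run.
rm -rf "$exp_dir"/decode_TIMIT_test_out_dnn2

# Remove the forward files written for the test set (the *_to_decode.ark archives).
rm -f "$exp_dir"/exp_files/forward_TIMIT_test_*_to_decode.ark

# Re-run the experiment so the forward and decoding steps are regenerated.
python run_exp.py cfg/TIMIT_baselines/TIMIT_LSTM_fbank.cfg   # placeholder cfg path
```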

kevinmchu commented 3 years ago

Thanks for the quick reply. I removed the decoding directories and forward files and reran the model on the test set, but I obtained the same error as before.

TParcollet commented 3 years ago

Does the forward phase run smoothly? Can you see it?

kevinmchu commented 3 years ago

This is the output I obtain when I run the model on the test data:

Testing TIMIT_test chunk = 1 / 1 [========================================] 100% Forwarding | (Batch 192/192)) Decoding TIMIT_test output out_dnn2

Does this indicate that the forward phase ran smoothly?

TParcollet commented 3 years ago

Yep. Does the final.mdl model exist? Can you check its size? Also, you could try manually running the Kaldi command line that fails.

kevinmchu commented 3 years ago

Yes, final.mdl exists and has a size of 5.2MB.

As for manually re-running latgen-faster-mapped, where can I find the values of $thread_string, $min_active, $max_active, etc.?
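In case it helps, here is a hedged sketch of what such a manual invocation usually looks like. In the stock TIMIT recipes these values appear to come from the [decoding] section of the cfg file (min_active, max_active, beam, latbeam, acwt, ...) and are filled in by kaldi_decoding_scripts/decode_dnn.sh; the numbers and paths below are illustrative placeholders, not the ones from this setup:

```bash
# Placeholders: point these at your own Kaldi experiment and pytorch-kaldi output.
model=/path/to/final.mdl              # acoustic model used for the alignments
graph=/path/to/graph                  # graph directory with HCLG.fst and words.txt
posts=/path/to/forward_TIMIT_test_ep0_ck0_out_dnn2_to_decode.ark   # forwarded posteriors
dir=/path/to/decode_out               # output directory for the lattices
mkdir -p "$dir"

# Illustrative decoding parameters; use the ones from your cfg [decoding] section.
latgen-faster-mapped \
  --min-active=200 --max-active=7000 --max-mem=50000000 \
  --beam=13.0 --lattice-beam=8.0 --acoustic-scale=0.2 \
  --allow-partial=true --word-symbol-table="$graph"/words.txt \
  "$model" "$graph"/HCLG.fst ark:"$posts" \
  "ark:|gzip -c > $dir/lat.1.gz"
```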

kevinmchu commented 3 years ago

Also, I was able to run the decoder for an LSTM trained on MFCCs, which makes me think there is something wrong with my features.

TParcollet commented 3 years ago

Weird ..

kevinmchu commented 3 years ago

@mravanelli Do you have any insight about this issue?

TParcollet commented 3 years ago

It is most likely that the forwarded data are empty. How fast was the forward phase? If it was very quick, it might indicate that your input features are indeed not good. You should definitely try to call the command manually and inspect the different outputs, for example by checking whether the lattices are empty.
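A quick check along those lines, assuming the Kaldi binaries are on your PATH; the decode directory path is a placeholder:

```bash
# Placeholder: the decode directory created for TIMIT_test / out_dnn2.
latdir=/path/to/decode_TIMIT_test_out_dnn2

# File size alone is a strong hint: a ~20-byte lat.*.gz is an empty gzip archive.
ls -lh "$latdir"/lat.*.gz

# Print the lattices in text form; an empty archive produces no output.
gunzip -c "$latdir"/lat.1.gz | lattice-copy ark:- ark,t:- | head
```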

kevinmchu commented 3 years ago

The forward phase lasted ~10 minutes. I ran latgen-faster-mapped without any errors, but the lattices were empty.

TParcollet commented 3 years ago

So in the TIMIT_test out_dnn2 output, all the lat.*.gz files are empty?

TParcollet commented 3 years ago

If so, please check that your $finalfeats (I don't know where you saved them) are OK (not empty).
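A hedged way to sanity-check the forwarded archive, where the .ark path is a placeholder matching the file pytorch-kaldi writes before decoding:

```bash
# Placeholder: the posterior archive written by the forward phase.
ark=/path/to/forward_TIMIT_test_ep0_ck0_out_dnn2_to_decode.ark

# Count frames per utterance; an empty or truncated archive lists nothing.
feat-to-len ark:"$ark" ark,t:- | head

# Check the output dimension; it should match the number of pdfs in final.mdl.
feat-to-dim ark:"$ark" -
```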

kevinmchu commented 3 years ago

I just realized I forgot to change the fea_name in the configuration file. However, when I changed fea_name to the correct name, I obtained this error:

ERROR: the input "mfcc" is not defined before (possible inputs are ['xxxx'])

kevinmchu commented 3 years ago

I removed all directories related to the trained model and re-trained for 1 epoch. However, I am not getting a final.mdl file when training finishes. The log.log file does not show any errors or warnings. I did receive this warning on the terminal:

/home/lab/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py:1717: UserWarning: Attempting to set identical left==right results in singular transformations; automatically expanding. left=0, right=0 self.set_xlim([v[0], v[1]], emit=emit, auto=False)

Does this explain the missing final.mdl file?

TParcollet commented 3 years ago

No, the final.mdl only appears if you reach the number of epochs given in the config file.

kevinmchu commented 3 years ago

To clarify, in the cfg file I set n_epochs_tr to 1 but still did not get a final.mdl. Is there something else I am supposed to change if I only want to train over 1 epoch?

kevinmchu commented 3 years ago

I solved the problem with the missing final.mdl. I split run_exp.py into training and testing scripts, and it turns out I needed to run the testing script for final.mdl to appear.

However, I am experiencing the same problem as before during decoding, where the forward phase runs smoothly but I do not obtain any output. My lat.1.gz file is only 20 bytes. forward_TIMIT_test_ep0_ck0_out_dnn2_to_decode.ark is 2.1 GB, which seems reasonable. Any other ideas?

kevinmchu commented 3 years ago

@TParcollet @mravanelli I just wanted to follow up and ask if you have any more insight about this issue.

kevinmchu commented 3 years ago

I figured out the problem. The issue was a mismatch between my lab_graph and lab_folder settings, which by default shows up as a segmentation fault.
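For anyone who lands here later, a hedged sanity check, under the assumption that lab_folder points at a Kaldi alignment directory and lab_graph at a graph directory built inside the same Kaldi experiment folder (all paths below are placeholders):

```bash
# Placeholders for the cfg values in the [dataset*] sections.
lab_folder=/path/to/kaldi_exp/tri3_ali       # alignments used as training labels
lab_graph=/path/to/kaldi_exp/tri3/graph      # graph used for decoding

# The graph must be built from the same model/tree as the alignments.
# The pdf counts printed here should match each other and the output
# dimension of the forwarded .ark (see feat-to-dim above).
hmm-info "$lab_folder"/final.mdl
hmm-info "$lab_graph"/../final.mdl           # assumes final.mdl sits next to the graph dir
```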