k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall

Gigaspeech recipe #230

Closed pzelasko closed 2 years ago

pzelasko commented 3 years ago

It's basically a copy-paste of the LibriSpeech recipe with slightly different data preparation. I am currently running the training on the smallest subset -- XS -- to see if everything's working OK.

I hope we'll find a solution sometime soon so that we won't have to copy all these bash scripts into local/ for each new recipe.

pzelasko commented 3 years ago

I get an error in graph compilation when I attempt decoding -- could the graph be too large?

2021-07-09 09:56:24,627 DEBUG [mmi_att_transformer_decode.py:490] Loading L_disambig.fst.txt
2021-07-09 09:56:32,290 DEBUG [mmi_att_transformer_decode.py:493] Loading G.fst.txt
2021-07-09 09:57:44,288 INFO [graph.py:49] Intersecting L and G
2021-07-09 10:02:10,903 INFO [graph.py:51] LG shape = (153854092, None)
2021-07-09 10:02:10,903 INFO [graph.py:52] Connecting L*G
2021-07-09 10:02:10,904 INFO [graph.py:54] LG shape = (153854092, None)
2021-07-09 10:02:10,904 INFO [graph.py:55] Determinizing L*G
2021-07-09 10:09:50,677 INFO [graph.py:57] LG shape = (122341660, None)
2021-07-09 10:09:50,678 INFO [graph.py:58] Connecting det(L*G)
2021-07-09 10:09:50,678 INFO [graph.py:60] LG shape = (122341660, None)
2021-07-09 10:09:50,678 INFO [graph.py:61] Removing disambiguation symbols on L*G
2021-07-09 10:09:53,055 INFO [graph.py:67] Removing epsilons
[F] /usr/share/miniconda/envs/k2/conda-bld/k2_1625566162510/work/k2/csrc/tensor.cu:159:k2::Tensor::Tensor(k2::Dtype, const k2::Shape&, k2::RegionPtr, int32_t) Check failed: int64_t(impl_->byte_offset) + begin_elem * element_size >= 0 (-1938201564 vs. 0)

[ Stack-Trace: ]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x4c) [0x2aab3ff9021c]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/libk2context.so(k2::Tensor::Tensor(k2::Dtype, k2::Shape const&, std::shared_ptr<k2::Region>, int)+0x6da) [0x2aab3b54a4ea]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/libk2context.so(k2::Array2<int>::Col(int)+0x13a) [0x2aab3b4f5dda]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/libk2context.so(+0x270899) [0x2aab3b4e9899]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/libk2context.so(k2::Index(k2::RaggedShape&, int, k2::Array1<int> const&, k2::Array1<int>*)+0x1da) [0x2aab3b4ebf4a]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xb7335) [0x2aab3a342335]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x9d750) [0x2aab3a328750]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x1bdaf) [0x2aab3a2a6daf]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/bin/python(PyCFunction_Call+0x58) [0x5555556a82d8]
...

Traceback (most recent call last):
  File "./mmi_att_transformer_decode.py", line 608, in <module>
    main()
  File "./mmi_att_transformer_decode.py", line 498, in main
    HLG = compile_HLG(L=L,
  File "/exp/pzelasko/snowfall/snowfall/decoding/graph.py", line 68, in compile_HLG
    LG = k2.remove_epsilon(LG)
  File "/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/k2/fsa_algo.py", line 621, in remove_epsilon
    out_fsa = k2.utils.fsa_from_unary_function_ragged(fsa, ragged_arc, arc_map,
  File "/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/k2/utils.py", line 508, in fsa_from_unary_function_ragged
    new_value = index(value, arc_map)
  File "/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/k2/ops.py", line 335, in index
    return index_ragged(src, indexes)
  File "/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/k2/ops.py", line 283, in index_ragged
    return _k2.index(src, indexes)
RuntimeError: Some bad things happed.
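
The negative number in the failed check looks like 32-bit wraparound somewhere in the offset arithmetic. A toy check -- the begin_elem value below is made up, chosen only so that the product reproduces the logged value:

import ctypes

# Hypothetical numbers: if begin_elem * element_size is computed in 32-bit
# arithmetic, a large-but-valid product wraps around to a negative offset.
begin_elem = 589_191_433             # made-up element index
element_size = 4                     # bytes per int32 element
product = begin_elem * element_size  # 2_356_765_732 as a Python int
print(ctypes.c_int32(product).value) # -1938201564, matching the log
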
jtrmal commented 3 years ago

it's right there -- from the log I can clearly tell some bad things happed (sic). :) y.

pzelasko commented 3 years ago

@danpovey @csukuangfj FYI I got past the graph compilation issue by pruning the LM (just moving from 4-gram to 3-gram didn't help); but that means we have to be careful when comparing with Kaldi's numbers, which used a 4-gram LM for decoding.

pzelasko commented 3 years ago

... it'd be good to improve the error message to something like "The Fsa involved in this operation is too large to fit in the device memory."

pzelasko commented 3 years ago

Initial baseline results for the XS split (using the latest Conformer + MMI, a 3-gram LM pruned with a 1e-7 threshold for decoding, no rescoring by default)
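
Here "avg 5" means the parameters of the last 5 epoch checkpoints are averaged before decoding. A minimal sketch of that averaging (hypothetical helper; the "model" key in the checkpoint dict is an assumption):

import torch

def average_checkpoints(paths):
    # Sum parameter tensors across checkpoints, then divide by the count.
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]  # key assumed
        if avg is None:
            avg = {k: v.detach().clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}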

No acoustic context

10 epochs, avg 5

[DEV-no_rescore] %WER 57.60% [77720 / 134937, 365 ins, 55419 del, 21936 sub ]
[TEST-no_rescore] %WER 57.66% [237705 / 412242, 1027 ins, 172293 del, 64385 sub ]

20 epochs, avg 5

[DEV-no_rescore] %WER 55.43% [74789 / 134937, 560 ins, 50687 del, 23542 sub ]
[TEST-no_rescore] %WER 55.47% [228657 / 412242, 1609 ins, 156260 del, 70788 sub ]

30 epochs, avg 5

[DEV-no_rescore] %WER 55.37% [74721 / 134937, 657 ins, 49832 del, 24232 sub ]
[TEST-no_rescore] %WER 55.66% [229438 / 412242, 1719 ins, 154990 del, 72729 sub ]

Kaldi's numbers for this (using 4-gram decoding and RNNLM rescoring) are:

%WER 48.83 [ 62400 / 127790, 6087 ins, 12803 del, 43510 sub ] exp/chain_cleaned/cnn_tdnn_1c_spi_XS/decode_gigaspeech_dev_rnnlm/wer_10_0.5
%WER 48.17 [ 188198 / 390721, 20096 ins, 36896 del, 131206 sub ] exp/chain_cleaned/cnn_tdnn_1c_sp_XS/decode_gigaspeech_test_rnnlm/wer_9_0.5

I tried rescoring with the 4-gram, but it's either taking forever (whole-lattice rescoring) or blowing up memory / hitting "some bad things happed" with 100-best paths.

The error for 100-best:

[F] /usr/share/miniconda/envs/k2/conda-bld/k2_1625566162510/work/k2/csrc/intersect.cu:768:void k2::DeviceIntersector::PossiblyResizeHash(int32_t, int32_t) Check failed: min_num_buckets >= 0 (-1157976916 vs. 0)

[ Stack-Trace: ]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x4c) [0x2aab3ff9021c]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/libk2context.so(k2::DeviceIntersector::ForwardSortedA()+0x24a1) [0x2aab3b46b171]
/home/hltcoe/pzelasko/miniconda3/envs/snowfall2/lib/python3.8/site-packages/libk2context.so(k2::IntersectDevice(k2::Ragged<k2::Arc>&, int, k2::Ragged<k2::Arc>&, int, k2::Array1<int> const&, k2::Array1<int>*, k2::Array1<int>*, bool)+0x3a5) [0x2aab3b44b135]
pzelasko commented 3 years ago

Onto training with real acoustic context, which looks like this for 20s context windows:

[image: visualization of a 20s context-window cut]
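
In Lhotse terms, the idea is roughly the following -- a sketch assuming the generic Cut.truncate API (the actual recipe code may differ):

from lhotse import CutSet

def context_windows(cuts: CutSet, window: float = 20.0) -> CutSet:
    # For each supervision, carve out a `window`-second cut centered on it,
    # clamped to the recording boundaries, so the supervision keeps its
    # real left/right acoustic context.
    out = []
    for cut in cuts:
        for sup in cut.supervisions:
            center = sup.start + sup.duration / 2
            start = max(0.0, min(center - window / 2, max(0.0, cut.duration - window)))
            out.append(cut.truncate(offset=start,
                                    duration=min(window, cut.duration - start),
                                    keep_excessive_supervisions=False))
    return CutSet.from_cuts(out)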

csukuangfj commented 3 years ago

... it'd be good to improve the error message to sth like "The Fsa involved in this operation is too large to fit in the device memory."

I am wondering whether it's possible to output such a detailed message. @danpovey what do you think?


When you see "Some bad things happed" at the bottom, you're expected to check the error log above it, which contains the line causing the error. The stack trace is also included in the error log.

danpovey commented 3 years ago

We need to debug that thing with the Tensor assertion. It is not a simple question of overflow; that code should not overflow. We should get it in a debugger, print out the shape and byte_offset, and figure out where it came from. The k2 version/hash would help.

chenguoguo commented 3 years ago

Piotr, what is the GigaSpeech version that you are using? The latest is version 1.0.0. If you are using the version that Yenda downloaded, it's possible that the evaluation set is outdated. If that is the case, I can work with Yenda to download the latest to CLSP.

csukuangfj commented 3 years ago

Initial baseline results for the XS split

snowfall

[DEV-no_rescore] %WER 55.37% [74721 / 134937, 657 ins, 49832 del, 24232 sub ]
[TEST-no_rescore] %WER 55.66% [229438 / 412242, 1719 ins, 154990 del, 72729 sub ]

kaldi

%WER 48.83 [ 62400 / 127790, 6087 ins, 12803 del, 43510 sub ] exp/chain_cleaned/cnn_tdnn_1c_spi_XS/decode_gigaspeech_dev_rnnlm/wer_10_0.5
%WER 48.17 [ 188198 / 390721, 20096 ins, 36896 del, 131206 sub ] exp/chain_cleaned/cnn_tdnn_1c_sp_XS/decode_gigaspeech_test_rnnlm/wer_9_0.5

@pzelasko Do you know why the total number of words differs between the snowfall and kaldi WER reports?

pzelasko commented 3 years ago

Thanks guys for the comments.

@danpovey I think it is easily replicable — I am using the dict and lang directories generated by the Kaldi GigaSpeech recipe. I can help with debugging and provide the relevant info next week — will set up a separate issue discussion for that.

@chenguoguo yes, I might be using an older version from the CLSP grid. If you could coordinate with @jtrmal to get us the latest version, that would be great. BTW should I expect significant differences in the training data for the XS, S and M partitions? Wondering if I’ll need to re-train the model and re-extract the features.

@csukuangfj there are probably two reasons combined for the different word counts: 1) I am using an outdated version of GigaSpeech; 2) I haven't integrated their scoring script, which discards fillers like "uhm" -- incorporating that step will probably also yield a lower WER.
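
For reference, the kind of normalization their scoring applies before computing WER looks roughly like this (the token lists below are illustrative, not the exact ones from gigaspeech_scoring.py):

# Illustrative sketch of filler / non-speech token removal before scoring.
FILLERS = {"UH", "UM", "UHM", "MM", "HMM"}               # illustrative list
NON_SPEECH = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}  # illustrative list

def normalize_for_scoring(text: str) -> str:
    words = text.upper().split()
    return " ".join(w for w in words if w not in FILLERS | NON_SPEECH)

print(normalize_for_scoring("uhm I mean <NOISE> the stock went up"))
# -> "I MEAN THE STOCK WENT UP"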

dophist commented 3 years ago

@chenguoguo yes, I might be using an older version from the CLSP grid. If you could coordinate with @jtrmal to get us the latest version, that would be great. BTW should I expect significant differences in the training data for the XS, S and M partitions? Wondering if I’ll need to re-train the model and re-extract the features.

Hi Piotr, you don't need to retrain models; in the last update we only moved several audio files from the test set to the dev set. This does affect evaluation numbers, though, so the safe move is to update GigaSpeech.json to the latest version (v1.0.0) anyway. I just prepared a download link for you:

wget  https://swaphub.oss-cn-hangzhou.aliyuncs.com/GigaSpeech.json.gz

Also note that all numbers in the GigaSpeech paper are effectively based on GigaSpeech.json v1.0.0.
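
To double-check which metadata version you ended up with after the download (assuming the JSON carries a top-level version field; adjust if the schema differs):

import gzip
import json

with gzip.open("GigaSpeech.json.gz", "rt") as f:  # file from the wget above
    meta = json.load(f)
print(meta.get("version"))  # expecting v1.0.0; field name is an assumption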

pzelasko commented 3 years ago

I fixed the scoring and the GigaSpeech version. I have these results with training on 20s real-acoustic-context cuts, which seem not too far from Kaldi's (especially taking into account the smaller decoding LM and no rescoring, whereas Kaldi uses an RNNLM). The number of words is still slightly different but seems within acceptable limits.

DEV
     |--------------------------------------------------------------------|
     | SPKR   | # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
     |--------+-----------------+-----------------------------------------|
     |        |  5714   127774  | 48.5   16.3   35.2    1.4   52.9   98.6 |
     |====================================================================|

TEST
    |---------------------------------------------------------------------|
    | SPKR   |  # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
    |--------+------------------+-----------------------------------------|
    |        | 19930    390744  | 49.5   16.0   34.6    1.2   51.7   98.4 |
    |=====================================================================|

Unfortunately I don't have the isolated-utterance training baseline, as I made a mess out of my exp dirs -- I will re-run everything anyway, making sure that I report correct numbers for comparisons.

pzelasko commented 3 years ago

Some updates. I evaluated the model on XS in three variants:

  • "zero context": isolated utterances, no cut concatenation (with bucketing sampler)
  • "artificial context": isolated utterances with cut concatenation
  • "real context": 20s cuts with real acoustic context, with one cut created for each supervision in its center -- in this variant, an epoch is approx. 2.5x larger than with isolated utterances because some supervisions that are next to each other may be duplicated. To offset that, I train the isolated-utterance models for approx. 2.5x more epochs, so that the number of update steps is similar.

I'm only reporting the numbers for the best epochs.

Zero context

EPOCH 30 AVG 5

DEV
     |--------------------------------------------------------------------|
     | SPKR   | # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
     |--------+-----------------+-----------------------------------------|
     |        |  5714   127774  | 47.3   17.6   35.1    1.5   54.2   98.7 |
     |====================================================================|

TEST
    |---------------------------------------------------------------------|
    | SPKR   |  # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
    |--------+------------------+-----------------------------------------|
    |        | 19930    390744  | 48.3   17.3   34.4    1.3   53.0   98.4 |
    |=====================================================================|

Artificial context

EPOCH 30 AVG 5

DEV
     |--------------------------------------------------------------------|
     | SPKR   | # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
     |--------+-----------------+-----------------------------------------|
     |        |  5714   127774  | 48.9   16.5   34.6    1.5   52.6   98.6 |
     |====================================================================|

TEST
    |---------------------------------------------------------------------|
    | SPKR   |  # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
    |--------+------------------+-----------------------------------------|
    |        | 19930    390744  | 48.8   16.4   34.8    1.2   52.4   98.4 |
    |=====================================================================|

Real context

EPOCH 15 AVG 5

DEV
     |--------------------------------------------------------------------|
     | SPKR   | # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
     |--------+-----------------+-----------------------------------------|
     |        |  5714   127774  | 48.5   16.3   35.2    1.4   52.9   98.6 |
     |====================================================================|

TEST
    |---------------------------------------------------------------------|
    | SPKR   |  # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
    |--------+------------------+-----------------------------------------|
    |        | 19930    390744  | 49.5   16.0   34.6    1.2   51.7   98.4 |
    |=====================================================================|

The improvements are modest but might be real given that the test set is large. Notably, there is a much larger proportion of deletions compared to the Kaldi models.

I am evaluating the model on the S split (officially 250h of transcribed data, but in my setup there seems to be 400h). The model seems to be converging poorly, though -- the dev loss only got down to around 0.265-0.27 and then started growing. The WERs are not great:

S 18 epochs, avg 10, no context, 4 GPUs (max_duration=500s)

DEV
     |--------------------------------------------------------------------|
     | SPKR   | # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
     |--------+-----------------+-----------------------------------------|
     |        |  5714   127774  | 56.6    9.0   34.4    1.1   44.5   97.9 |
     |====================================================================|

TEST
    |---------------------------------------------------------------------|
    | SPKR   |  # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
    |--------+------------------+-----------------------------------------|
    |        | 19930    390744  | 56.5    9.1   34.4    0.9   44.4   97.5 |
    |=====================================================================|

I'll try training a small alimdl and re-running, as I suspect there could be issues with alignment.

danpovey commented 3 years ago

Perhaps you could look at the aligned transcripts, e.g. from write_error_stats(), and see if there is any pattern to the deletions -- e.g., do they occur near utterance boundaries?
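
Something like this (module path and exact signature are assumptions -- check the decode script for the real call):

# `write_error_stats` is assumed to live in snowfall.common and to take a
# file handle, a name, and (ref_words, hyp_words) pairs -- verify locally.
from snowfall.common import write_error_stats  # location assumed

results = [(["THEY", "STILL", "PAY", "YOU"], ["PAY"])]  # toy example
with open("errs-dev.txt", "w") as f:
    write_error_stats(f, "DEV", results)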

pzelasko commented 3 years ago

It looks like the deletions are uniformly dispersed over the whole utterances; e.g., this looks fairly representative of the whole dev set:

(THEY STILL->*) PAY (YOU WELL THE->*) FACT (THAT THE STOCK->*) CONTINUES (TO->*) GO (UP->*) IS GREAT PARTICULARLY FOR PEOPLE (WHO->*) HAVE OPTIONS (->*)
pzelasko commented 3 years ago

The alimdl doesn't seem to help; the WER is still above 40% after 9 epochs of training on S (compared to Kaldi's ~22%).

DEV
     |--------------------------------------------------------------------|
     | SPKR   | # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
     |--------+-----------------+-----------------------------------------|
     |        |  5715   127790  | 54.4   10.7   34.9    1.1   46.7   98.1 |
     |====================================================================|

TEST
    |---------------------------------------------------------------------|
    | SPKR   |  # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
    |--------+------------------+-----------------------------------------|
    |        | 19930    390744  | 54.5   10.5   35.0    0.8   46.3   97.9 |
    |=====================================================================|

Also, the XS numbers were too optimistic -- I discovered an issue where the GigaSpeech partitions had about 60% more data than they should have. The numbers after fixing the partitions are (compared to Kaldi's ~48%):

DEV
     |--------------------------------------------------------------------|
     | SPKR   | # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
     |--------+-----------------+-----------------------------------------|
     |        |  5714   127774  | 45.8   18.4   35.8    1.3   55.5   98.7 |
     |====================================================================|

TEST
    |---------------------------------------------------------------------|
    | SPKR   |  # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
    |--------+------------------+-----------------------------------------|
    |        | 19930    390744  | 46.1   17.7   36.2    1.1   55.0   98.6 |
    |=====================================================================|
pzelasko commented 3 years ago

Interestingly, fine-tuning the LibriSpeech model (which gets about 4.6% WER on test-clean) on the S (250h) partition doesn't help too much either. Without any tuning, it achieves about 51-52% on GigaSpeech DEV/TEST. I tried fine-tuning it with different learning rates for a few epochs, but I only got to ~43% on DEV/TEST with it.

There's still a large number of deletions. I looked a bit at the posteriors, but no clue there -- they don't seem to be "broken" in any way. I wonder if down-weighting the optional silence in the lexicon would help. I might also try training it together with the attention decoder. I'll also try to train with M, in the somewhat unlikely case that this is just a "small data" issue.

chenguoguo commented 3 years ago

Did you apply text normalization before scoring? That might make some difference when you compare with results from other toolkits; see here: https://github.com/SpeechColab/GigaSpeech#text-post-processing-before-scoring and the example scoring script: https://github.com/SpeechColab/GigaSpeech/blob/main/utils/gigaspeech_scoring.py

Guoguo

pzelasko commented 3 years ago

Yeah, I am using your scoring script. I will double-check that I wired everything up correctly, though.

pzelasko commented 3 years ago

The model trained on the M data (1000h) for 3 epochs is also giving 46% WER on DEV and TEST. I'll take a second (tenth) look at the data and scoring pipelines.

danpovey commented 3 years ago

I suppose it's possible that we checked in something at some point that broke the LibriSpeech scripts. Or there was some library change that needs calling-code changes in the scripts, and the timing meant that you lost those changes.

danpovey commented 3 years ago

It outputs detailed error stats; perhaps you can see if any particular words are always deleted?

danpovey commented 3 years ago

Also, regarding the graph compilation error quoted above (the tensor.cu "Check failed" assertion during k2.remove_epsilon): let's try to fix it -- can you please get it in a debugger and find out, for instance, impl_->byte_offset, begin_elem and element_size?

danpovey commented 3 years ago

.. I think re-running the LibriSpeech scripts from scratch to make sure they still work would be a good idea too. I looked at the script differences and couldn't see any problems. You said the dev loss started getting worse after some point -- how about the training set loss?

danpovey commented 2 years ago

it's the disambig symbol... words alphabetically after "s" are deleted.

pzelasko commented 2 years ago

it's the disambig symbol... words alphabetically after "s" are deleted.

Yeah, that was it. Time for some good news -- so far I have evaluated the "M" split model (1000h), and after the fix it gives better results than Kaldi, even with a weaker LM:

Kaldi:

%WER 17.96 [ 22955 / 127790, 3871 ins, 4954 del, 14130 sub ] exp/chain_cleaned/cnn_tdnn_1c_sp_M/decode_gigaspeech_dev_rnnlm/wer_11_0.0
%WER 17.53 [ 68508 / 390721, 9490 ins, 14359 del, 44659 sub ] exp/chain_cleaned/cnn_tdnn_1c_sp_M/decode_gigaspeech_test_rnnlm/wer_9_0.0

Conformer 20 epochs, 10 epochs avg, isolated utterances:

DEV
     |--------------------------------------------------------------------|
     | SPKR   | # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
     |--------+-----------------+-----------------------------------------|
     |        |  5715   127790  | 87.4    9.8    2.8    4.0   16.7   82.5 |
     |====================================================================|

TEST
    |---------------------------------------------------------------------|
    | SPKR   |  # Snt    # Wrd  | Corr    Sub    Del    Ins    Err  S.Err |
    |--------+------------------+-----------------------------------------|
    |        | 19930    390744  | 86.5   10.3    3.2    3.0   16.5   76.9 |
    |=====================================================================|

I'll re-run some previous experiments with XS and S, then make sure it's possible to run L and/or XL recipes, and then we can probably merge it.

Note: it's possibly of interest that if we hadn't used the official GigaSpeech scoring script (also used by Kaldi and ESPnet), which filters out some fillers, the WERs would have been around 21%. There seem to be quite a lot of conversational markers in this corpus.
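
For anyone hitting similar mass-deletion symptoms: the culprit was the cutoff used when blanking the disambiguation symbols on LG's output side. The logic is roughly the following (a sketch; file layout and variable names assumed) -- if the cutoff is computed from the wrong words.txt, every real word with a larger id is silently turned into epsilon, i.e. deleted:

def blank_word_disambig(LG, words_txt):
    # Build the word -> id mapping from the symbol table.
    word2id = {}
    with open(words_txt) as f:  # e.g. data/lang/words.txt (path assumed)
        for line in f:
            word, idx = line.split()
            word2id[word] = int(idx)
    # First disambiguation symbol (#0, #1, ...): everything at or above
    # this id on the output side should be a disambig symbol, not a word.
    first_disambig = min(i for w, i in word2id.items() if w.startswith("#"))
    aux = LG.aux_labels
    values = aux.values() if hasattr(aux, "values") else aux  # ragged or tensor
    values[values >= first_disambig] = 0  # blank only true disambig symbols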

pzelasko commented 2 years ago

BTW @danpovey @jtrmal is it possible to host the LM and the lexicon for GigaSpeech on OpenSLR? It would make my life simpler when adding the text/lexicon prep steps...

chenguoguo commented 2 years ago

We found an issue with the v1.0.0 json file: some audio files are missing (when compared with the experiments in the paper). We'll update the json, so @pzelasko please hold off a little bit on your XL run.

I was gonna ask: @jtrmal, Yenda, we talked about creating a page on OpenSLR, so perhaps we should go ahead and do it, and add those LM and lexicon download links.

Guoguo

pzelasko commented 2 years ago

OK, cool. I'll attend to other stuff before running XL then. With a little bit of luck, the recent changes in Lhotse will be sufficient to avoid the (quite costly in time and space) step of precomputing the features, and to train completely with on-the-fly feature extraction...
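
The on-the-fly variant would look roughly like this (a sketch; exact constructor arguments depend on the Lhotse version):

from lhotse import CutSet, Fbank
from lhotse.dataset import K2SpeechRecognitionDataset, OnTheFlyFeatures

# Extract fbank features in the dataloader workers instead of precomputing
# and storing them on disk first.
cuts = CutSet.from_jsonl("cuts_train.jsonl.gz")  # path assumed
dataset = K2SpeechRecognitionDataset(
    cuts,  # older Lhotse versions take the CutSet directly
    input_strategy=OnTheFlyFeatures(Fbank()),
)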

BTW @danpovey I remember about the bug -- will check it out also.

jtrmal commented 2 years ago

Yeah, absolutely -- can you send me a link to the files you want to be hosted, and I will handle it.

jtrmal commented 2 years ago

I mean just a link on the grid, e.g. :) Also, do not worry about preparing the description files -- I will do it.

chenguoguo commented 2 years ago

Great, I'll work on that, together with the models.

jimbozhang commented 2 years ago

Hi @pzelasko, I added some data and dict preparation scripts to pzelasko:feature/gigaspeech: https://github.com/pzelasko/snowfall/pull/2

And I'm testing this recipe on a Xiaomi server. The mmi_att_transformer_train is in progress now:

2021-07-23 20:37:44,439 INFO [mmi_att_transformer_train.py:278] batch 130, epoch 0/10 global average objf: 1.323525 over 1609177.0 frames (100.0% kept), current batch average objf: 1.333462 over 12351 frames (100.0% kept) avg time waiting for batch 0.058s
2021-07-23 20:39:04,668 INFO [mmi_att_transformer_train.py:278] batch 140, epoch 0/10 global average objf: 1.321570 over 1732300.0 frames (100.0% kept), current batch average objf: 1.274609 over 12349 frames (100.0% kept) avg time waiting for batch 0.056s
2021-07-23 20:40:32,999 INFO [mmi_att_transformer_train.py:278] batch 150, epoch 0/10 global average objf: 1.321179 over 1855032.0 frames (100.0% kept), current batch average objf: 1.336096 over 12225 frames (100.0% kept) avg time waiting for batch 0.055s
pzelasko commented 2 years ago

Cool! It should be working for all setups now, but for L and XL a couple of extra options will probably need to be specified (for prepare.py: --precomputed-features 0; for the training script: --shuffle 0 --check-cuts 0 --on-the-fly-feats 1; and possibly others). I will document them with examples in run.sh when I feel that the recipe is fully usable.

pzelasko commented 2 years ago

Some news:

I'm re-running everything from scratch to make sure the recipe has no issues. Other than missing results for L or XL and for acoustic-context-window training, this PR should be OK to merge, and we could add stuff in follow-up PRs as needed.

pzelasko commented 2 years ago

Assuming the LM and lexicon are there, I think the whole recipe is able to run in all configurations now. @jimbozhang I am not sure if the steps that you added can run from scratch -- could you double-check that? Thanks!

jimbozhang commented 2 years ago

Assuming the LM and lexicon are there, I think the whole recipe is able to run in all configurations now. @jimbozhang I am not sure if the steps that you added can run from scratch -- could you double-check that? Thanks!

I have run this whole recipe and got the model exp-conformer-mmi-att-sa-vgg-normlayer-gigaXS-/epoch-9.pt trained.

But the decoding fails with the following message: https://paste.ubuntu.com/p/Kvwt3DxrQP/

k2 version: 1.2
Build type: Debug
Git SHA1: eeeabf187aae5fb4bb91dc66dada32a0e555db6c
Git date: Fri Jul 23 15:39:48 2021
Cuda used to build k2: 11.0
cuDNN used to build k2: 8.2.0
Python version used to build k2: 3.8
OS used to build k2: CentOS Linux release 7.3.1611 (Core)
CMake version: 3.21.0
GCC version: 9.3.1
CMAKE_CUDA_FLAGS:  --compiler-options -rdynamic --compiler-options -lineinfo --expt-extended-lambda -gencode arch=compute_70,code=sm_70 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.7.1+cu110
PyTorch is using Cuda: 11.0
NVTX enabled: True
With CUDA: True
Disable debug: False
Sync kernels : False
Disable checks: False
csukuangfj commented 2 years ago

But the decoding fails with the following message:

It fails at:

Traceback (most recent call last):
  File "./mmi_att_transformer_decode.py", line 627, in <module>
    main()
  File "./mmi_att_transformer_decode.py", line 511, in main
    HLG = k2.Fsa.from_dict(d)

Could you check that your HLG.pt is not empty? @jimbozhang
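
E.g. (path assumed -- adjust to wherever your recipe writes it):

import os
import torch

path = "data/lang_nosp/HLG.pt"         # path assumed; use your lang dir
print(os.path.getsize(path), "bytes")  # 0 bytes would explain the failure
d = torch.load(path, map_location="cpu")
print(type(d))                         # k2.Fsa.from_dict expects a dict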

pzelasko commented 2 years ago

I never had this issue, but I might have different k2/CUDA/PyTorch versions... maybe @csukuangfj or @danpovey would know more and can help.

danpovey commented 2 years ago

Can you delete HLG.pt from the lang dir and try again? Also check that there are no empty or too-small files in that lang dir, such as L, G, and so on.

chenguoguo commented 2 years ago

@pzelasko @jtrmal We have fixed the GigaSpeech metadata and downloaded the most recent GigaSpeech.json file to CLSP; see here: /export/b01/guoguo/GigaSpeech/GigaSpeech.json (md5: 19c777dc296ff3eb714bc677a80620a3). We don't have to update the audio files, just this json file, so once you replace the old json with this one, you should be able to run the XL experiments. I didn't bump the GigaSpeech.json version number since this is technically the version that we used in the paper.
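
A quick way to verify the copy (path and md5 from above):

import hashlib

md5 = hashlib.md5()
with open("/export/b01/guoguo/GigaSpeech/GigaSpeech.json", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # stream; the file is large
        md5.update(chunk)
assert md5.hexdigest() == "19c777dc296ff3eb714bc677a80620a3"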

pzelasko commented 2 years ago

Thanks @chenguoguo, I will look into it.

I just pushed the results for S, M, and L into RESULTS.md. They are:

  • S: 19.6 / 19.2 %
  • M: 16.7 / 16.5 %
  • L: 16.1 / 15.9 %

These are for 20-epoch training, with the last 10 epochs used for averaging. All these results are with a pruned 3-gram LM based on Kaldi's original 4-gram model. There are probably gains to be made from larger models and better LMs.

I didn't use any rescoring, although I did re-run L with rescoring for num_paths=30; my results seemed worse, though (about 17.5% WER). I am not sure what the reason is, but it takes quite a long time and I don't want to investigate it right now (maybe somebody else wants to).

I am running XS right now, and will run XL later. I'll try to recompile the 4-gram later to debug the k2 issue I ran into before.

I think we should merge this soon even if not 100% complete, and possibly resolve any issues later... I'll need to stop working on this and take care of other things sometime soon.

chenguoguo commented 2 years ago

We tried LM rescoring with conformer models in other toolkits, and it also gave worse results. So this is at least consistent with your experiments, but it might be worth looking into.

Keep us updated on the XL results, and we also maintain a leader board here (https://github.com/SpeechColab/GigaSpeech#leaderboard) :-)

Guoguo

pzelasko commented 2 years ago

Oooh, a leaderboard -- now I can't resist 😈

danpovey commented 2 years ago

Cool! Let us know when you want us to merge this. I don't think we need to be super careful, given that we are currently rewriting some things for icefall.

pzelasko commented 2 years ago

I will make one more change: I want to remove the Apache Arrow stuff; it's unnecessarily complex, and I figured out a way to use samplers with regular JSONL manifests read sequentially from disk. It required some changes in Lhotse. I should be ready sometime this week (maybe even today).
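
The gist of the change -- a sketch assuming the lazy-manifest entry point (the exact name may differ across Lhotse versions):

from lhotse import CutSet

# Stream cuts sequentially from a JSONL manifest instead of materializing
# the whole CutSet (or Arrow tables) in memory.
cuts = CutSet.from_jsonl_lazy("cuts_train.jsonl.gz")  # entry point assumed
for cut in cuts:
    pass  # feed into a sampler / dataset as usual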

pzelasko commented 2 years ago

Let me do it the other way around -- I will merge it now, and then make other changes. I'm afraid of making any more changes within this PR ;)