k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Validation-Loss is infinite when running egs/aishell/conformer-mmi #734

Open davidlin409 opened 1 year ago

davidlin409 commented 1 year ago

Dear all:

Recently I tried to adopt LF-MMI when training a "tdnn-lstm" model on the AISHELL dataset, and I found that the validation loss is always infinite, while the training loss is perfectly normal.

In order to debug it, I ran the "conformer-mmi" script in egs/aishell to make sure I had not introduced a programming error myself (I only changed the validation interval from the original 3000 to 1000). The training result shows that with the egs/aishell/conformer-mmi training script, the validation loss is still infinite. So the original script already has the "validation loss is infinite" issue.

Commit hash for "icefall" is a7fbb18b.

Currently I am out of ideas for debugging this issue. Is there anything I am missing?

davidlin409 commented 1 year ago

From some experiments, it appears that the infinite validation loss might be due to the "HLG" graph (used in training) not reaching the end state of the FSA, causing get_total_loss to return -inf.

The loss is $-1 \times \text{get\_total\_loss}$, so the resulting validation loss is infinite. One simple way to resolve the issue is to include the validation-set transcripts when building the "bigram P" in the AISHELL example; after doing so, there is no infinite loss anymore.

However, even in the LibriSpeech example, the validation-set transcripts are not included when building the "bigram P". So is "including validation-set transcripts in the bigram-P building process" a legitimate way to do it?

PS: Here "HLG" does not mean the "HLG.pt" generated in the prepare stage; it is the composition of HP, L, and the training transcripts.
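The mechanism can be illustrated with a toy forward pass in the log semiring (pure Python, not k2; the state layout and arc scores are made up for illustration):

```python
import math

def logadd(a, b):
    """log(exp(a) + exp(b)), safe when either argument is -inf."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def total_log_score(arcs, num_states, final_state):
    """Forward (log-semiring) score over an acyclic toy FSA.

    arcs is a list of (src, dst, log_score) with src < dst.
    Returns -inf if no path reaches final_state.
    """
    alpha = [-math.inf] * num_states
    alpha[0] = 0.0  # start state
    for src, dst, w in sorted(arcs):
        alpha[dst] = logadd(alpha[dst], alpha[src] + w)
    return alpha[final_state]

# A graph whose path reaches the final state (state 3) ...
ok = total_log_score([(0, 1, -0.1), (1, 2, -0.2), (2, 3, -0.3)], 4, 3)
# ... versus one whose arcs dead-end before the final state.
dead = total_log_score([(0, 1, -0.1), (1, 2, -0.2)], 4, 3)

print(ok)         # ~ -0.6: a finite total score
print(-1 * dead)  # inf: the total score is -inf, so the loss blows up
```

An empty or dead-end graph thus yields a total score of exactly -inf, and negating it gives the infinite loss observed above.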

yaozengwei commented 1 year ago

We don't have an HLG in MMI training. It is HP (H is the ctc-topo, P is the token-level bigram). I think it would cause data leakage if we included the validation-set transcripts when building the bigram P.
One option is to try our new MMI recipe (https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer_mmi), in which we made some updates, e.g., using a CTC warmup to stabilize training and removing the attention head.
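For reference, the CTC warmup idea can be sketched roughly as follows (a hypothetical linear ramp; the actual scaling and schedule in the zipformer_mmi recipe may differ):

```python
def mmi_scale(batch_idx: int, warmup_batches: int = 2000) -> float:
    """Linearly ramp the MMI weight from 0 to 1 over warmup_batches."""
    return min(1.0, batch_idx / warmup_batches)

def combined_loss(ctc_loss: float, mmi_loss: float, batch_idx: int) -> float:
    """Early in training the (always finite) CTC loss dominates,
    which stabilizes training before MMI takes over."""
    s = mmi_scale(batch_idx)
    return (1.0 - s) * ctc_loss + s * mmi_loss

print(combined_loss(2.0, 4.0, 0))     # 2.0: pure CTC at the start
print(combined_loss(2.0, 4.0, 1000))  # 3.0: halfway through warmup
print(combined_loss(2.0, 4.0, 5000))  # 4.0: pure MMI after warmup
```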

davidlin409 commented 1 year ago

For the NN model, I am using a TDNN-F without LSTM, so the attention-related changes might not fit my situation.

I just tried the warm-up strategy to see if it helps. It turns out that the validation loss still contains lots of infinities (>95%).

Indeed, using validation-set transcripts raises the "data leakage" concern. However, since MMI's HP graph is built by intersecting H with the bigram P, the process only keeps phoneme transitions that occur in the training data; some transitions occurring in the validation set may be removed, which causes the infinity issue.

I also observed that, during training, the ratio of "validation-set loss being infinity" remains constant throughout the entire training period; might that also support the theory of "no end state causing the loss to be infinite"?


On the other hand, when I use the dictionary (to build the lexicon) from AISHELL, it causes lots of infinities in the validation-set loss; when CEDict is used to generate the lexicon, there is no infinite validation-set loss at all. So maybe it is because the AISHELL dictionary contains many identical words with different pronunciations?

csukuangfj commented 1 year ago

while there might be possibility that part of transition in validation-set is removed

There are backoff arcs in P and H is fully connected. Are you able to find an utterance from the validation set that has an empty output when it is intersected with HP?
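As a toy illustration of what such a check looks for (pure Python standing in for the k2 intersection; the bigram sets below are made up):

```python
def survives_bigram_acceptor(tokens, train_bigrams, has_backoff):
    """Toy stand-in for intersecting a linear transcript FSA with HP.

    Without backoff arcs, any bigram unseen in training kills the path,
    leaving an empty intersection (and hence a -inf total score).
    With backoff arcs, every transition has an escape route.
    """
    if has_backoff:
        return True  # backoff arcs accept any transition
    return all((a, b) in train_bigrams for a, b in zip(tokens, tokens[1:]))

train_bigrams = {("hello", "world"), ("world", "again")}
print(survives_bigram_acceptor(["hello", "world"], train_bigrams, False))  # True
print(survives_bigram_acceptor(["hello", "there"], train_bigrams, False))  # False: empty result
print(survives_bigram_acceptor(["hello", "there"], train_bigrams, True))   # True: saved by backoff
```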

davidlin409 commented 1 year ago

I can try to obtain the validation HP graph built with the AISHELL dictionary (but I might only be able to do it next Monday, since I am taking some days off).

yaozengwei commented 1 year ago

For NN model, I am using TDNN-F without LSTM, so attention related changes might not fit to my situation.

The attention head I mentioned is the transformer decoder branch used when you set --att-rate. Now we don't have att_loss in the new zipformer-mmi recipe.

davidlin409 commented 1 year ago

Hi @csukuangfj

For validation data with infinite validation loss, I tried to obtain the numerator/denominator graphs (using the MMI graph) by first collecting the transcripts that cause the validation loss to be infinite, and then rebuilding the num/den graphs from those transcripts.

During the process, I did find one bug in my preparation stage. After fixing it, I still got an infinite validation loss in around half of all batches. Within each such batch, the number of entries with infinite validation loss is actually only 1/34 of the whole batch.

What I found is that an infinite validation loss is strongly related to the corresponding transcript containing UNK word(s). For transcripts without any UNK, the validation loss is normal. I tried 600/1000/1600 batches, and I can reproduce that observation.

So maybe the "validation loss is infinite" issue is due to the UNK path not being available during HP graph construction?

I attach the resulting num-graph of a batch with infinite validation loss, plus the logging of some statistics that support the conclusion above:

attachment_david.zip

PS: The reason UNK appears in the validation data is that I restrict the dictionary to only the words that appear in the training data, and some words in the validation data do not appear in the training data. This method originates from an old Kaldi setup, and I may try using the whole dictionary instead.
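The "Ratio of words not in symbol table" statistic in my logging can be computed with a small helper (a hypothetical snippet, not code from the recipe):

```python
def oov_stats(transcript: str, symbol_table: set) -> tuple:
    """Count how many words of a space-separated transcript are OOV,
    i.e. missing from the symbol table built on training data."""
    words = transcript.split()
    oov = [w for w in words if w not in symbol_table]
    return len(oov), len(words)

symbols = {"hello", "world"}
print(oov_stats("hello there world", symbols))  # (1, 3): one OOV out of three words
```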

csukuangfj commented 1 year ago
[F] /usr/share/miniconda/envs/k2/conda-bld/k2_1668091213224/work/k2/csrc/array.h:177:k2::Array1<T> k2::Array1<T>::Arange(int32_t, int32_t) const [with T = k2::Any; int32_t = int] Check failed: start <= dim_ (19656 vs. 3906) 

[ Stack-Trace: ]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/k2/lib/libk2_log.so(k2::internal::GetStackTrace()+0x47) [0x7f4c74c8d707]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x2c84a) [0x7f4c7b25d84a]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x15d531) [0x7f4c7b38e531]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x1392a3) [0x7f4c7b36a2a3]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x130228) [0x7f4c7b361228]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x24aac) [0x7f4c7b255aac]
python() [0x4fe407]
python(_PyObject_MakeTpCall+0x25b) [0x4f7e1b]
python() [0x50a2df]
python(_PyEval_EvalFrameDefault+0x13b3) [0x4f0033]
python() [0x543c42]
python() [0x543aa8]
python(_PyEval_EvalFrameDefault+0xb9e) [0x4ef81e]
python() [0x594bb2]
python(PyEval_EvalCode+0x87) [0x594af7]
python() [0x5c6e07]
python() [0x5c1ce0]
python() [0x45ae5b]
python(_PyRun_SimpleFileObject+0x19f) [0x5bc1ff]
python(_PyRun_AnyFileObject+0x43) [0x5bc003]
python(Py_RunMain+0x38d) [0x5b8e1d]
python(Py_BytesMain+0x39) [0x587c69]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f4d0d3bc0b3]
python() [0x587b1e]

Idx 33 - Transcript = 从 德国 寄 来 的 信封 里 是 三条 泡 着 福尔马 林 的 鲤 鱼
Idx 33 - Ratio of words not in symbol table = 1 / 16

Could you use pdb to find out which statement causes the above exception?

We have considered OOV words during construction of HP. The above exception may be caused by some bug.

davidlin409 commented 1 year ago

Could you use pdb to find out which statement causes the above exception?

Previously I hid the message from the RuntimeError exception. The full traceback is as follows (using pdb):

Idx 0 - Transcript = 国家 支持 民营 企业 充分 利用 新型 金融 工具 融资
Idx 0 - Ratio of words not in symbol table = 0 / 10
Idx 1 - Transcript = 该 公司 将 开放 数千 项 氢 和 燃料电池 相关 专利
Idx 1 - Ratio of words not in symbol table = 0 / 11
Idx 2 - Transcript = 但是 改善 型 置业 入市 积极性 下降
Idx 2 - Ratio of words not in symbol table = 0 / 7
Idx 3 - Transcript = 中国 仍 为 世界上 增长 最 快 的 大型 经济 体 之一
Idx 3 - Ratio of words not in symbol table = 0 / 12
Idx 4 - Transcript = 因此 北京 未来 仍 主要 需要 依靠 第三产业 发展
Idx 4 - Ratio of words not in symbol table = 0 / 9
Idx 5 - Transcript = 本 次 跑步 活动 挑战 难度 相比 以往 全面 升级
Idx 5 - Ratio of words not in symbol table = 0 / 10
Idx 6 - Transcript = 要求 大赛 的 裁判 长 来 进行 最终 裁决
Idx 6 - Ratio of words not in symbol table = 0 / 9
Idx 7 - Transcript = 尤其 肉类 要 经过 体 科 所 的 检验 才能 放心 进入 厨房
Idx 7 - Ratio of words not in symbol table = 0 / 13
Idx 8 - Transcript = 答 海水淡化 是 从 海水 中 提取 淡水
Idx 8 - Ratio of words not in symbol table = 0 / 8
Idx 9 - Transcript = 国乒 在 欧洲 还有 瓦 尔 德内 尔 这样的 对手
Idx 9 - Ratio of words not in symbol table = 0 / 10
Idx 10 - Transcript = 她 谈 起 了 自己 的 新 歌 希望 可以 传递 正 能量
Idx 10 - Ratio of words not in symbol table = 0 / 13
Idx 11 - Transcript = 这样的 孵化 平台 也 成为 极客 与 资本 沟通 联手 的 桥梁
Idx 11 - Ratio of words not in symbol table = 0 / 12
Idx 12 - Transcript = 而 房地产 企业 获得 的 收入 又是 持续 升值 的 人民币
Idx 12 - Ratio of words not in symbol table = 0 / 11
Idx 13 - Transcript = 价格 在 三 万元 的 二手房 购房 个案 当中
Idx 13 - Ratio of words not in symbol table = 0 / 9
Idx 14 - Transcript = 广州市 公积金 中心 虽然 没有 正式 下文
Idx 14 - Ratio of words not in symbol table = 0 / 7
Idx 15 - Transcript = 因为 我 一个人 的 失误 而 让 我们 丢掉 这么 宝贵
Idx 15 - Ratio of words not in symbol table = 0 / 11
Idx 16 - Transcript = 竞买 申请人 须 在 南沙区 注册 成立 项目 公司
Idx 16 - Ratio of words not in symbol table = 0 / 9
Idx 17 - Transcript = 对手 现阶段 已经 达到 了 冲击 国乒 主力 的 程度
Idx 17 - Ratio of words not in symbol table = 0 / 10
Idx 18 - Transcript = 搜狐 娱乐 讯 据 香港 媒体 报道
Idx 18 - Ratio of words not in symbol table = 0 / 7
Idx 19 - Transcript = 但 同比 仍然 显示 出 上升趋势 而 截至 目前
Idx 19 - Ratio of words not in symbol table = 0 / 9
Idx 20 - Transcript = 所以 我 在 进入 决赛 跑道 的时候 我 对 我的 队友 说
Idx 20 - Ratio of words not in symbol table = 0 / 12
Idx 21 - Transcript = 却 在 不经意间 成了 话题 中心 一 条 几百 字 的 微博
Idx 21 - Ratio of words not in symbol table = 0 / 12
Idx 22 - Transcript = 生产 日期 有 涂改 且 包装 错 乱
Idx 22 - Ratio of words not in symbol table = 0 / 8
Idx 23 - Transcript = 希望 在 大 中 城市 定居 的 农民工 占 到 一半 以上
Idx 23 - Ratio of words not in symbol table = 0 / 12
Idx 24 - Transcript = 曝 陈冠希 怒 摔 插队 大叔 身份证 现已 和解
Idx 24 - Ratio of words not in symbol table = 0 / 9
Idx 25 - Transcript = 现场 照片 法 晚 深度 即时 一零 月 二 二 日
Idx 25 - Ratio of words not in symbol table = 0 / 11
Idx 26 - Transcript = 于 二零一三 年 成为 中国 最大 的 芯片 设计 企业
Idx 26 - Ratio of words not in symbol table = 0 / 10
Idx 27 - Transcript = 医疗 服务 价格 长期 低于 成本 且 未能 进行 动态 调整
Idx 27 - Ratio of words not in symbol table = 0 / 11
Idx 28 - Transcript = 由于 开赛 前 惠若琪 临时 缺阵
Idx 28 - Ratio of words not in symbol table = 0 / 6
Idx 29 - Transcript = 发现 文化 艺术 出版社 冒名 出版 悬崖 边 的 辩护
Idx 29 - Ratio of words not in symbol table = 0 / 10
Idx 30 - Transcript = 这 是 福星 布局 北京 的 第 一个 项目
Idx 30 - Ratio of words not in symbol table = 0 / 9
Idx 31 - Transcript = 但 白宫 经过 了 几个月 的 研究 之后
Idx 31 - Ratio of words not in symbol table = 0 / 8
Idx 32 - Transcript = 四 部 宣传 片 看似 没 联系 实际 讲 故事
Idx 32 - Ratio of words not in symbol table = 0 / 10
[F] /usr/share/miniconda/envs/k2/conda-bld/k2_1668091213224/work/k2/csrc/array.h:177:k2::Array1<T> k2::Array1<T>::Arange(int32_t, int32_t) const [with T = k2::Any; int32_t = int] Check failed: start <= dim_ (16904 vs. 2912) 

[ Stack-Trace: ]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/k2/lib/libk2_log.so(k2::internal::GetStackTrace()+0x47) [0x7fa4bffa1707]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x2c84a) [0x7fa4c657184a]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x15d531) [0x7fa4c66a2531]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x1392a3) [0x7fa4c667e2a3]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x130228) [0x7fa4c6675228]
/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x24aac) [0x7fa4c6569aac]
python() [0x4fe407]
python(_PyObject_MakeTpCall+0x25b) [0x4f7e1b]
python() [0x50a2df]
python(_PyEval_EvalFrameDefault+0x13b3) [0x4f0033]
python() [0x543c42]
python() [0x543aa8]
python(_PyEval_EvalFrameDefault+0xb9e) [0x4ef81e]
python() [0x594bb2]
python(PyEval_EvalCode+0x87) [0x594af7]
python() [0x59bfdd]
python() [0x4fea34]
python() [0x4e89c5]
python(_PyEval_EvalFrameDefault+0x72bc) [0x4f5f3c]
python() [0x594bb2]
python(PyEval_EvalCode+0x87) [0x594af7]
python() [0x59bfdd]
python() [0x4fea34]
python() [0x4e89c5]
python(_PyEval_EvalFrameDefault+0x72bc) [0x4f5f3c]
python(_PyFunction_Vectorcall+0x6f) [0x4fe84f]
python(_PyEval_EvalFrameDefault+0x731) [0x4ef3b1]
python(_PyFunction_Vectorcall+0x6f) [0x4fe84f]
python(_PyEval_EvalFrameDefault+0x731) [0x4ef3b1]
python(_PyFunction_Vectorcall+0x6f) [0x4fe84f]
python(_PyEval_EvalFrameDefault+0x4b4e) [0x4f37ce]
python() [0x594bb2]
python(PyEval_EvalCode+0x87) [0x594af7]
python() [0x59bfdd]
python() [0x4fea34]
python(_PyEval_EvalFrameDefault+0x31f) [0x4eef9f]
python(_PyFunction_Vectorcall+0x6f) [0x4fe84f]
python(_PyEval_EvalFrameDefault+0x31f) [0x4eef9f]
python(_PyFunction_Vectorcall+0x6f) [0x4fe84f]
python() [0x5b906f]
python(Py_RunMain+0xc2) [0x5b8b52]
python(Py_BytesMain+0x39) [0x587c69]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fa55873f0b3]
python() [0x587b1e]

Traceback (most recent call last):
  File "/home/davidlin409/miniconda3/envs/k22/lib/python3.10/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/home/davidlin409/miniconda3/envs/k22/lib/python3.10/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/home/davidlin409/miniconda3/envs/k22/lib/python3.10/bdb.py", line 597, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/davidlin409/Workspace/ite-icefall-center-testing/egs/aishell/test.py", line 45, in <module>
    item = num[idx]     # This is where error happens
  File "/home/davidlin409/miniconda3/envs/k22/lib/python3.10/site-packages/k2/fsa.py", line 1029, in __getitem__
    value.arange(axis=0, begin=start, end=end))
RuntimeError: 
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

Is this information sufficient?

The script used to generate RuntimeError above is test.zip.

davidlin409 commented 1 year ago

When I check the P-gram generated from K2-CTC, there are indeed back-off paths from any symbol back to blank.

But I have a thought about what might be happening (I don't know if it is correct or not):

In the LF-MMI training stage, the CTC-P graph is intersected with the P-gram, where the P-gram is generated from phone sequences translated from the transcripts via the provided lexicon.

So consider a situation where, for one lexicon, there are no OOVs in the training transcripts, while there are OOVs in the validation transcripts. LF-MMI combines the P-gram with CTC-P, and the P-gram generated from the training transcripts has no OOV path; the resulting intersected P-gram (combined from the transcript P-gram and CTC-P) might therefore have the OOV paths removed.

And at this point a validation transcript contains OOV; when that validation transcript is composed with a P-gram lacking an OOV path, the resulting final training graph (HP graph) might be empty. Therefore, the loss computed from it might be $-\infty$. Maybe that is the cause of the infinite validation loss?

On the other hand, the bug shown above happens when I try to get an empty FST (or FSA) out of an FSA/FST array. Also, the stack trace shows that the bug happens in the __getitem__ API in "k2/fsa.py".