k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Streaming server outputs only phonemes even with LG #499

Open kasidis-kanwat opened 8 months ago

kasidis-kanwat commented 8 months ago

I'm using a phone-based zipformer, but I cannot get the server to output graphemes even though I'm providing an LG graph to both the C++ API and the Python API.

This is what I tried:

sherpa-online-websocket-server \
  --decoding-method=fast_beam_search \
  --nn-model=../model_v1/jit_script_chunk_64_left_128.pt \
  --lg=/workdir/Desktop/sherpa/model_v1/lang_phone2/LG.pt \
  --tokens=../model_v1/lang_phone2/tokens.txt \
  --port=5051 \
  --decode-chunk-size=32 \
  --decode-left-context=128 \
  --doc-root=./sherpa/bin/web \
  --ngram-lm-scale=0.3

python3 ./sherpa/bin/streaming_server.py \
  --port=5051 \
  --decoding-method=fast_beam_search \
  --LG=../model_v1/lang_phone2/LG.pt \
  --nn-model=../model_v1/jit_script_chunk_64_left_128.pt \
  --tokens=../model_v1/lang_phone2/tokens.txt \
  --ngram-lm-scale=0.3
csukuangfj commented 8 months ago

Could you post what the above commands output?

kasidis-kanwat commented 8 months ago

There are actually no errors. Everything works perfectly except that the predicted text is in phonemes.

For example, here is the output from sherpa-online-websocket-server:

[I] /workdir/Desktop/sherpa/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-11-07 08:39:54.760 sherpa-online-websocket-server --decoding-method=fast_beam_search --nn-model=../model_v1/jit_script_chunk_64_left_128.pt --lg=/workdir/Desktop/sherpa/model_v1/lang_phone2/LG.pt --tokens=../model_v1/lang_phone2/tokens.txt --port=5051 --decode-chunk-size=32 --decode-left-context=128 --doc-root=./sherpa/bin/web --ngram-lm-scale=0.3

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/online-recognizer.cc:498:void sherpa::OnlineRecognizer::OnlineRecognizerImpl::WarmUp() 2023-11-07 08:39:55.314 WarmUp begins
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/online-recognizer.cc:521:void sherpa::OnlineRecognizer::OnlineRecognizerImpl::WarmUp() 2023-11-07 08:39:55.388 WarmUp ended
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:81:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560 Listening on: 5051

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:83:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560 Number of work threads: 5

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:119:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560
Please access the HTTP server using the following address:

http://localhost:5051

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server-impl.cc:272:void sherpa::OnlineWebsocketServer::OnOpen(connection_hdl) 2023-11-07 08:40:17.534 New connection: 127.0.0.1:47978. Number of active connections: 1.

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server-impl.cc:279:void sherpa::OnlineWebsocketServer::OnClose(connection_hdl) 2023-11-07 08:40:17.952 Number of active connections: 0

And the output from sherpa-online-websocket-client:

[I] /workdir/Desktop/sherpa/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-11-07 08:40:17.531 sherpa-online-websocket-client --server-port=5051 processed.wav 

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:223:void Client::SendMessage(websocketpp::connection_hdl, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> > >) 2023-11-07 08:40:17.534 Starting to send audio
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:261:void Client::SendMessage(websocketpp::connection_hdl, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> > >) 2023-11-07 08:40:17.534 Sent Done Signal
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.866 {"final":false,"segment":0,"start_time":0.0,"text":"pqq1t^","timestamps":[0.9599999785423279,1.0,1.0799999237060547],"tokens":["p","qq1","t^"]}
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.951 {"final":true,"segment":0,"start_time":0.0,"text":"pqq1t^fa0j^duua2j^","timestamps":[0.9599999785423279,1.0,1.0799999237060547,1.2799999713897705,1.4399999380111694,1.5199999809265137,1.6799999475479126,1.7599999904632568,2.200000047683716],"tokens":["p","qq1","t^","f","a0","j^","d","uua2","j^"]}
processed.wavpqq1t^fa0j^duua2j^
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:103:Client::Client(asio::io_context&, const string&, int16_t, const string&, float, int32_t, std::string)::<lambda(websocketpp::connection_hdl)> 2023-11-07 08:40:17.952 Disconnected
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:337:int32_t main(int32_t, char**) 2023-11-07 08:40:17.952 Done!
csukuangfj commented 8 months ago

Can you check that your LG is correct?

Have you tested the LG and pre-trained models provided by us in the doc?

kasidis-kanwat commented 8 months ago

Can you check that your LG is correct?

I have used this LG graph to replace trivial_graph in zipformer/decode.py, and it correctly outputs graphemes. Is there another preferred way to verify its correctness?

fast_beam_search_nbest
00d0fb48e21f4b0086ac94eb7723f150-27530: ref=['ph', 'aa2', 'p^', 'j', 'a1', 'j^', 'kh', 'vv0', 'z^']
00d0fb48e21f4b0086ac94eb7723f150-27530: hyp=['ph', 'aa2', 'p^', 'j', 'a1', 'j^', 'kh', 'vv0', 'z^']

fast_beam_search_nbest_LG
00d0fb48e21f4b0086ac94eb7723f150-27530: ref=['ภาพ', 'ใหญ่', 'คือ']
00d0fb48e21f4b0086ac94eb7723f150-27530: hyp=['ภาพ', 'ใหญ่', 'คือ']

Have you tested the LG and pre-trained models provided by us in the doc?

I tested two models and both seem to work.

icefall-asr-librispeech-streaming-zipformer-2023-05-17

./test_wavs/1221-135766-0002.wav                                                                                                                                                                            
 YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION                                                                                                                                 
{"final":true,"segment":0,"start_time":0.0,"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":[0.6399999856948853,0.7199999690055847,0.9599999785423279,1.03$9999618530273,1.1999999284744263,1.399999976158142,1.5999999046325684,1.6799999475479126,1.71999990940094,1.7999999523162842,1.8399999141693115,2.0399999618530273,2.119999885559082,2.2799999713897705,2.4$0000057220459,2.5199999809265137,2.6399998664855957,2.679999828338623,2.919999837875366,2.9600000381469727,3.240000009536743,3.4800000190734863,3.6399998664855957,3.879999876022339,4.159999847412109,4.27$999732971191,4.319999694824219,4.519999980926514,4.599999904632568,4.679999828338623,4.759999752044678],"tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N$,"NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}

icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming

./test_wavs/DEV_T0000000000.wav
对我介绍我想
{"final":true,"segment":0,"start_time":0.0,"text":"对我介绍我想","timestamps":[0.47999998927116394,0.5999999642372131,1.0399999618530273,1.159999966621399,2.2799999713897705,2.3999998569488525],"tokens":["对","我","介","绍","我","想"]}

./test_wavs/DEV_T0000000001.wav
重点三个问题首先表现
{"final":true,"segment":0,"start_time":0.0,"text":"重点三个问题首先表现","timestamps":[0.35999998450279236,0.4399999976158142,1.0799999237060547,1.2400000095367432,1.399999976158142,1.6399999856948853,2.319999933242798,2.4800000190734863,4.679999828338623,4.880000114440918],"tokens":["重","点","三","个","问","题","首","先","表","现"]}

./test_wavs/DEV_T0000000002.wav
分析这一次全球进动脑
{"final":true,"segment":0,"start_time":0.0,"text":"分析这一次全球进动脑","timestamps":[1.1200000047683716,1.399999976158142,1.7999999523162842,2.0,2.240000009536743,2.759999990463257,2.879999876022339,3.0799999237060547,3.2799999713897705,3.4800000190734863],"tokens":["分","析","这","一","次","全","球","进","动","脑"]}
csukuangfj commented 8 months ago

For the following output:

[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void 
Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.951 {"final":true,"segment":0,"start_time":0.0,
"text":"pqq1t^fa0j^duua2j^",
"timestamps":[0.9599999785423279,1.0,1.0799999237060547,1.2799999713897705,1.4399999380111694,1.5199999809265137,1.6799999475479126,1.7599999904632568,2.200000047683716],
"tokens":["p","qq1","t^","f","a0","j^","d","uua2","j^"]}

what are your expected text and tokens?

kasidis-kanwat commented 8 months ago

I would expect the text to be the grapheme form of "pqq1t^fa0j^duua2j^", i.e., "เปิด ไฟ ด้วย", since my lexicon looks something like this:

...
เปิด p qq1 t^
ไฟ f a0 j^
ด้วย d uua2 j^
...
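To make the expected mapping concrete, here is a small illustrative sketch (not how sherpa or LG decoding actually works internally; the LG graph resolves ambiguity during the FSA search, whereas this toy version just does greedy longest-match lookup against a hand-written fragment of the lexicon above):

```python
# Toy lexicon fragment from the issue: word -> phoneme pronunciation.
# This greedy longest-match lookup is only an illustration of the
# desired phoneme-to-word mapping, not sherpa's actual LG decoding.
LEXICON = {
    ("p", "qq1", "t^"): "เปิด",
    ("f", "a0", "j^"): "ไฟ",
    ("d", "uua2", "j^"): "ด้วย",
}

def phonemes_to_words(tokens):
    """Greedily match the longest phoneme n-gram found in the lexicon."""
    words, i = [], 0
    max_len = max(len(k) for k in LEXICON)
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                words.append(LEXICON[key])
                i += n
                break
        else:
            # No match: keep the raw phoneme so nothing is silently dropped.
            words.append(tokens[i])
            i += 1
    return " ".join(words)

# The decoded tokens from the server log above:
print(phonemes_to_words(["p", "qq1", "t^", "f", "a0", "j^", "d", "uua2", "j^"]))
# → เปิด ไฟ ด้วย
```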

As for the tokens, I'm not certain, but they should probably stay as phonemes, since the model was trained to predict phonemes.

kasidis-kanwat commented 7 months ago

@csukuangfj may I ask about the status of this issue? Thank you.

csukuangfj commented 7 months ago

I'm sorry for not getting back to you sooner.

I see the problem now.

During decoding, we save only the decoded tokens: https://github.com/k2-fsa/sherpa/blob/a34c2c83bc07ad5f99f44313bf12fa36017ebe17/sherpa/csrc/online-transducer-fast-beam-search-decoder.cc#L133

This is not a problem for BPE-based models, since we can recover the correct words by simply concatenating all the BPE tokens and then replacing the word-boundary marker ▁ with a space.
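The BPE detokenization step can be sketched in a couple of lines (a simplified illustration; the marker is ▁, U+2581, in SentencePiece-style models, and token strings may vary by tokenizer):

```python
# Simplified BPE detokenization: concatenate subword tokens, then
# turn the SentencePiece word-boundary marker "\u2581" into a space.
def bpe_tokens_to_text(tokens, marker="\u2581"):
    return "".join(tokens).replace(marker, " ").strip()

print(bpe_tokens_to_text(["\u2581HE", "LL", "O", "\u2581WORLD"]))
# → HELLO WORLD
```

This is why BPE-based recognizers can ignore word IDs entirely: the word boundaries are already encoded in the tokens themselves, which is exactly what a phone-based model lacks.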

For non-BPE-based models, we also need to save the word_ids during decoding and reconstruct the text from the word_ids.

We also need to pass words.txt so that we can map word IDs to strings.

kasidis-kanwat commented 7 months ago

Thank you for responding so quickly. I will attempt to implement the fix as soon as I have the time.

kerolos commented 2 months ago

Hello @kasidis-kanwat @csukuangfj, I would really appreciate it if you could answer my questions:

1. Which Zipformer model works well with a phone-based lexicon (with results comparable to BPE)? Which model and recipe do you recommend, and are any changes to the model parameters needed? (The tiny model in egs/librispeech/ASR/tiny_transducer_ctc does not seem to achieve a good CER.)
2. Can it be converted to an ONNX int8 model for Sherpa?
3. Is it possible to decode with an LM?
4. Does it support the new-word feature, contextual biasing (hotwords)?
5. Does it support multiple variant transcriptions per word?

Thanks in advance,