kasidis-kanwat opened 8 months ago
Could you post what the above commands output?
There are actually no errors. Everything works except that the predicted text is in phonemes.
For example,
output from sherpa-online-websocket-server
[I] /workdir/Desktop/sherpa/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-11-07 08:39:54.760 sherpa-online-websocket-server --decoding-method=fast_beam_search --nn-model=../model_v1/jit_script_chunk_64_left_128.pt --lg=/workdir/Desktop/sherpa/model_v1/lang_phone2/LG.pt --tokens=../model_v1/lang_phone2/tokens.txt --port=5051 --decode-chunk-size=32 --decode-left-context=128 --doc-root=./sherpa/bin/web --ngram-lm-scale=0.3
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/online-recognizer.cc:498:void sherpa::OnlineRecognizer::OnlineRecognizerImpl::WarmUp() 2023-11-07 08:39:55.314 WarmUp begins
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/online-recognizer.cc:521:void sherpa::OnlineRecognizer::OnlineRecognizerImpl::WarmUp() 2023-11-07 08:39:55.388 WarmUp ended
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:81:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560 Listening on: 5051
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:83:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560 Number of work threads: 5
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server.cc:119:int32_t main(int32_t, char**) 2023-11-07 08:39:55.560
Please access the HTTP server using the following address:
http://localhost:5051
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server-impl.cc:272:void sherpa::OnlineWebsocketServer::OnOpen(connection_hdl) 2023-11-07 08:40:17.534 New connection: 127.0.0.1:47978. Number of active connections: 1.
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-server-impl.cc:279:void sherpa::OnlineWebsocketServer::OnClose(connection_hdl) 2023-11-07 08:40:17.952 Number of active connections: 0
output from sherpa-online-websocket-client
[I] /workdir/Desktop/sherpa/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-11-07 08:40:17.531 sherpa-online-websocket-client --server-port=5051 processed.wav
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:223:void Client::SendMessage(websocketpp::connection_hdl, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> > >) 2023-11-07 08:40:17.534 Starting to send audio
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:261:void Client::SendMessage(websocketpp::connection_hdl, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> > >) 2023-11-07 08:40:17.534 Sent Done Signal
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.866 {"final":false,"segment":0,"start_time":0.0,"text":"pqq1t^","timestamps":[0.9599999785423279,1.0,1.0799999237060547],"tokens":["p","qq1","t^"]}
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.951 {"final":true,"segment":0,"start_time":0.0,"text":"pqq1t^fa0j^duua2j^","timestamps":[0.9599999785423279,1.0,1.0799999237060547,1.2799999713897705,1.4399999380111694,1.5199999809265137,1.6799999475479126,1.7599999904632568,2.200000047683716],"tokens":["p","qq1","t^","f","a0","j^","d","uua2","j^"]}
processed.wavpqq1t^fa0j^duua2j^
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:103:Client::Client(asio::io_context&, const string&, int16_t, const string&, float, int32_t, std::string)::<lambda(websocketpp::connection_hdl)> 2023-11-07 08:40:17.952 Disconnected
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:337:int32_t main(int32_t, char**) 2023-11-07 08:40:17.952 Done!
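For reference, the result messages in the client log above are plain JSON, so a downstream consumer can parse them directly. A minimal Python sketch, using the final message copied from the log:

```python
import json

# Parse the final JSON result message emitted by
# sherpa-online-websocket-client (copied from the log above).
msg = (
    '{"final":true,"segment":0,"start_time":0.0,'
    '"text":"pqq1t^fa0j^duua2j^",'
    '"timestamps":[0.9599999785423279,1.0,1.0799999237060547,'
    '1.2799999713897705,1.4399999380111694,1.5199999809265137,'
    '1.6799999475479126,1.7599999904632568,2.200000047683716],'
    '"tokens":["p","qq1","t^","f","a0","j^","d","uua2","j^"]}'
)

result = json.loads(msg)
# Pair each token with the time (in seconds) at which it was emitted.
aligned = list(zip(result["tokens"], result["timestamps"]))
print(result["text"])
print(aligned[:3])
```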
Can you check that your LG is correct?
Have you tested the LG and pre-trained models provided by us in the doc?
Can you check that your LG is correct?
I have used this LG graph to replace trivial_graph in zipformer/decode.py, and it correctly outputs graphemes. Is there another preferred way to verify the correctness?
fast_beam_search_nbest
00d0fb48e21f4b0086ac94eb7723f150-27530: ref=['ph', 'aa2', 'p^', 'j', 'a1', 'j^', 'kh', 'vv0', 'z^']
00d0fb48e21f4b0086ac94eb7723f150-27530: hyp=['ph', 'aa2', 'p^', 'j', 'a1', 'j^', 'kh', 'vv0', 'z^']
fast_beam_search_nbest_LG
00d0fb48e21f4b0086ac94eb7723f150-27530: ref=['ภาพ', 'ใหญ่', 'คือ']
00d0fb48e21f4b0086ac94eb7723f150-27530: hyp=['ภาพ', 'ใหญ่', 'คือ']
Have you tested the LG and pre-trained models provided by us in the doc?
I tested two models and both seem to be working.
icefall-asr-librispeech-streaming-zipformer-2023-05-17
./test_wavs/1221-135766-0002.wav
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
{"final":true,"segment":0,"start_time":0.0,"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":[0.6399999856948853,0.7199999690055847,0.9599999785423279,1.03$9999618530273,1.1999999284744263,1.399999976158142,1.5999999046325684,1.6799999475479126,1.71999990940094,1.7999999523162842,1.8399999141693115,2.0399999618530273,2.119999885559082,2.2799999713897705,2.4$0000057220459,2.5199999809265137,2.6399998664855957,2.679999828338623,2.919999837875366,2.9600000381469727,3.240000009536743,3.4800000190734863,3.6399998664855957,3.879999876022339,4.159999847412109,4.27$999732971191,4.319999694824219,4.519999980926514,4.599999904632568,4.679999828338623,4.759999752044678],"tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N$,"NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming
./test_wavs/DEV_T0000000000.wav
对我介绍我想
{"final":true,"segment":0,"start_time":0.0,"text":"对我介绍我想","timestamps":[0.47999998927116394,0.5999999642372131,1.0399999618530273,1.159999966621399,2.2799999713897705,2.3999998569488525],"tokens":["对","我","介","绍","我","想"]}
./test_wavs/DEV_T0000000001.wav
重点三个问题首先表现
{"final":true,"segment":0,"start_time":0.0,"text":"重点三个问题首先表现","timestamps":[0.35999998450279236,0.4399999976158142,1.0799999237060547,1.2400000095367432,1.399999976158142,1.6399999856948853,2.319999933242798,2.4800000190734863,4.679999828338623,4.880000114440918],"tokens":["重","点","三","个","问","题","首","先","表","现"]}
./test_wavs/DEV_T0000000002.wav
分析这一次全球进动脑
{"final":true,"segment":0,"start_time":0.0,"text":"分析这一次全球进动脑","timestamps":[1.1200000047683716,1.399999976158142,1.7999999523162842,2.0,2.240000009536743,2.759999990463257,2.879999876022339,3.0799999237060547,3.2799999713897705,3.4800000190734863],"tokens":["分","析","这","一","次","全","球","进","动","脑"]}
For the following output:
[I] /workdir/Desktop/sherpa/sherpa/sherpa/cpp_api/websocket/online-websocket-client.cc:182:void
Client::OnMessage(websocketpp::connection_hdl, message_ptr) 2023-11-07 08:40:17.951 {"final":true,"segment":0,"start_time":0.0,
"text":"pqq1t^fa0j^duua2j^",
"timestamps":[0.9599999785423279,1.0,1.0799999237060547,1.2799999713897705,1.4399999380111694,1.5199999809265137,1.6799999475479126,1.7599999904632568,2.200000047683716],
"tokens":["p","qq1","t^","f","a0","j^","d","uua2","j^"]}
what are your expected text and tokens?
I would expect the text to be the grapheme form of "pqq1t^fa0j^duua2j^", i.e., "เปิด ไฟ ด้วย",
since my lexicon looks something like this:
...
เปิด p qq1 t^
ไฟ f a0 j^
ด้วย d uua2 j^
...
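For illustration, the mapping the author expects could be sketched with a reversed lexicon and greedy longest-match lookup. This is a simplification (the real system resolves words through the LG graph, which also handles ambiguity), and the helper function below is hypothetical:

```python
# Hypothetical sketch: greedily map decoded phoneme tokens back to words
# using a reversed lexicon. The three entries are taken from the lexicon
# excerpt above; a real lexicon is much larger.
lexicon = {
    ("p", "qq1", "t^"): "เปิด",
    ("f", "a0", "j^"): "ไฟ",
    ("d", "uua2", "j^"): "ด้วย",
}
max_len = max(len(k) for k in lexicon)

def phonemes_to_words(tokens):
    words, i = [], 0
    while i < len(tokens):
        # Try the longest possible phoneme sequence first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            word = lexicon.get(tuple(tokens[i:i + n]))
            if word is not None:
                words.append(word)
                i += n
                break
        else:
            i += 1  # skip tokens that match no lexicon entry
    return words

print(" ".join(phonemes_to_words(
    ["p", "qq1", "t^", "f", "a0", "j^", "d", "uua2", "j^"])))  # เปิด ไฟ ด้วย
```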
As for tokens, I'm uncertain, but I think they should probably be phonemes since the model was trained to predict phonemes.
@csukuangfj may I inquire about the status of this issue? Thank you.
I'm sorry for not getting back to you sooner.
I see the problem now.
During decoding, we save only the decoded tokens https://github.com/k2-fsa/sherpa/blob/a34c2c83bc07ad5f99f44313bf12fa36017ebe17/sherpa/csrc/online-transducer-fast-beam-search-decoder.cc#L133
It is not a problem for BPE-based models, since we can get the correct words by simply concatenating all the BPE tokens and then replacing ▁ with a space.
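A minimal sketch of that BPE detokenization step (the token values below are illustrative):

```python
# Recover the word sequence from BPE pieces by concatenating them and
# turning the SentencePiece word-boundary marker "▁" into a space.
tokens = ["▁YE", "T", "▁THE", "SE", "▁THOUGHT", "S"]
text = "".join(tokens).replace("▁", " ").strip()
print(text)  # YET THESE THOUGHTS
```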
For non-BPE-based models, we also need to save the word_ids during decoding and reconstruct the text from those word_ids. We also need to pass words.txt so that we can map word IDs to strings.
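A minimal sketch of that words.txt lookup, assuming the standard one-"WORD ID"-pair-per-line format used in Kaldi/k2 lang directories (the entries and the decoded IDs below are made up):

```python
# Load a words.txt-style symbol table ("WORD ID" per line) and map
# decoded word IDs back to strings. The contents here are illustrative.
words_txt = """\
<eps> 0
เปิด 1
ไฟ 2
ด้วย 3
"""

id2word = {}
for line in words_txt.splitlines():
    word, idx = line.split()
    id2word[int(idx)] = word

decoded_word_ids = [1, 2, 3]  # hypothetical output of LG-based decoding
print(" ".join(id2word[i] for i in decoded_word_ids))  # เปิด ไฟ ด้วย
```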
Thank you for responding so quickly. I will attempt to implement the fix as soon as I have the time.
Hello @kasidis-kanwat @csukuangfj: I would really appreciate it if you could answer my questions:
1) Which Zipformer model is good for a phone-based lexicon (with results comparable to BPE)? Which model and recipe do you recommend, and are any changes to the model parameters needed? (The tiny model in egs/librispeech/ASR/tiny_transducer_ctc does not seem to have a good CER.)
2) Can we convert it to an ONNX int8 model for sherpa?
3) Is it possible to decode with an LM?
4) Does it support contextual biasing (hotwords) for new words?
5) Does it support multiple variant transcriptions per word?
Thanks in advance,
I'm using a phone-based zipformer, but I could not get the server to output graphemes even though I'm providing an LG graph to both the C++ API and the Python API.
This is what I tried.