kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Goodness of Pronunciation (GOP) #3675

Closed naxingyu closed 4 years ago

naxingyu commented 4 years ago

I see on the recently compiled FAQ page that GOP is frequently asked about. @jimbozhang published a GMM-based implementation a while ago. It would be helpful if he updated it to be NN-based and set up a small example using pretrained models.

jimbozhang commented 4 years ago

Thanks @naxingyu. The GMM-GOP implementation is not good; please ignore it.

I'm working on DNN-GOP and phone-level pronunciation features these days. Currently the nnet2 version is almost ready: https://github.com/jimbozhang/kaldi/tree/gop

I will complete the nnet2 version tonight or tomorrow, and then migrate it to nnet3.

It would be convenient if there were a pre-trained nnet3 TDNN model (non-chain) available to test TDNN-GOP.

jimbozhang commented 4 years ago

The nnet2 version is done: https://github.com/jimbozhang/kaldi/commit/02198ae6fddce9cb5d898b61374e35492aa3cef1

Could you please check it? @danpovey @naxingyu I'll migrate the scripts to nnet3 soon. The binaries are the same for nnet2 and nnet3.

danpovey commented 4 years ago

I made a couple of comments... it needs more clarification, I think.

jimbozhang commented 4 years ago

Thank you @danpovey. I'll replace the binary convert-phone-ali with a script and add more comments to the code to illustrate the approach.

naxingyu commented 4 years ago

I suggest developing nnet3-based scripts, as nnet3 is more actively maintained, and setting up an example specific to GOP, following the apiai example. I have a pre-trained Librispeech nnet3 TDNN model; I'll make it available for you, @jimbozhang.

naxingyu commented 4 years ago

@jimbozhang please make a PR when you have a draft version so that we can proceed.

jimbozhang commented 4 years ago

I have set up a new egs and removed the nnet2-based implementation: https://github.com/jimbozhang/kaldi/commit/0fa5b5e0106dcf9370ce061eb33629d3c245d27a

This version is still buggy and lacks documentation. I will make a PR once it is all ready. @naxingyu

jimbozhang commented 4 years ago

I think the following version should be better: https://github.com/jimbozhang/kaldi/tree/gop/egs/gop

@naxingyu @danpovey

naxingyu commented 4 years ago

If you really need to render the equations, consider using this editor. As for the Librispeech nnet3 TDNN model, I'll upload it to kaldi-asr.org ASAP for you to complete the preparation scripts.

jimbozhang commented 4 years ago

PR #3703 has been opened.

jsalsman commented 4 years ago

@brijmohan asked me to comment on this, which I discussed with @danpovey briefly in April 2018.

GOP scores based on the posterior probabilities of phone recognition are at best only weakly correlated with the actual intelligibility of an utterance, because of context effects and because ambiguities arise from the alternative utterance possibilities constrained by the set of dictionary words and their grammar. Educational Testing Service (ETS), which has worked on this problem longer and with more highly skilled effort than anyone, had achieved only about 58% agreement with human pronunciation judges on isolated utterances as of last year. The mistake stretches back to section 2.2.2 of this seminal ICSLP-90 paper. A thorough review of all such GOP score formulations is on pp. 7-11 of this IIB report. They are all terrible for language-learning applications because they are so poorly correlated with intelligibility.
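(For reference, the classic GOP formulation this critique targets is, in roughly Witt and Young's notation, the duration-normalized log posterior of the canonical phone p given its aligned frames O^(p), where NF(p) is the number of frames aligned to p and Q is the phone set; notation here is the editor's, not the commenter's:)

    \mathrm{GOP}(p) \;=\; \frac{1}{NF(p)} \log P\bigl(p \mid \mathbf{O}^{(p)}\bigr)
    \;\approx\; \frac{1}{NF(p)} \log \frac{p(\mathbf{O}^{(p)} \mid p)\,P(p)}{\max_{q \in Q} p(\mathbf{O}^{(p)} \mid q)\,P(q)}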

The correct alternative is to train models based on authentic intelligibility, derived from whether the text listeners report hearing in utterances matches the intended speech, a technique due originally to Prof. Seiichi Nakagawa and his graduate student Hiroshi Kibishi's work in 2011. ETS had in 2014 attempted to patent this technique before they were aware that Nakagawa had been publishing on it. At this time there is only one commercial system using the authentic-intelligibility technique, which Brij and I built in 2017. There are some technical slides from April, the first handful of which attempt to illustrate why intelligibility outperforms GOP scores. Nakagawa's method requires training models from phonological features (which can be derived, as Brij and I do, from N-best lists of multiple passes of tri-segment recognition on all phonemes in anchoring contexts derived from forced alignment; this method would certainly work as well or better in Kaldi than it does in PocketSphinx). However, the use of such phonological features requires a completely new kind of training data composed of dozens of learner utterances per intended prompt word or phrase, each labeled with several blinded listener transcriptions.

This kind of data is exceptionally difficult to come by, but within reach at large scale from Mozilla's Common Voice project. Would you please help by joining in asking @Gregoor and @nukeador on the issue here to support Common Voice's ability to provide data for intelligibility assessment and remediation?

jimbozhang commented 4 years ago

Thanks for @jsalsman's comments.

Indeed, GOP has drawbacks, and recently there have been more and more papers on mispronunciation detection.

But the advantages of GOP are remarkable:

  1. It is the classic method; its performance has been proven in studies and commercial products.
  2. With a pre-trained ASR model, it does not need any human-labeled data for training.
  3. The implementation is simple.

Kaldi has not had a recipe for mispronunciation detection, so GOP is the natural choice for the first one.

Per your suggestion, once we have human-labeled training data, we will consider implementing those approaches.

jsalsman commented 4 years ago

@jimbozhang Thanks for your reply. When an open-source transcription database does become available, GOP scores are going to fall from favor quickly. Calling any score based on phoneme recognizers' posterior probabilities "goodness of pronunciation" is like calling a regression prediction's residual an error in the source data.

What is the best performance for GOP scores relative to human judges that you have seen reported in the peer-reviewed literature? I know there are commercial offerings claiming 97+% in their marketing collateral, including from Prof. Bernstein's company Versant (formerly Ordinate, now part of Pearson), but those are composites based on the copula of typically five individual trials, and they are still inaccurate, because combining GOP scores still doesn't correlate with authentic intelligibility.

I realize there is a huge push for this now because the Chinese National College Entrance Examination's English speaking component (which represents 10% of the total score) has been decreed to be automated in 2020. But there will be many entrants using GOP scores, and maybe two or three using authentic intelligibility; surely the producers of the latter are going to publish benchmarks against their competitors. Do you think the Chinese government will allow anyone to use the GOP scores once those side-by-side comparison benchmarks are published?

Today Google Scholar recommended this article to me. It has three blatant and confusing spelling errors in the abstract, one of which is repeated in the title ("high-states" should be "high-stakes"). The issue with GOP scores is worse, though, because it is the sort of carelessness that crosses the line from chabuduo to professional misconduct, for which engineers can be held financially liable for negligence in court, both in the US and China.

naxingyu commented 4 years ago

This can be closed, @danpovey.

zhushaojin commented 4 years ago

@jimbozhang Hi Jim, I ran into an error, and I'm not sure whether it's due to a mistake in my process or a bug in the GOP code.

The compute-gop binary may mis-combine two phones.

The input text is "AMERICA, AMERICA, AMERICA". The alignment output is correct:

spk_000707_000707 [ 2 1 1 1 1 1 1 ] [ 3668 ] [ 27960 ] [ 14614 ] [ 32336 ] [ 22188 ] [ 26376 ] [ 3976 ] [ 3722 ] [ 27960 ] [ 14614 ] [ 32336 ] [ 22188 ] [ 26376 ] [ 3976 ] [ 3722 ] [ 27960 ] [ 14614 ] [ 32336 ] [ 22188 ] [ 26376 ] [ 3976 ] [ 2 1 1 1 1 1 1 1 1 1 ]
spk_000707_000707 SIL AH0_B M_I EH1_I R_I IH0_I K_I AH0_E AH0_B M_I EH1_I R_I IH0_I K_I AH0_E AH0_B M_I EH1_I R_I IH0_I K_I AH0_E SIL

Then I used compute-gop to compute the GOP scores. However, the output is as follows (translated into phone-score pairs):

SIL -1.153042 AH 0 M -0.7004871 EH -0.7042251 R -0.8950429 IH -0.3069291 K -0.8031902 AH 0 M -0.7745457 EH -0.7189884 R -0.6910372 IH -0.233676 K -0.847949 AH 0 M -0.6344199 EH -0.7651992 R -0.7731686 IH -0.2464671 K -0.8914728 AH 0 SIL -3.559371

As shown above, the two adjacent phones "AH0_E AH0_B" are combined into a single "AH". Is this intended behavior of compute-gop, or did I do something wrong?

jimbozhang commented 4 years ago

It is a bug; thanks for pointing it out, @zhushaojin. I'll fix it soon.
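(For illustration only; this is not the actual compute-gop C++ code, and the helper names below are hypothetical. A minimal Python sketch of how merging adjacent identical phones after stripping stress and word-position markers produces exactly this symptom, together with one way to keep word boundaries intact:)

    # Hypothetical sketch, NOT the compute-gop implementation: merging
    # adjacent identical phones AFTER stripping the position suffix glues
    # a word-final AH0_E to the following word-initial AH0_B.
    def to_pure(phone):
        """'AH0_E' -> 'AH': drop the word-position suffix, then the stress digit."""
        base = phone.split('_')[0]            # 'AH0_E' -> 'AH0'
        return base.rstrip('0123456789')      # 'AH0'   -> 'AH'

    def merge_buggy(phones):
        """Collapse runs of equal pure phones; word-position info is already lost."""
        out = []
        for p in phones:
            q = to_pure(p)
            if not out or out[-1] != q:
                out.append(q)
        return out

    def merge_fixed(phones):
        """Merge runs only within a word: a '_E'/'_S' phone closes its segment."""
        out, seg_open = [], False
        for p in phones:
            q = to_pure(p)
            if seg_open and out and out[-1] == q:
                continue                       # same phone within the same word
            out.append(q)
            seg_open = not p.endswith(('_E', '_S'))
        return out

    ali = ['SIL', 'AH0_B', 'M_I', 'EH1_I', 'R_I', 'IH0_I', 'K_I', 'AH0_E',
           'AH0_B', 'M_I', 'EH1_I', 'R_I', 'IH0_I', 'K_I', 'AH0_E', 'SIL']
    print(merge_buggy(ali))  # the AH pair at the word boundary collapses to one
    print(merge_fixed(ali))  # both AH segments stay separate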

jsalsman commented 4 years ago

@danpovey would you please let me know whether or not you agree with my analysis?

danpovey commented 4 years ago

@jsalsman I'm not opposed to the methods you advocate (I think they make sense), but until someone makes a PR implementing them, there's not much I plan to do about it.

jsalsman commented 4 years ago

@danpovey re:

until someone makes a PR implementing them, there's not much I plan to do about it.

Please re-open this issue until either the short-term or long-term defects are corrected, or both.

ArtemisZGL commented 4 years ago

Hello, I tried to run the GOP recipe, and I found that the model at https://kaldi-asr.org/models/m13 is quite different from the script description (https://github.com/kaldi-asr/kaldi/blob/master/egs/gop/s5/run.sh). So @naxingyu @jimbozhang, which model did you use? When I tried replacing the model with my own tdnn_1d_sp, the results were quite bad.

jimbozhang commented 4 years ago

A chain model is not a good fit for GOP scoring. I suggest training your scoring model with the script egs/librispeech/s5/local/nnet3/run_tdnn.sh.

sean37 commented 4 years ago

@jimbozhang My system is based on a chain model, and I'd like to know how to get reliable GOP scores given that. Do you have any ideas or plans for that?

danpovey commented 4 years ago

You could maybe use the "output-xent" output instead of the one named "output".

jimbozhang commented 4 years ago

As @danpovey said, using "output-xent" as the output is one solution. But the performance will likely be a bit worse than simply using a xent model (I have tried it). So my suggestion is: just don't use a chain model for GOP scoring.

jsalsman commented 4 years ago

Ethics problem: by pulling a lever, you can make the foundation of a building last orders of magnitude longer in accordance with the architect's instruction. "The limitations of the GOP measure were apparent from the outset and I was surprised that it caught on and was so widely used." -- Steve Young, Professor Emeritus, Cambridge University, March 25, 2020. By standing back while the lever is not pulled, the building will collapse in days to weeks, quite possibly taking those who have neglected to help pull the lever with it.

danpovey commented 4 years ago

I am working on a new framework that should do better with all kinds of confidences. Right now I am not investing time in that type of thing.

sean37 commented 4 years ago

@danpovey It sounds very interesting. Could you briefly explain what the new framework is and how it works?

danpovey commented 4 years ago

It's still very early in development, nowhere near usable. See the k2 project on my GitHub.

ArtemisZGL commented 4 years ago

@jimbozhang Thanks! I will try that.

ahmedalbahnasawy commented 4 years ago

Regarding @danpovey's suggestion in https://github.com/kaldi-asr/kaldi/pull/3703: I used a TDNN-F chain model and the result was similar to the nnet3 model. @jimbozhang, could you check this sample test? Thanks. nnet3-vs-tdnn-f.txt

ArtemisZGL commented 4 years ago

@jimbozhang Hello, I want to ask one more question, about speed. For a 15-second audio clip it takes about 3 seconds to get a result: roughly 1 second each for computing the nnet output, aligning, and computing the GOP score. Also, it seems Kaldi has to reload the model every time I run the scripts. Is there a way to load the model into memory once and use it for many assessments?

jimbozhang commented 4 years ago

If you want to do that, you'll have to integrate the C++ code yourself. @ArtemisZGL

The following code might serve as an example: https://github.com/jimbozhang/kaldi-gop/blob/master/src/gopbin/compute-gmm-gop.cc
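(One workaround that avoids a per-utterance model load without writing C++: Kaldi's table-reading binaries load the model once per invocation, so passing an scp/ark covering many utterances to a single run amortizes the load across all of them. If you do integrate the C++ code, the structure is "load once, loop over utterances"; below is a hypothetical Python sketch of that shape, where load_model and score_utterance are placeholders, not real Kaldi or PyKaldi APIs:)

    # Hypothetical sketch of the "load once, score many" loop one would
    # implement in C++ (cf. compute-gmm-gop.cc). load_model() and
    # score_utterance() are placeholders, NOT real Kaldi calls.
    import sys

    def load_model(path):
        """Placeholder for reading the acoustic model once at start-up."""
        return {"path": path}

    def score_utterance(model, wav_path):
        """Placeholder for feature extraction + alignment + GOP scoring."""
        return 0.0

    if __name__ == "__main__":
        model = load_model("final.mdl")   # cost paid once per process
        for line in sys.stdin:            # one wav path per input line
            wav = line.strip()
            if wav:
                print(wav, score_utterance(model, wav))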

jsalsman commented 4 years ago

@danpovey Would you please telephone me at +1-970-616-1934? You are allowing others to tarnish your good name by associating atrociously poor engineering and negligence with your project. I would like the opportunity to convince you to put an end to this farce. If you prefer that I ask Professor Young to reach out on this, please inform me at your earliest convenience.

danpovey commented 4 years ago

@jsalsman I haven't looked deeply enough into alternate GOP methods to have an opinion on this stuff. The reason I don't create an issue is that I don't have time to work on this, and all my effort is going into a next generation that should have better confidence values anyway.

jsalsman commented 4 years ago

@dpovey, the problem is with the assumption that correct pronunciation causes recognition confidence scores to increase. It does, but only with a weak correlation, and improving confidence scores can almost as easily hurt intelligibility-assessment performance when "GOP" scores are used, especially when people insist on using formulas designed for and tuned on HMMs. That approach is not going anywhere, because the unit mismatch is even worse. The actual intelligibility of a word or phoneme depends on the audio before and after the segment in question, but "context dependent" is not a perfect term for that, because it doesn't mean the same thing here as it does in grammars.

If intelligibility assessment is the actual goal, then all those problems go away, and you aren't limited by trying to optimize through a vague correlation or with formulas designed for a different technology.

What we need for that is phonological model output, like the ANN signal processing from the CSLU Toolkit (or its closed-source counterpart, the Sensory FluentSoft/TrulyNatural recognizer).

jsalsman commented 4 years ago

@danpovey sorry I didn't ping you correctly when I replied by email ^

Are you familiar with the CSLU Toolkit diphone ANN phonological recognition architecture?

danpovey commented 4 years ago

No, I'm not familiar with that. BTW the next iteration of Kaldi, which I'm working on, will put much more emphasis on confidence estimation. That's where I'm putting most of my energy now.

jsalsman commented 4 years ago

@danpovey I look forward to that. Good confidence scores are a necessary accompaniment to good N-best results ordering, which is how you can derive phonological features from 3-phone recognition.

[image: https://user-images.githubusercontent.com/3393748/87435642-deb40380-c5a0-11ea-8c59-46abf7db82c1.png]

[image: https://user-images.githubusercontent.com/3393748/87435691-ee334c80-c5a0-11ea-9504-94a6ec0bcac8.png]

The problem is that perfect confidence scores alone, even if they were possible, can only very imperfectly predict intelligibility. The context, i.e. the confidence scores of the neighboring phones, is also a factor (or several factors).

danpovey commented 4 years ago

Thanks!

hpqiu commented 9 months ago

@jimbozhang Is there a pronunciation scoring model for Mandarin Chinese? If the text to be assessed is limited, is it much trouble to train a model myself? I've searched Git for a long time and there are very few GOP projects; is the technology in this area immature? Thanks!

jsalsman commented 9 months ago

@hpqiu Not only is the technology in this area immature, it is fundamentally flawed and scientifically unethical, because it leads to indisputable, serious errors, as shown in https://youtu.be/DTj7VILryRo. You should assess whether listeners can understand a mispronounced utterance, i.e. its intelligibility, not whether it conforms to some idealized accent reflected in the average pronunciation of the recorded training data (GOP).

jimbozhang commented 9 months ago

@hpqiu I'm not certain if there is a pretrained model available for Mandarin pronunciation assessment. You might need to train one yourself.

Additionally, it's worth noting that the GOP method may be outdated, and the recipe provided is merely a baseline for the 'speechocean762' dataset. I recommend considering more contemporary approaches, such as the one in this paper: https://arxiv.org/abs/2306.02682.

jsalsman commented 9 months ago

@hpqiu @jimbozhang here is another recent approach which may offer more utility per cost: https://www.isca-speech.org/archive/pdfs/slate_2023/wei23_slate.pdf

I'm frustrated that I can't find any intelligibility approaches for Mandarin learners.

a2d8a4v commented 6 months ago

Hi, @jimbozhang. Thank you for your implementation of compute-gop. That said, I'm a little curious about the GOP feature results retrieved from compute-gop.

For instance, GOPT publicly provides extracted GOP features for the speechocean762 dataset (via a Librispeech pretrained acoustic model). For example, the features for the first phoneme in utterance 000030012 are:

2.400000000000000000e+01,-1.000000000000000000e+01,-1.000000000000000000e+01,-1.000000000000000000e+01,5.365714550018310547e+00,5.189569473266601562e+00,6.190038681030273438e+00,5.138529300689697266e+00,4.796952724456787109e+00,5.254307270050048828e+00,5.035524368286132812e+00,4.299823284149169922e+00,5.053345680236816406e+00,4.465762138366699219e+00,5.392735004425048828e+00,5.603890419006347656e+00,5.090937137603759766e+00,4.962147235870361328e+00,4.706232070922851562e+00,4.564249992370605469e+00,5.727764129638671875e+00,5.460106372833251953e+00,4.366916179656982422e+00,5.075336456298828125e+00,5.749814987182617188e+00,6.595030784606933594e+00,5.697445869445800781e+00,3.835732460021972656e+00,5.360883712768554688e+00,3.917288780212402344e+00,5.020288467407226562e+00,5.439896106719970703e+00,5.351333618164062500e+00,4.589535236358642578e+00,5.177958965301513672e+00,4.498236179351806641e+00,4.042528629302978516e+00,5.085375308990478516e+00,4.981881141662597656e+00,4.874222755432128906e+00,4.687701702117919922e+00,4.831391811370849609e+00,2.951301813125610352e+00,1.659503173828125000e+01,1.659503173828125000e+01,1.659503173828125000e+01,1.229316234588623047e+00,1.405461311340332031e+00,4.049921035766601562e-01,1.456501483917236328e+00,1.798078060150146484e+00,1.340723514556884766e+00,1.559506416320800781e+00,2.295207500457763672e+00,1.541685104370117188e+00,2.129268646240234375e+00,1.202295780181884766e+00,9.911403656005859375e-01,1.504093647003173828e+00,1.632883548736572266e+00,1.888798713684082031e+00,2.030780792236328125e+00,8.672666549682617188e-01,1.134924411773681641e+00,2.228114604949951172e+00,1.519694328308105469e+00,8.452157974243164062e-01,0.000000000000000000e+00,8.975849151611328125e-01,2.759298324584960938e+00,1.234147071838378906e+00,2.677742004394531250e+00,1.574742317199707031e+00,1.155134677886962891e+00,1.243697166442871094e+00,2.005495548248291016e+00,1.417071819305419922e+00,2.096794605255126953e+00,2.552502155303955078e+00,1.509655475616455078e+00,1.613149642944335938e+00,1.720808029174804688e+00,1.907329082489013672e+00,1.763638973236083984e+00,3.643728971481323242e+00

Here, 24 is the phoneme index in the Librispeech lexicon, and -10.0 represents non-verbal tokens and silence. But the numbers that follow, e.g. 5.365714550018310547, are positive. According to the paper, LPP is a weighted sum of log conditional probabilities, which should be negative.
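(For reference, the LPP being discussed is usually written as follows in the GOP-on-DNN literature, with t_s and t_e the start and end frames of the segment aligned to canonical phone p; notation here is the editor's:)

    \mathrm{LPP}(p) \;=\; \frac{1}{t_e - t_s + 1} \sum_{t=t_s}^{t_e} \log p\bigl(p \mid \mathbf{o}_t\bigr)

(Since every posterior p(p | o_t) is at most 1, each term is at most 0, so LPP can never be positive, which is exactly the point being made here.)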

I also tried implementing LPP myself. I could not reproduce their numbers even approximately, and my implementation gives negative values, as expected:

    import numpy as np

    # prob_phone_matrix: [num_frames, num_phones]; each row sums to 1
    # ti: first frame of the segment; phone_id: index of the canonical phone
    tf = ti + len(transitions_by_phone) - 1   # last frame of the segment
    lpp = np.log(prob_phone_matrix[ti:tf + 1, phone_id]).sum() / (tf - ti + 1)

Am I missing something in my understanding of LPP? Or are there further considerations in the compute-gop implementation that we don't know about? I hope for your further assistance.

jimbozhang commented 6 months ago

Hi ヤンヤン @a2d8a4v,

  1. For any issue with code other than Kaldi (such as GOPT), please reach out to its authors for assistance.

  2. It may be helpful to compare the results of your LPP implementation against the code at: https://github.com/kaldi-asr/kaldi/blob/21ae411fd46282726d893e53c05fef5baea64fef/src/bin/compute-gop.cc#L86.

a2d8a4v commented 6 months ago

Thank you for replying to me, @jimbozhang.

To clarify: both the GOPT example and I used compute-gop from Kaldi to extract the [LPP42, LPR42] features. We both used the Librispeech pretrained AM from here, and we both get positive LPP numbers.

I am unsure which part leads to these results; I may take some time to check my code again.

jimbozhang commented 6 months ago

@a2d8a4v The pretrained model at https://kaldi-asr.org/models/m13 is a chain (LF-MMI) model, which may not be suitable for GOP. You may want to use only the output-xent output rather than the original output (which is what can make LPP positive). However, based on my experience, the results will still not be good:

https://github.com/kaldi-asr/kaldi/blob/21ae411fd46282726d893e53c05fef5baea64fef/egs/gop_speechocean762/README.md?plain=1#L92

A recommended recipe can be found at https://github.com/kaldi-asr/kaldi/tree/master/egs/librispeech/s5/local/nnet3, which uses only cross-entropy (no MMI) as the loss and the vanilla HMM topology (no chain).
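(A quick way to see why a chain model's raw "output" can yield positive LPP values while true log-posteriors cannot: LPP averages log-probabilities, and a log-softmax output is never positive, whereas unnormalized LF-MMI outputs can be. A minimal numpy illustration with invented values:)

    import numpy as np

    logits = np.array([2.3, -0.4, 1.1])  # invented chain-style raw outputs
    # log-softmax turns them into genuine log-posteriors, all <= 0:
    log_post = logits - np.log(np.exp(logits).sum())
    print(log_post.max() <= 0)  # True: averaging these can never be positive
    print(logits.max() > 0)     # True: averaging raw outputs can go positive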

a2d8a4v commented 6 months ago

@jimbozhang Got it! Maybe this problem can be left as it is, since performance under the LF-MMI chain model would not be good or stable anyway. Thank you again for explaining the details!