tzyll / goparrot

Goodness of Pronunciation (GOP) for oral reading assessment.

Evaluation of results #2

Open MustafaKarabulut opened 1 year ago

MustafaKarabulut commented 1 year ago

Hi @tzyll,

Thank you for this repo.

I've tested the code with an arbitrary wav file that included a sentence from a native speaker.

The gop_score file included the following (0001 is the id of utterance):

0001 -6.240042852613725 1.9895420867603522 -5.296820636805879

while gop_phone file was:

0001
[-2.9510255257288613, -8.125345826148987, -10.693327691819933, -10.56845474243164, -3.023837842280045, -4.766741037368774, -6.409418980280559, -5.732205581665039, -8.128262700261297, -5.793825149536133, -2.448026301229701]
[5.229272577497694, -0.3544602394104004, -1.3160901599460177, -3.878640651702881, 4.825611025094986, 3.9436023235321045, 1.164871056874593, 3.7624855995178224, 0.6423842455889728, 2.5303170680999756, 5.3356101092170265]
[-2.2281608846452503, -8.025382936000824, -8.703060626983643, -10.293020725250244, -1.7027268409729004, -4.199991464614868, -5.73492956161499, -3.551647472381592, -6.898320752221185, -5.319831967353821, -1.6079537728253532]

Apparently, gop_score is the average of the per-phoneme scores in gop_phone. But the phoneme scores themselves do not make sense to me. Can you help me understand them?
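The averaging can be checked directly against the numbers above. The following is a minimal Python sketch, not code from the repo: the names `phone_scores` and `utt_scores` are illustrative, and the values are copied from the gop_phone and gop_score output for utterance 0001.

```python
from statistics import fmean

# Per-phone scores for utterance 0001, copied from the gop_phone output above.
phone_scores = [
    [-2.9510255257288613, -8.125345826148987, -10.693327691819933,
     -10.56845474243164, -3.023837842280045, -4.766741037368774,
     -6.409418980280559, -5.732205581665039, -8.128262700261297,
     -5.793825149536133, -2.448026301229701],
    [5.229272577497694, -0.3544602394104004, -1.3160901599460177,
     -3.878640651702881, 4.825611025094986, 3.9436023235321045,
     1.164871056874593, 3.7624855995178224, 0.6423842455889728,
     2.5303170680999756, 5.3356101092170265],
    [-2.2281608846452503, -8.025382936000824, -8.703060626983643,
     -10.293020725250244, -1.7027268409729004, -4.199991464614868,
     -5.73492956161499, -3.551647472381592, -6.898320752221185,
     -5.319831967353821, -1.6079537728253532],
]

# Utterance-level scores for 0001, copied from the gop_score output above.
utt_scores = [-6.240042852613725, 1.9895420867603522, -5.296820636805879]

# Compare the arithmetic mean of each per-phone list with the utterance score.
for phones, utt in zip(phone_scores, utt_scores):
    print(f"mean of phones = {fmean(phones):.6f}, gop_score = {utt:.6f}")
```

For this utterance, each of the three gop_score values agrees with the arithmetic mean of the corresponding per-phone list.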

tzyll commented 1 year ago

Hi, the scores are computed from the Kaldi nnet3 output, which is in log-softmax form. They are not converted to percentages, but their relative magnitudes are still meaningful.

MustafaKarabulut commented 1 year ago

Hi @tzyll

Thank you for the answer. I have no experience with Kaldi's nnet3, so let me ask: is there a way to get the scores as percentages between 0 and 100, or normalized in some other way?

tzyll commented 1 year ago

Running nnet3-compute with the option --apply-exp=true may help, as it applies the exp function to the output.
https://github.com/tzyll/goparrot/blob/0c1825fb641873e68c9d4ff56096d533f2829256/run.sh#L35
https://github.com/tzyll/goparrot/blob/0c1825fb641873e68c9d4ff56096d533f2829256/run.sh#L40
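To illustrate what exponentiation does to log-domain scores, here is a rough Python sketch. It is not the repo's code. It assumes the first per-phone list reported for utterance 0001 earlier in this thread holds log-posteriors, which the thread does not confirm, and the variable names are illustrative.

```python
import math

# First per-phone list for utterance 0001 from the gop_phone output earlier
# in this thread. Treating these values as log-posteriors is an assumption.
log_scores = [
    -2.9510255257288613, -8.125345826148987, -10.693327691819933,
    -10.56845474243164, -3.023837842280045, -4.766741037368774,
    -6.409418980280559, -5.732205581665039, -8.128262700261297,
    -5.793825149536133, -2.448026301229701,
]

# For non-positive log values, exp() gives a value in (0, 1];
# multiplying by 100 expresses it as a percentage.
percent = [100.0 * math.exp(s) for s in log_scores]

# exp of the mean log score is the geometric mean of the per-phone values.
geo_mean = math.exp(sum(log_scores) / len(log_scores))

# Arithmetic mean of the exponentiated values, for comparison.
arith_mean = sum(math.exp(s) for s in log_scores) / len(log_scores)

print([f"{p:.3f}%" for p in percent])
print(f"geometric mean = {geo_mean:.6f}, arithmetic mean = {arith_mean:.6f}")
```

The two averages differ: averaging in the log domain and then exponentiating is never larger than averaging the exponentiated values. Which of the two the pipeline reports with --apply-exp=true is not stated in this thread.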

MustafaKarabulut commented 1 year ago

Hi,

Thank you for your help. I tested it with the parameter you suggested. It looks like after applying this parameter, the first score in the results file (posterior probability I guess) now looks like a ratio:

0001 0.13222444485637722 391.6637121052435 -1773.4000548275612
0002 0.07479964125533567 366.3127066292147 -1550.6301439808826
0003 0.07917093109803584 271.3388546367845 -1908.234789044808
0004 0.03988084122560747 184.34523183930867 -1510.512735071366
0005 0.03500957037780887 207.54561059383937 -1630.929833828393
0006 0.22757585776564115 1011.7200503358383 -1758.814194787892
0007 0.0515811229768653 312.19317601026654 -2064.3589891999386
011c0201 0.6268451303423566 2768.0236013326667 -396.5261355737484
011c0202 0.6541661804632958 3123.1039304234505 -395.03163460834384

While the results for the files you provided (0.62 and 0.65 respectively) seem plausible, the results for the files I tested are too low, even though they contain phrases spoken by a native speaker. Could the feature extractor or the nnet3 model be expecting a specific wave format? I am attaching the files I tested in case you need to check them.

wave.zip

P.S. I tried increasing the sampling rate of the wav files, in case audio quality was affecting the results, but it did not help.

Edit: I tried a sample audio file from the speechocean762 dataset (000060102 of speaker 006, which was scored 10 out of 10 for accuracy and other metrics). The result is again as low as the ones above:

000060102 0.19296801980291847 1069.951523771593 -1081.3568705003336

Am I missing something? I suspect so, which is why I am uploading my test folder in case I am doing something wrong.

test.zip

tzyll commented 1 year ago

The wav files used to train the nnet3 model here are 16 kHz, 16-bit.

MustafaKarabulut commented 1 year ago

Hi @tzyll,

Thank you for getting back to me again.

In fact, if a wav file does not conform to the required specs, a script checks and rejects it, so it is impossible to run a wav file that does not have a 16 kHz sampling rate. I also enabled resampling with --allow-upsample=true and --allow-downsample in case a wav with a different rate is given.
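For anyone who wants to verify a file independently of the pipeline, a standalone check of the two properties mentioned above (16 kHz, 16-bit) could look like the sketch below. It uses only Python's standard `wave` module. The function name `check_wav` and its defaults are illustrative and are not part of goparrot or Kaldi.

```python
import sys
import wave


def check_wav(path, rate=16000, sample_width_bytes=2):
    """Return (ok, description) for a PCM wav file.

    ok is True when the sampling rate and sample width match the expected
    values (defaults: 16 kHz, 16-bit). The channel count is reported only.
    """
    with wave.open(path, "rb") as wf:
        actual_rate = wf.getframerate()
        actual_width = wf.getsampwidth()  # bytes per sample
        channels = wf.getnchannels()
    ok = actual_rate == rate and actual_width == sample_width_bytes
    description = f"{actual_rate} Hz, {8 * actual_width}-bit, {channels} channel(s)"
    return ok, description


if __name__ == "__main__":
    # Usage: python check_wav.py file1.wav file2.wav ...
    for wav_path in sys.argv[1:]:
        ok, description = check_wav(wav_path)
        print(f"{wav_path}: {'OK' if ok else 'MISMATCH'} ({description})")
```

A file that passes this check can still score poorly, so this only rules out a format mismatch as the cause.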

Also, the sample file from speechocean762 conforms to the required specs without any further settings.

Any thoughts?

rudransh2004 commented 11 months ago

Yes, I'm facing the same problem: the scores are very low for the speechocean dataset. Please help me understand how to get correct scores.

anelibon commented 5 months ago

Hi,

Were you able to fix the problem and get correct scores? I am trying to run it also.

@rudransh2004 @MustafaKarabulut @tzyll