Hi, so I trained Kaldi using your (old) s5 script, and as a sanity check I tried to decode the training data. When I compared the text file from the training data to my results, I noticed that there seem to be quite a number of errors in the texts. I checked the audio and xml files and saw that the sentences were wrong.
I added a screenshot of a partial vimdiff of the text and my results.
It appears to be a mixup, as those sentences do exist, but in other audio files.
Hi, thanks for opening the issue. I was aware of this and am currently investigating it.
By running https://github.com/tudarmstadt-lt/kaldi-tuda-de/blob/master/s5_r2/local/run_cleanup_segmentation.sh from the new set of scripts in s5_r2, a cleaned training directory is generated. I've compared it to the train set, and these seem to be the broken IDs:
https://github.com/tudarmstadt-lt/kaldi-tuda-de/blob/master/s5_r2/local/cleanup/problematic_wavs.txt
The cleanup script also decodes the training set (with a biased language model built from the reference), so that's similar to what you did. The IDs from s5_r2 are similar to those in s5; there is just a suffix added for the different microphones (a, b, c), since s5 didn't use all the available data.
Can you send me your diff and/or would you be able to fix the corpus XML files directly? Otherwise, since this doesn't seem to be a lot of the total data (1.5%), I'd simply remove the broken data from the next release (v3) of the corpus.
If fixing the XML files needs some helping hands and makes sense, let me know.
Hi svenha!
That would be greatly appreciated! There is already a list of files with wrong transcriptions in the cleanup folder in the repository. That should be a good starting point!
I'm currently preparing v3 of the corpus, where I sorted these files out into a separate folder. I can send you a link tomorrow.
Here is our planned v3 package, where hopefully most of the bad utterances have been moved into a separate folder: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v3.tar.gz
@KlausBuchegger can you upload your diff and decoded output?
Two questions about the file problematic_wavs.txt.
I think the uploads from @KlausBuchegger might be helpful ...
Hello, I am trying to train DeepSpeech, and two days ago I got the feeling that the model's poor convergence might lie in the data. I checked the data and have a rather preliminary impression: some transcriptions were deleted, but others got mixed up, so the text exists but is (was) assigned to a different recording (or, of course, the same text was spoken multiple times). For example, train/2014-03-17-13-03-33_Kinect-Beam.wav should have the same text as train/2014-05-08-11-48-47_Kinect-Beam.wav.
What speaks for a mixed-up (wrong) assignment is that sometimes I have the impression the text was just shifted:
train/2014-03-17-13-03-49_Kinect-Beam.wav should have the same text as train/2014-03-17-13-03-33_Kinect-Beam.wav and train/2014-03-17-13-04-15_Kinect-Beam.wav and train/2014-03-17-13-04-43_Kinect-Beam.wav
The dates are pretty close (they are neighboring lines in my training file). Should we use some sort of coordination for the work of sorting this out?
@fbenites Yes, it would be good to distribute the correction work. Benjamin (bmilde) offered to produce a list of problematic transcripts by running the recognizer on the train set. When this is done, let us split the manual checking work into two parts.
@fbenites @svenha thanks for offering to help. I'm running the decode on the train set right now and should have the results soon.
@svenha As for the number of problematic utterances in the proposed v3 tar: the numbers are different because I excluded the wav files of all microphones whenever the decode of at least one microphone failed. There are multiple microphone recordings of the same utterance; better safe than sorry. Still, only about 1.5% of all utterances are problematic, but the problematic files tend to be in the same recording session(s).
@fbenites Low RNN-CTC performance will probably remain even if we fix all the problematic files. There is probably not enough data for end-to-end RNN training (40h x number of microphones, but that is more like doing augmentation on 40h of data). What kind of WERs are you seeing with DeepSpeech at the moment? Are you training with or without a phonetic dictionary? Utterance variance is unfortunately also fairly low; there are only about 3000 distinct sentences in train/dev/test combined, so I suggest using a phoneme dictionary if possible. I also suggest adding German speech data from SWC (https://nats.gitlab.io/swc/), which worked very well in our s5_r2 scripts in this repository (18.39% WER dev / 19.60% WER test now).
@bmilde Thanks, I will have a look at the Wikipedia data. I am not certain I removed all the problematic files; I am also using github.com/ynop/audiomate for processing, which already covers some of them, but I added the other 700 to a blacklist. I will have some results tomorrow. WER is also complicated, see https://github.com/ynop/deepspeech-german: sometimes the text is just missing some characters or spaces, which worsens the WER a lot. I had useful results using only Voxforge, contradicting the results with Voxforge plus Tuda. I will check the phoneme dictionary in DeepSpeech, thanks.
@fbenites @svenha
I uploaded the decode of the tuda train set to: http://speech.tools/decode_tuda_train.tar.gz
This file might be interesting:
exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/wer_details/per_utt
or alternatively you can also diff e.g.:
diff exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/penalty_0.5/10.txt exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/test_filt.txt
This doesn't look so pretty in standard diff, though, so a graphical diff tool like the one in the screenshot from Klaus is probably a better idea.
Thanks @bmilde. I investigated the per_utt file with a little script that sums the substitution, insertion, and deletion parts of the "#csid" fields, counting all of them as errors of equal weight. Then I used an error threshold t: for t=4, 2490 files are affected; for t=5, 1788 files; for t=6, 1511 files. If I ignore the microphone suffix (_a etc.), 616 files remain to be checked for t=6. As we have only 2 annotators, I would suggest using t=6. (We can repeat the decoding of the train set with the improved corpus and do a second annotation round.) If you agree, I can produce the file list, sort it, and cut it in the middle. I would take the first half and @fbenites the second part?
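For reference, a minimal sketch of such a filter script (the per_utt path is the one from the decode above; the exact "#csid" field order, correct/substitutions/insertions/deletions, is an assumption about Kaldi's wer_details format):

```python
#!/usr/bin/env python3
# Hypothetical filter: sum sub/ins/del counts per utterance from Kaldi's
# per_utt file and report utterances whose error count reaches threshold T.
PER_UTT = "exp/chain_cleaned/tdnn1f_1024_sp_bi/decode_tuda_train/scoring_kaldi/wer_details/per_utt"
T = 6  # absolute error threshold discussed above

errors = {}
with open(PER_UTT) as f:
    for line in f:
        fields = line.split()
        # assumed layout: <utt-id> #csid <correct> <sub> <ins> <del>
        if len(fields) == 6 and fields[1] == "#csid":
            errors[fields[0]] = sum(int(n) for n in fields[3:6])

flagged = sorted(u for u, e in errors.items() if e >= T)
print(f"{len(flagged)} utterances with at least {T} errors")

# collapse the microphone suffix (_a, _b, ...) to count distinct recordings
print(len({u.rsplit("_", 1)[0] for u in flagged}), "distinct ignoring mic suffix")
```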
@bmilde : How should we contribute the changes? We could edit the .xml files, collect all changed .xml files and send them to you. But I am open to other approaches, like a normal pull request (if the repo is not too large for this).
One final point: in per_utt there are file names like 02dae8284f104451a8de85538da6fdec_20140317140355_a. Is it safe to assume that such a name corresponds 1:1 to an xml file derived from the date/time part, here 2014-03-17-14-03-55.xml?
Many thanks @svenha !
Yes, it is safe to assume that 02dae8284f104451a8de85538da6fdec_20140317140355_a belongs to 2014-03-17-14-03-55.xml
02dae8284f104451a8de85538da6fdec is the speaker hash, the last letter indicates the microphone, and in between the ID contains the timestamp without the dashes. Note that it's also safe to assume that _a, _b, _c etc. all belong to the same utterance and should contain the same transcription. If all of them decode to something else, it's safe to assume an incorrect transcription.
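To make the mapping concrete, here is a small sketch (assuming exactly the ID layout described above; the helper name is made up):

```python
# Map a per_utt utterance ID (<speaker_hash>_<YYYYMMDDhhmmss>_<mic>)
# to the corresponding corpus xml file name (YYYY-MM-DD-hh-mm-ss.xml).
def utt_id_to_xml(utt_id: str) -> str:
    _speaker_hash, stamp, _mic = utt_id.rsplit("_", 2)
    parts = [stamp[0:4], stamp[4:6], stamp[6:8],
             stamp[8:10], stamp[10:12], stamp[12:14]]
    return "-".join(parts) + ".xml"

assert utt_id_to_xml("02dae8284f104451a8de85538da6fdec_20140317140355_a") \
    == "2014-03-17-14-03-55.xml"
```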
Maybe it's also a good idea to make the threshold t dependent on the length of the transcription, since there are some very short utterances, too. But I can also always rerun the decoding for you after receiving corrected xml files.
Since the corpus is not hosted on Github, it's probably easier if you send me the corrected xml files directly. But you can also send me a pull request, containing only the corrected xml files, placed somewhere in a subfolder of the local directory. I can then also write a script that checks the corpus files and patches them if needed, so that it is not necessary to redownload the whole tar.gz file.
One final point: Note that the xml files have sentence IDs.
E.g. sentence_id 59 in:
<?xml version="1.0" encoding="utf-8"?>
<recording>
  <speaker_id>02dae828-4f10-4451-a8de-85538da6fdec</speaker_id>
  <rate>16000</rate>
  <angle>0</angle>
  <gender>male</gender>
  <ageclass>21-30</ageclass>
  <sentence_id>59</sentence_id>
  <sentence>Rom wurde damit zur ‚De-Facto-Vormacht‘ im östlichen Mittelmeerraum.</sentence>
  <cleaned_sentence>Rom wurde damit zur De Facto Vormacht im östlichen Mittelmeerraum</cleaned_sentence>
  <corpus>WIKI</corpus>
  <muttersprachler>Ja</muttersprachler>
  <bundesland>Hessen</bundesland>
  <sourceurls><url>https://de.wikipedia.org/wiki/Römisches_Reich</url></sourceurls>
</recording>
A text file with all of the sentence IDs and transcriptions is in the root of the corpus archive. Since there are multiple recordings per sentence ID, it is very unlikely that the correct sentence is not included. Maybe we can also just try to find the closest match automatically and then check manually that it's correct?
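A rough sketch of that closest-match idea, using Python's difflib (the sentence list file name and its tab-separated format are assumptions):

```python
import difflib

# Assumed format of the sentence list in the corpus root: "<id>\t<sentence>"
def load_sentences(path="SentencesAndIds.txt"):
    sentences = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            sid, _, text = line.rstrip("\n").partition("\t")
            sentences[text] = sid
    return sentences

def closest_sentence(decoded, sentences):
    """Return the corpus sentence (and its ID) closest to a decoded hypothesis."""
    best = difflib.get_close_matches(decoded, sentences, n=1, cutoff=0.0)[0]
    return best, sentences[best]
```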
I switched from an absolute threshold to a relative threshold as suggested by @bmilde: number_of_errors / number_of_words >= 0.2. I include an xml file only if all of its recordings (i.e. microphones a, b, c, and d) fulfill this criterion. This gave me 744 xml file names, which I attach below in two parts of 372 file names each. I will check the first part (files1) now and send corrected xml files. (We will see how these changes must be propagated to the SentencesAndIds files.) If the sentence-id was shifted by 1 or similar (as noted by others), I will correct the sentence-id and delete the sentence and cleaned_sentence elements.
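For illustration, a sketch of this relative criterion, reusing the error counts and the utt_id_to_xml helper from the earlier sketches (per-utterance word counts are assumed to come from the ref lines of per_utt):

```python
from collections import defaultdict

# errors: dict utt_id -> (error_count, word_count_of_reference)
def flagged_xml_files(errors, threshold=0.2):
    by_xml = defaultdict(list)
    for utt_id, (n_err, n_words) in errors.items():
        by_xml[utt_id_to_xml(utt_id)].append(n_err / max(n_words, 1))
    # flag an xml file only if every recording (all mics) reaches the threshold
    return sorted(xml for xml, rates in by_xml.items()
                  if all(r >= threshold for r in rates))
```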
tuda-20perc-err.files1.txt is finished; my manual correction speed was around 2 minutes per recording. I will send the files to @bmilde. Any volunteers for tuda-20perc-err.files2.txt? (If not, I might have time in August.)
Just to avoid duplicate work, I wanted to let you know that I am working on files2.
The second half (i.e. files2) was finished some weeks ago. Is there a rough estimate of the release date of tuda v3?
Great work @silenterus
Do you have an update on the WER?
Do you plan on including data from Common Voice? They claim to already have 127 h of validated German voice data: https://voice.mozilla.org/
Hi,
Sorry for the radio silence, I was caught up in other projects. Where is the cleaned data? I would like to check it further. At http://speech.tools/ I get a 403.
Thanks again!
@silenterus: It would be nice to keep the discussion on-topic. This bug is about mix-ups and errors in the data files. Feel free to open a new bug about training deep speech with our data.
@fbenites: speech.tools is hosted by @bmilde afaik, maybe he can fix that.
Sorry, you are absolutely right. I will put the results on my Git. Keep up the good work.
Are there any plans to integrate all the corrections from 2018 or later into a new corpus version?
I created a repository with the whole ~20 GB dataset here: https://github.com/Alienmaster/TudaDataset. For revision v4 I removed the incorrect data mentioned here and added the corrections made by @svenha. Currently I train on this dataset without any errors. Feel free to download and test the new revision.
@Alienmaster Thanks for picking this up and the clever issue template in the new TudaDataset repo.
If you have any new evaluation results, please let us know :-)
Closing this; release 4 of the tuda dataset contains the fixes: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v4.tar.gz
Our CV7 branch already uses this to train the models, together with the newest Common Voice data. See https://github.com/uhh-lt/kaldi-tuda-de/tree/CV7