huggingface / dataspeech

MIT License

Inaccurate Labels in Dataset #29

Open lmxue opened 4 months ago

lmxue commented 4 months ago

I have encountered inaccuracies in the labels provided in the dataset at https://huggingface.co/datasets/parler-tts/mls-eng-10k-tags_tagged_10k_generated.

The code:

    from datasets import load_dataset

    test_set = load_dataset("parler-tts/mls-eng-10k-tags_tagged_10k_generated", split="test")
    test_set[0]

The output:

    {'original_path': 'http://www.archive.org/download/lesmis3_0911_0911/lesmiserables_vol3_22_hugo_64kb.mp3',
     'begin_time': 119.15,
     'end_time': 132.26,
     'audio_duration': 13.109999999999983,
     'speaker_id': '7171',
     'book_id': '3158',
     'utterance_pitch_mean': 172.13397216796875,
     'utterance_pitch_std': 71.41407012939453,
     'snr': 47.84040069580078,
     'c50': 57.13105392456055,
     'speaking_rate': 'slightly slowly',
     'phonemes': 'ʌnd hi nu ðʌ ʌndʒʌst ʃeɪm ʌnd ðʌ pɔɪnjʌnt blʌʃʌz ʌv ædmɜ˞ʌbʌl ʌnd tɛɹʌbʌl tɹaɪʌl fɹʌm wɪtʃ ðʌ fibʌl ɪmɜ˞dʒ beɪs fɹʌm wɪtʃ ðʌ stɹɔŋ ɪmɜ˞dʒ sʌblaɪm',
     'gender': 'male',
     'pitch': 'very high pitch',
     'noise': 'moderate ambient sound',
     'reverberation': 'very confined sounding',
     'speech_monotony': 'slightly expressive',
     'text_description': ' A man speaks with a slightly expressive tone in a confined space, his voice echoing slightly but overall sounding quite clear, with moderate ambient sound in the background. His pitch is very high, but his delivery is only slightly slower than normal.',
     'original_text': 'and he knew the unjust shame and the poignant blushes of wretchedness admirable and terrible trial from which the feeble emerge base from which the strong emerge sublime',
     'text': 'And he knew the unjust shame and the poignant blushes of wretchedness. Admirable and terrible trial from which the feeble emerge base, from which the strong emerge sublime.'}
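As a quick sanity check on a record like this one, the timing fields can be tested for internal consistency: `audio_duration` should equal `end_time` minus `begin_time`. The following is a minimal sketch, not part of dataspeech; the helper name `duration_matches` and the tolerance are assumptions, and the values are copied from the output above.

```python
import math


def duration_matches(record, tol=1e-6):
    """Return True if audio_duration agrees with end_time - begin_time.

    `tol` is an assumed absolute tolerance in seconds that absorbs
    floating-point rounding.
    """
    expected = record["end_time"] - record["begin_time"]
    return math.isclose(record["audio_duration"], expected, abs_tol=tol)


# Timing fields copied from the first test record shown above.
record = {
    "begin_time": 119.15,
    "end_time": 132.26,
    "audio_duration": 13.109999999999983,
}
print(duration_matches(record))
```

A check like this only shows that the three fields agree with each other. It cannot show whether the labeled span actually contains the transcribed speech, which still requires listening to the audio.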

Analysis:

After listening to the audio at http://www.archive.org/download/doublelifeofalfredburton_1801_librivox/doublelifealfredburton_14_oppenheim_64kb.mp3, I found that the segment with 'text': 'Mr Cowper looked at his visitor in amazement, my young friend. He said: are you going to tell me that you have seen one of these beans? Not only that, but i have eaten one. Burton said, in fact, i have eaten two.' begins at 1.59 minutes and ends at 2.11 minutes in the audio. These times do not match the begin_time and end_time labels in the dataset.
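A comparison like this can be made repeatable by computing the offset between the times heard in the audio and the labeled `begin_time` and `end_time`. The sketch below is only illustrative: the helper names `label_offsets` and `misaligned` and the one-second tolerance are assumptions, the observed times must already be converted to seconds, and the observed values in the example are made up rather than measured.

```python
def label_offsets(row, observed_begin_s, observed_end_s):
    """Return (begin offset, end offset) in seconds: observed minus labeled."""
    return (
        observed_begin_s - row["begin_time"],
        observed_end_s - row["end_time"],
    )


def misaligned(row, observed_begin_s, observed_end_s, tol_s=1.0):
    """Return True if either boundary differs from its label by more than tol_s."""
    begin_off, end_off = label_offsets(row, observed_begin_s, observed_end_s)
    return abs(begin_off) > tol_s or abs(end_off) > tol_s


# Labels taken from the first test record; the observed times are hypothetical.
row = {"begin_time": 119.15, "end_time": 132.26}
print(misaligned(row, 119.0, 132.5))  # boundaries within one second of the labels
print(misaligned(row, 95.0, 127.0))   # begin boundary off by more than 20 seconds
```

Reporting the signed offsets rather than only a pass/fail flag helps distinguish a constant shift across a whole file from a segment that is labeled with the wrong text.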