lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0
935 stars, 214 forks

Common voice wrong metadata added to supervision set #1325

Open laudominik opened 5 months ago

laudominik commented 5 months ago

In common_voice.py, _parse_utterance sets SupervisionSegment.text to audio_info[2] (the sentence ID) rather than audio_info[3] (the sentence).

For reference, a recently downloaded common_voice_pl dataset has the following columns in its .tsv files:

  1. client_id
  2. path
  3. sentence_id
  4. sentence
  5. sentence_domain
  6. up_votes
  7. down_votes
  8. age
  9. gender
  10. accents
  11. variant
  12. locale
  13. segment

Is this a bug or am I missing something?
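One way to check which index a column really has is to read the header row of the .tsv file. The snippet below is a minimal sketch: the helper name `column_indices` is hypothetical, and the sample header is copied from the column list above rather than read from a real file.

```python
import csv
import io


def column_indices(tsv_file):
    """Map each header name to its zero-based column index."""
    header = next(csv.reader(tsv_file, delimiter="\t"))
    return {name: i for i, name in enumerate(header)}


# Header copied from the column list above; a real run would pass an
# opened .tsv file instead of this in-memory sample.
sample = io.StringIO(
    "client_id\tpath\tsentence_id\tsentence\tsentence_domain\tup_votes\t"
    "down_votes\tage\tgender\taccents\tvariant\tlocale\tsegment\n"
)
indices = column_indices(sample)
print(indices["sentence_id"], indices["sentence"])  # prints: 2 3
```

With this layout, the sentence text sits at zero-based index 3, which matches the observation that audio_info[2] holds the sentence ID.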

daniel-dona commented 5 months ago

I ran into the same problem and fixed it by changing the _parse_utterance function. Some release of the corpus probably changed the number of columns.

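import logging
from pathlib import Path
from typing import Optional, Tuple

from lhotse import Recording, SupervisionSegment


def _parse_utterance(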
    lang_path: Path,
    language: str,
    audio_info: str,
) -> Optional[Tuple[Recording, SupervisionSegment]]:
    audio_info = audio_info.split("\t", -1)
    audio_path = lang_path / "clips" / audio_info[1]

    if not audio_path.is_file():
        logging.info(f"No such file: {audio_path}")
        return None

    recording_id = Path(audio_info[1]).stem
    recording = Recording.from_file(path=audio_path, recording_id=recording_id)

    segment = SupervisionSegment(
        id=recording_id,
        recording_id=recording_id,
        start=0.0,
        duration=recording.duration,
        channel=0,
        language=language,
        speaker=audio_info[0],
        text=audio_info[3].strip(),
        gender=audio_info[8],
        custom={
            "age": audio_info[7],
            "accents": audio_info[9],
        },
    )
    return recording, segment

laudominik commented 5 months ago

OK, so I guess it's broken at the moment. For now I just use this workaround, which drops the third column (sentence_id) from the file:

cut -f 3 --complement file.tsv

A PR with your change would fix this for now, but if the column order of the dataset changes again, the same fix would have to be repeated. Wouldn't it be better to make the parsing parameterized?

pzelasko commented 5 months ago

Thank you for your help in fixing this. I will merge the fix as soon as the PR is ready. Parsing the rows into dicts and referring to fields by column name is definitely the way to go.
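The dict-based approach could look roughly like the sketch below. It is not the merged lhotse fix: the helper names `read_rows` and `supervision_fields` are hypothetical, the sample row is made up, and `quoting=csv.QUOTE_NONE` is an assumption that the .tsv files contain unquoted text.

```python
import csv
import io
from typing import Dict, Iterator


def read_rows(tsv_file) -> Iterator[Dict[str, str]]:
    """Yield each TSV row as a dict keyed by the header's column names."""
    # QUOTE_NONE (an assumption) keeps quote characters in sentences as literal text.
    yield from csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)


def supervision_fields(row: Dict[str, str]) -> dict:
    """Pick the fields used by _parse_utterance by name instead of by position."""
    return {
        "speaker": row["client_id"],
        "text": row["sentence"].strip(),
        # .get() tolerates corpus releases that lack an optional column.
        "gender": row.get("gender"),
        "custom": {"age": row.get("age"), "accents": row.get("accents")},
    }


# Made-up sample with a subset of the columns; the order no longer matters.
sample = io.StringIO(
    "client_id\tpath\tsentence_id\tsentence\tage\tgender\taccents\n"
    "spk1\tclip_0001.mp3\tabc123\tHello there \ttwenties\tfemale\t\n"
)
for row in read_rows(sample):
    print(supervision_fields(row))
```

Because every field is looked up by name, adding or reordering columns in a later corpus release would not shift the values, which is the problem a positional index such as audio_info[2] runs into.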