Common voice wrong metadata added to supervision set

laudominik commented 5 months ago

In common_voice.py, _parse_utterance sets SupervisionSegment.text to audio_info[2] (the sentence ID) rather than audio_info[3] (the sentence).

For reference: a recently downloaded common_voice_pl dataset has the following columns (in its .tsv files):

client_id
path
sentence_id
sentence
sentence_domain
up_votes
down_votes
age
gender
accents
variant
locale
segment

Is this a bug or am I missing something?
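
To double-check the indices, here is a minimal sketch that maps each column name to its position. The header string is copied from the column list above rather than read from an actual .tsv file, and the name index_of is only illustrative.

```python
# Header reconstructed from the column list above (assumption: tab-separated,
# in exactly this order).
header = (
    "client_id\tpath\tsentence_id\tsentence\tsentence_domain\t"
    "up_votes\tdown_votes\tage\tgender\taccents\tvariant\tlocale\tsegment"
)

# Map every column name to its zero-based position in a row.
index_of = {name: i for i, name in enumerate(header.split("\t"))}

# Index 2 is the sentence ID; the transcript itself is at index 3.
print(index_of["sentence_id"], index_of["sentence"])
```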

daniel-dona commented 5 months ago

I found the same problem and fixed it by changing the _parse_utterance function. Probably at some release of the corpus they changed the number of columns.

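import logging
from pathlib import Path
from typing import Optional, Tuple

from lhotse import Recording, SupervisionSegment


def _parse_utterance(
    lang_path: Path,
    language: str,
    audio_info: str,
) -> Optional[Tuple[Recording, SupervisionSegment]]:
    # One TSV row: client_id, path, sentence_id, sentence, ..., age, gender, accents, ...
    audio_info = audio_info.split("\t")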
    audio_path = lang_path / "clips" / audio_info[1]

    if not audio_path.is_file():
        logging.info(f"No such file: {audio_path}")
        return None

    recording_id = Path(audio_info[1]).stem
    recording = Recording.from_file(path=audio_path, recording_id=recording_id)

    segment = SupervisionSegment(
        id=recording_id,
        recording_id=recording_id,
        start=0.0,
        duration=recording.duration,
        channel=0,
        language=language,
        speaker=audio_info[0],
        text=audio_info[3].strip(),
        gender=audio_info[8],
        custom={
            "age": audio_info[7],
            "accents": audio_info[9],
        },
    )
    return recording, segment

laudominik commented 5 months ago

OK, so I guess it's broken at the moment. For now I just use this to drop the third column (sentence_id), so the existing indices line up again:

cut -f 3 --complement file.tsv
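
The --complement flag is a GNU cut extension, so on systems without it the same column drop can be sketched in Python. This is only an illustration of the workaround; drop_column is a hypothetical helper, not part of lhotse.

```python
def drop_column(line: str, index: int) -> str:
    """Return a tab-separated line with the field at `index` removed."""
    fields = line.rstrip("\n").split("\t")
    del fields[index]
    return "\t".join(fields)


# Dropping index 2 (the third column) mirrors `cut -f 3 --complement`.
print(drop_column("client\tclip.mp3\tsentence-id\tsome text\n", 2))
```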

Although a PR with your change would fix the immediate problem, I suppose the column order of the dataset might change again in the future, and the fix would then have to be repeated each time. Wouldn't it be better to make it parameterized?

pzelasko commented 5 months ago

Thank you for your help in fixing this. I will merge the fix as soon as the PR is ready. Parsing the rows into dicts and referring to them by column names is definitely the way to go.
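
A minimal sketch of that idea, using only the standard library's csv.DictReader. It is not the merged fix: iter_rows and row_to_fields are hypothetical helper names, the sample row is made up, and the column names are taken from the list earlier in this thread.

```python
import csv
import io
from typing import Dict, Iterator, Optional, TextIO


def iter_rows(tsv: TextIO) -> Iterator[Dict[str, str]]:
    """Yield each TSV row as a dict keyed by the header's column names."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.DictReader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def row_to_fields(row: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Pick the supervision fields by name, so column order no longer matters."""
    return {
        "speaker": row["client_id"],
        "path": row["path"],
        "text": row["sentence"].strip(),
        # Optional metadata: a missing or empty column becomes None.
        "gender": row.get("gender") or None,
        "age": row.get("age") or None,
        "accents": row.get("accents") or None,
    }


# Made-up sample with a reduced set of columns, for illustration only.
sample = (
    "client_id\tpath\tsentence_id\tsentence\tage\tgender\taccents\n"
    "abc\tclip_1.mp3\ts1\t Hello world \ttwenties\t\t\n"
)
rows = [row_to_fields(r) for r in iter_rows(io.StringIO(sample))]
print(rows[0]["text"])
```

With name-based lookup, a newly added column such as sentence_id or sentence_domain would not shift the other fields.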

lhotse-speech / lhotse

Common voice wrong metadata added to supervision set #1325