Open laudominik opened 5 months ago
I ran into the same problem and fixed it by changing the _parse_utterance function. The number of columns in the .tsv files probably changed in some release of the corpus.
def _parse_utterance(
    lang_path: Path,
    language: str,
    audio_info: str,
) -> Optional[Tuple[Recording, SupervisionSegment]]:
    # Split one tab-separated row of the .tsv file into its fields.
    audio_info = audio_info.split("\t", -1)
    audio_path = lang_path / "clips" / audio_info[1]

    if not audio_path.is_file():
        logging.info(f"No such file: {audio_path}")
        return None

    recording_id = Path(audio_info[1]).stem
    recording = Recording.from_file(path=audio_path, recording_id=recording_id)

    segment = SupervisionSegment(
        id=recording_id,
        recording_id=recording_id,
        start=0.0,
        duration=recording.duration,
        channel=0,
        language=language,
        speaker=audio_info[0],
        # Field 3 is the sentence; field 2 is now the sentence id.
        text=audio_info[3].strip(),
        gender=audio_info[8],
        custom={
            "age": audio_info[7],
            "accents": audio_info[9],
        },
    )
    return recording, segment
OK, so I guess it's broken at the moment. For now I just use this:
cut -f 3 --complement file.tsv
Although a PR with your change would fix the immediate problem, the column order of the dataset might change again in the future, and the fix would then have to be repeated each time. Wouldn't it be better to make it parameterized?
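A note on the cut workaround above: --complement is a GNU coreutils extension, so it may not be available with other cut implementations (for example on macOS). A rough Python equivalent is sketched below for anyone in that situation. The function name drop_column is made up for illustration and is not part of lhotse.

```python
def drop_column(tsv_text: str, column: int = 3) -> str:
    """Remove one 1-based column from tab-separated text.

    Roughly what `cut -f 3 --complement file.tsv` does for column=3.
    `drop_column` is a hypothetical helper, not an lhotse function.
    """
    out_lines = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        # Keep everything before and after the unwanted column.
        kept = fields[: column - 1] + fields[column:]
        out_lines.append("\t".join(kept))
    return "\n".join(out_lines) + "\n"


if __name__ == "__main__":
    sample = "client_id\tpath\tsentence_id\tsentence\nspk1\tclip.mp3\tid1\thello\n"
    print(drop_column(sample), end="")
```

This only restores the older column layout for the header and every row; it does not make the parser itself robust to future column changes.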
Thank you for your help in fixing this. I will merge the fix as soon as the PR is ready. Parsing the rows into dicts and referring to the fields by column name is definitely the way to go.
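To make the "rows as dicts" idea concrete, here is a minimal sketch using csv.DictReader from the standard library. It is not the code that ended up in lhotse. The column names used here (client_id, path, sentence, age, gender, accents) and the helper names iter_rows and parse_row are assumptions for illustration, so check them against the header of your own .tsv files.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_rows(tsv_file: TextIO) -> Iterator[Dict[str, str]]:
    """Yield each row as a dict keyed by the header's column names."""
    # QUOTE_NONE keeps quote characters inside sentences as plain text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


def parse_row(row: Dict[str, str]) -> Dict[str, str]:
    """Pick fields by name, so column order and extra columns do not matter.

    The column names below are assumed, not taken from the thread.
    """
    return {
        "speaker": row["client_id"],
        "path": row["path"],
        "text": (row.get("sentence") or "").strip(),
        "gender": row.get("gender") or "",
        "age": row.get("age") or "",
        "accents": row.get("accents") or "",
    }


if __name__ == "__main__":
    sample = (
        "client_id\tpath\tsentence_id\tsentence\tage\tgender\taccents\n"
        "spk1\tclip.mp3\tid1\t Hello \tthirties\tfemale\t\n"
    )
    for row in iter_rows(io.StringIO(sample)):
        print(parse_row(row))
```

With this approach a newly inserted column such as the sentence id is simply ignored unless it is asked for by name, which avoids the index shift described in this issue.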
In common_voice.py, in _parse_utterance, SupervisionSegment.text is being set to audio_info[2] (sentence id) rather than audio_info[3] (sentence).
For reference: a recently downloaded common_voice_pl dataset has the following columns (in .tsv files):
Is this a bug or am I missing something?