huggingface / dataspeech


Required columns within a dataset #5

Open hyejin111 opened 5 months ago

hyejin111 commented 5 months ago

Hello! When testing my dataset with the metadata_to_text.py script, I saw the error message "Column(s) ['gender'] do not exist". I don't have `gender` or `speaker_id` columns in my dataset. Do I need both to run that code?

ylacombe commented 5 months ago

Hey @hyejin111, good catch, thanks for opening the issue!

You indeed need `gender` and `speaker_id` to compute per-speaker pitch: pitch is computed at the speaker level and then compared within each gender, since a low-pitched male voice is usually much lower-pitched than a low-pitched female voice.
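To illustrate, here's a minimal sketch of that kind of computation: average pitch per speaker, then bin each speaker against the distribution of speakers of the *same* gender. The column names and bin labels are illustrative, not the exact dataspeech implementation:

```python
import numpy as np

def speaker_pitch_labels(rows, n_bins=3):
    """Toy sketch: label each speaker's pitch relative to other speakers
    of the same gender. `rows` are dicts with illustrative keys
    'speaker_id', 'gender', 'pitch_mean'."""
    # 1. Average utterance-level pitch per (speaker, gender).
    by_speaker = {}
    for r in rows:
        key = (r["speaker_id"], r["gender"])
        by_speaker.setdefault(key, []).append(r["pitch_mean"])
    speaker_pitch = {k: float(np.mean(v)) for k, v in by_speaker.items()}

    # 2. Bin each speaker within its own gender's distribution, since
    # a "low" male pitch sits far below a "low" female pitch.
    labels = {}
    bin_names = ["low", "medium", "high"]  # assumes n_bins == 3
    for gender in {g for (_, g) in speaker_pitch}:
        items = [(spk, p) for (spk, g), p in speaker_pitch.items() if g == gender]
        pitches = np.array([p for _, p in items])
        edges = np.quantile(pitches, np.linspace(0, 1, n_bins + 1))
        for spk, p in items:
            idx = min(int(np.digitize(p, edges[1:-1])), n_bins - 1)
            labels[spk] = bin_names[idx]
    return labels
```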

I need to make this clearer in the README, and also give an option to make these columns optional; I'll add this to my TODO!

ittailup commented 5 months ago

I actually think these are potential future features, @ylacombe! I had great success testing this with a single-speaker dataset, since I could simply fill in those columns by hand.
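For anyone else with a single-speaker dataset hitting the same error, the workaround is a few lines with `datasets` (the dataset path here is a placeholder):

```python
from datasets import load_dataset

ds = load_dataset("path/to/my_dataset", split="train")  # placeholder path

# Single speaker: constant values are enough for the per-speaker,
# per-gender pitch statistics to have something to group on.
ds = ds.add_column("speaker_id", ["speaker_0"] * len(ds))
ds = ds.add_column("gender", ["male"] * len(ds))  # set to the actual gender
```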

But I also have a very large multi-speaker dataset which is untagged, because it was transcribed from many source documents, and manually tagging speaker identities across ~900k audio files is impractical. It feels like dataspeech could have a step that tries to guess `speaker_id` or `gender` from embeddings; good open models for speaker embeddings are available right now from pyannote, SpeechBrain, and NeMo.
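A rough sketch of what such a guessing step could look like, using SpeechBrain's ECAPA speaker embeddings plus agglomerative clustering. The checkpoint is real, but the distance threshold is a guess that would need tuning per dataset, and this is only one of several viable recipes:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def pseudo_speaker_ids(waveforms, threshold=0.7):
    """Assign pseudo speaker_ids by clustering speaker embeddings.
    `waveforms` is a list of 1-D 16 kHz torch tensors."""
    embeddings = np.stack([
        encoder.encode_batch(wav.unsqueeze(0)).squeeze().cpu().numpy()
        for wav in waveforms
    ])
    # Cosine-distance clustering: no need to know the speaker count upfront.
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=threshold,
        metric="cosine",
        linkage="average",
    )
    return [f"speaker_{label}" for label in clustering.fit_predict(embeddings)]
```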

With this stage completed, we'd only need an earlier diarization stage plus slicing to make dataspeech work for ingesting large audio-file datasets and producing TTS speech datasets.
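For that earlier stage, pyannote's off-the-shelf diarization pipeline plus simple slicing would look roughly like this (the checkpoint is gated, so the token is a placeholder; verify the names against the pyannote docs):

```python
from pyannote.audio import Pipeline
from pydub import AudioSegment

# Gated model: requires accepting the terms on the Hub and an access token.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_..."  # placeholder
)

def slice_by_speaker(path):
    """Yield (speaker_label, audio_clip) pairs from one long recording."""
    diarization = diarizer(path)
    audio = AudioSegment.from_file(path)
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        # pydub slices in milliseconds.
        yield speaker, audio[int(turn.start * 1000):int(turn.end * 1000)]
```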

Enabling this type of workflow would mean a lot more data for languages, regional accents, etc. Maybe regions and accents could even get embeddings of their own, so that datasets could encode this information into the text descriptions. In my fine-tuning I manually changed "A man speaks" or "A male speaker" to "A {region} man".
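That manual edit is easy to script; here's a small sketch with `Dataset.map` (the `description` column name and the regexes are assumptions about your annotated dataset):

```python
import re
from datasets import load_dataset

ds = load_dataset("path/to/annotated_dataset", split="train")  # placeholder
REGION = "Scottish"  # whatever tag applies to this subset

def add_region(example):
    desc = example["description"]
    desc = re.sub(r"\bA man\b", f"A {REGION} man", desc)
    desc = re.sub(r"\bA male speaker\b", f"A {REGION} male speaker", desc)
    example["description"] = desc
    return example

ds = ds.map(add_region)
```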

ylacombe commented 5 months ago

Hey @ittailup,

Thanks for your insights! This is indeed a very exciting direction, one that we could apply to numerous ASR datasets. For gender recognition, there's also this model available in transformers.
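As a hedged sketch (the checkpoint id below is a placeholder, not the model linked above), the general shape with transformers' audio-classification pipeline is:

```python
from transformers import pipeline

# Placeholder checkpoint id: substitute the gender-recognition model
# linked above (any audio-classification checkpoint is called the same way).
classifier = pipeline("audio-classification", model="some-org/gender-recognition")

scores = classifier("sample.wav")
# e.g. [{"label": "female", "score": 0.98}, {"label": "male", "score": 0.02}]
gender = max(scores, key=lambda s: s["score"])["label"]
```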

Regarding region or accent, this is something that we definitely want to implement and that is being discussed currently in #2 with @MilanaShhanukova.

I don't have the bandwidth yet for any of these improvements, but I would love to support and help any community effort. Would either of you be interested in implementing them? Once they're tested and optimized on small datasets, I can probably run the improvements on bigger datasets on the HF side and push the results to the Hub, which would make a big splash in the community and enable better models (starting with Parler-TTS)!

On a side note, do you have any feedback on the labeling and fine-tuning processes? Also, I'd be happy to hear samples from the fine-tuned model to share, or better, to test an available checkpoint if you have released one.

Let me know if either of you is interested! And as said, happy to help.