hetpandya / youtube_tts_data_generator

A python library to generate speech dataset from Youtube videos
Apache License 2.0
35 stars 8 forks

Change default Sample Rate and Signal to Standard for TTS dataset #2

Closed: Sadam1195 closed this issue 3 years ago

Sadam1195 commented 3 years ago

Hi, I have found two more issues in your code and fixed them. I will send a PR for those too.

hetpandya commented 3 years ago

Hi @Sadam1195, I have updated the library and added support for changing the default sample rate. Please let me know if you have any other suggestions. Thanks!

Sadam1195 commented 3 years ago

Hi @hetpandya, sorry, I didn't get a chance to submit the PR I had already implemented. It's great that you fixed the punctuation issue in the new update; that was essential for building a TTS dataset.

One other thing I found missing in this project: the way the audio is split is not very intelligent. The audio should be split on punctuation, such as a full stop, comma, question mark, or exclamation mark, because it should be sliced at a pause rather than in the middle of speech, which cuts through phonemes and vowels.

Each clip should also be no longer than 10 seconds. Longer clips are harder for models to align, and chunks of 1-10 seconds are ideal for a TTS dataset. You could add a flag such as max_time (for example, 10 seconds): if the audio has no full stop within those 10 seconds, it would be better to cut it at whichever other punctuation mark is present, because otherwise the cut would chop off speech. Hope that makes sense.
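The suggestion above could be sketched roughly as follows. This is only an illustration, not the library's code: `Caption`, `group_captions`, and the `max_time` parameter are hypothetical names, and it assumes each caption carries only a start time, an end time, and punctuated text, so cuts can only fall on caption boundaries.

```python
from dataclasses import dataclass
from typing import List

SENTENCE_END = (".", "?", "!")
ANY_PUNCT = SENTENCE_END + (",", ";", ":")


@dataclass
class Caption:
    """Hypothetical caption record: times in seconds plus the caption text."""
    start: float
    end: float
    text: str


def ends_with(cap: Caption, marks) -> bool:
    return cap.text.rstrip().endswith(marks)


def group_captions(captions: List[Caption], max_time: float = 10.0) -> List[List[Caption]]:
    """Greedily group consecutive captions into chunks.

    A chunk is closed after a caption that ends a sentence. If adding the next
    caption would make the chunk longer than max_time, the chunk is cut after
    the last caption ending in any punctuation, or flushed whole if none does.
    A single caption longer than max_time is kept as its own chunk, because
    nothing is known about the timing inside it.
    """
    chunks: List[List[Caption]] = []
    current: List[Caption] = []
    for cap in captions:
        while current and cap.end - current[0].start > max_time:
            cut = len(current)  # default: flush everything collected so far
            for i in range(len(current) - 1, -1, -1):
                if ends_with(current[i], ANY_PUNCT):
                    cut = i + 1
                    break
            chunks.append(current[:cut])
            current = current[cut:]
        current.append(cap)
        if ends_with(cap, SENTENCE_END):
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


captions = [
    Caption(0.0, 3.0, "Hello there,"),
    Caption(3.0, 6.0, "this is a test."),
    Caption(6.0, 11.0, "A long caption without a stop"),
    Caption(11.0, 14.0, "that keeps going, "),
    Caption(14.0, 18.0, "and going"),
    Caption(18.0, 20.0, "until here."),
]
chunks = group_captions(captions, max_time=10.0)
for chunk in chunks:
    print(chunk[0].start, chunk[-1].end, " ".join(c.text.strip() for c in chunk))
```

With this sample input, the second chunk is cut at the comma because adding the next caption would push it past 10 seconds. The sketch does not enforce a minimum clip length, so very short sentences would still become their own clips.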

hetpandya commented 3 years ago

@Sadam1195 I understand the method for splitting and concatenating, but the approach you mentioned will be difficult, since we only know the beginning and end of each caption. I'll try to find a way to do it, though. Also, as far as I know, YouTube captions are generated on the basis of pauses in the speech.
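To illustrate why this is difficult: with only a caption's start and end time, the moment a punctuation mark is spoken can only be guessed. One crude way to guess is linear interpolation over character positions, which assumes a constant speaking rate. The function name `estimate_split_times` and its parameters are hypothetical, and the estimate may well land in the middle of a word.

```python
from typing import List


def estimate_split_times(start: float, end: float, text: str, marks: str = ".,?!") -> List[float]:
    """Guess when each punctuation mark inside a caption is spoken.

    Assumes speech is spread evenly over the characters of the caption,
    which is only a rough approximation of real speech.
    """
    text = text.strip()
    if not text:
        return []
    duration = end - start
    times = []
    for i, ch in enumerate(text):
        # Skip a mark at the very end: the caption's end time is already known.
        if ch in marks and i < len(text) - 1:
            times.append(round(start + duration * (i + 1) / len(text), 2))
    return times


print(estimate_split_times(0.0, 10.0, "Hello, world"))
```

Because the guess ignores real pauses, any cut point obtained this way would need checking against the audio itself before it is used for slicing.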