sarulab-speech / jtubespeech

Apache License 2.0

Total duration of segments after filtering bad segments is less than result in paper #13

Open ngocson1804 opened 2 years ago

ngocson1804 commented 2 years ago

Hi, I ran steps 4 and 5 using the file /data/ja/202103.csv you provided. I got more than 10M files with a total duration of over 10,000 hours for all segments. But after filtering bad segments with min_confidence_score=-0.3, the total number of good segments is only about 480,000, with a total duration of 351 hours. So the yield is roughly 3.5%, and the total duration is much less than what you mentioned in the paper (1,300 hours). Do you know the possible reasons?
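For reference, the filtering and yield arithmetic described above can be sketched as follows. This is only a minimal illustration, not the repository's own code: the `Segment` record, the helper names, and the toy data are hypothetical, and it assumes a segment is kept when its score is greater than or equal to the threshold.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical segment record; not the repository's actual file format."""
    video_id: str
    start: float  # seconds
    end: float    # seconds
    score: float  # confidence score, higher is better


def filter_segments(segments, min_confidence_score):
    # Assumption: a segment is kept when its score is >= the threshold.
    return [s for s in segments if s.score >= min_confidence_score]


def total_hours(segments):
    # Sum the segment durations (in seconds) and convert to hours.
    return sum(s.end - s.start for s in segments) / 3600.0


def yield_ratio(kept_hours, all_hours):
    # Fraction of the original duration that survives filtering.
    return kept_hours / all_hours if all_hours else 0.0


# Toy data, invented purely for illustration.
segments = [
    Segment("vid_a", 0.0, 3600.0, -0.1),
    Segment("vid_a", 3600.0, 7200.0, -2.5),
    Segment("vid_b", 0.0, 1800.0, -4.0),
]

strict = filter_segments(segments, -0.3)  # keeps only the first segment
loose = filter_segments(segments, -3.0)   # keeps the first two segments

print(total_hours(segments), total_hours(strict), total_hours(loose))
print(yield_ratio(351, 10_000))  # the figures from this comment: about 3.5%
```

With the numbers reported here, 351 hours kept out of roughly 10,000 hours gives the yield of about 3.5%.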

vebmaylrie commented 2 years ago

Please decrease the threshold. We used -3.0 to obtain more than 1,300 hours of data.

ngocson1804 commented 2 years ago

Thank you for the suggestion! I tried the threshold -3.0 and got 5.7 million segments with a total duration of 6,046 hours, which is far more than 1,300 hours. So I checked your paper more carefully, and it seems that you applied the -3.0 threshold only to the top 15k videos and the single-speaker subset to get 1,376 hours. Meanwhile, I have over 100,000 videos in total. Is there any reason to use only the top 15k videos? Should I use all 100k videos to get 6,046 hours of segments with a confidence score over -3.0?
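To make the comparison concrete, here is a rough sketch of applying the same threshold either to all videos or only to a top-N subset. It is an illustration under assumptions: the thread does not say how the top 15k videos were ranked, so ranking by mean segment score is a guess, and the function names, tuple layout, and toy data are all hypothetical.

```python
from collections import defaultdict


def top_videos_by_mean_score(segments, n):
    """Return the ids of the n videos with the highest mean segment score.

    Assumption: ranking by mean segment score is a stand-in; the actual
    ranking criterion is not stated in this thread.
    """
    scores = defaultdict(list)
    for video_id, _duration, score in segments:
        scores[video_id].append(score)
    ranked = sorted(
        scores,
        key=lambda v: sum(scores[v]) / len(scores[v]),
        reverse=True,
    )
    return set(ranked[:n])


def kept_hours(segments, min_score, videos=None):
    """Hours of segments scoring >= min_score, optionally limited to `videos`."""
    total_seconds = 0.0
    for video_id, duration, score in segments:
        if score >= min_score and (videos is None or video_id in videos):
            total_seconds += duration
    return total_seconds / 3600.0


# Toy (video_id, duration_seconds, score) tuples, invented for illustration.
toy = [
    ("a", 3600.0, -0.5),
    ("a", 1800.0, -1.0),
    ("b", 3600.0, -2.0),
    ("c", 3600.0, -5.0),
    ("c", 3600.0, -2.5),
]

top2 = top_videos_by_mean_score(toy, 2)
print(kept_hours(toy, -3.0))        # threshold applied to every video
print(kept_hours(toy, -3.0, top2))  # threshold applied to the top-2 videos only
```

The same threshold yields fewer hours on the subset simply because fewer videos contribute segments, which matches the gap between 6,046 hours on all videos and 1,376 hours on the subset.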

Also, is there any chance you could share the dev_easy_jun21, eval_easy_jun21, dev_normal_jun21 and eval_normal_jun21 sets?

vebmaylrie commented 2 years ago

> I got a total of over 100,000 videos. So, is there any reason to use only the top 15k videos?

There is no special reason. Our experiments using 15k videos were pilot studies.