Total duration of segments after filtering bad segments is less than result in paper

ngocson1804 commented 2 years ago

Hi, I ran steps 4 and 5 using the file /data/ja/202103.csv you provided. I got more than 10M files with a total duration of over 10,000 hours for all segments. But after filtering bad segments with min_confidence_score=-0.3, the total number of good segments is only about 480,000, with a total duration of 351 hours. So the yield is roughly 3.5%, and the total duration is much less than what you report in the paper (1,300 hours). Do you know the possible reasons?

vebmaylrie commented 2 years ago

Please lower the threshold. We used -3.0 to obtain more than 1,300 hours of data.
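To see how strongly the threshold affects the retained duration, a sweep over a few values can help. The following is only a minimal sketch: the `Segment` record, its field names, and the helper functions are hypothetical and not taken from the repository, and it assumes a segment is kept when its score is greater than or equal to the threshold.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    # Hypothetical record layout; not the repository's actual format.
    video_id: str
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    score: float  # confidence score (closer to 0 means more confident)

    @property
    def duration(self) -> float:
        return self.end - self.start


def filter_segments(segments: Iterable[Segment], min_confidence_score: float) -> List[Segment]:
    """Keep segments whose score is at or above the threshold (assumed rule)."""
    return [s for s in segments if s.score >= min_confidence_score]


def total_hours(segments: Iterable[Segment]) -> float:
    """Total duration of the given segments, in hours."""
    return sum(s.duration for s in segments) / 3600.0


if __name__ == "__main__":
    # Toy data for illustration only; these are not real results.
    toy = [
        Segment("a", 0.0, 3600.0, -0.1),
        Segment("a", 3600.0, 7200.0, -1.5),
        Segment("b", 0.0, 1800.0, -2.9),
        Segment("b", 1800.0, 3600.0, -5.0),
    ]
    all_hours = total_hours(toy)
    for threshold in (-0.3, -3.0):
        kept = filter_segments(toy, threshold)
        kept_hours = total_hours(kept)
        print(f"threshold={threshold}: {len(kept)} segments, "
              f"{kept_hours:.2f} h, yield={kept_hours / all_hours:.1%}")
```

Running the same sweep on the real segment list should show whether the gap between 351 hours and the figure in the paper is explained by the threshold alone.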

ngocson1804 commented 2 years ago

Thank you for the suggestion! With a threshold of -3.0 I got 5.7 million segments with a total duration of 6,046 hours, which is far more than 1,300 hours. So I checked your paper more carefully, and it seems that you applied the -3.0 threshold only to the top 15k videos and the single-speaker subset to get 1,376 hours, whereas I got over 100,000 videos in total. Is there any reason to use only the top 15k videos? Should I use all 100k videos to get the 6,046 hours of segments with a confidence score above -3.0?

Also, is there any chance you could share the dev_easy_jun21, eval_easy_jun21, dev_normal_jun21 and eval_normal_jun21 sets?

vebmaylrie commented 2 years ago

> I got a total of over 100,000 videos. So, is there any reason to use only the top 15k videos?

There is no special reason. Our experiments using 15k videos were pilot studies.
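For anyone who wants to compare a top-N subset against the full set, here is a rough sketch. The thread does not say how the top 15k videos were ranked, so this example assumes, purely for illustration, a ranking by retained hours per video. The tuple layout and function names are hypothetical as well.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: (video_id, start_sec, end_sec, score)
Segment = Tuple[str, float, float, float]


def hours_per_video(segments: Iterable[Segment], min_score: float) -> Dict[str, float]:
    """Sum the duration (in hours) of segments at or above min_score, per video."""
    hours: Dict[str, float] = defaultdict(float)
    for video_id, start, end, score in segments:
        if score >= min_score:
            hours[video_id] += (end - start) / 3600.0
    return dict(hours)


def top_n_videos(hours_by_video: Dict[str, float], n: int) -> List[str]:
    """Rank videos by retained hours (assumed criterion); ties broken by video ID."""
    ranked = sorted(hours_by_video, key=lambda v: (-hours_by_video[v], v))
    return ranked[:n]


if __name__ == "__main__":
    # Toy data for illustration only; these are not real results.
    toy = [
        ("a", 0.0, 3600.0, -0.1),
        ("a", 3600.0, 5400.0, -4.0),
        ("b", 0.0, 7200.0, -2.0),
        ("c", 0.0, 1800.0, -1.0),
    ]
    per_video = hours_per_video(toy, min_score=-3.0)
    subset = top_n_videos(per_video, n=2)
    subset_hours = sum(per_video[v] for v in subset)
    print(f"top videos: {subset}, {subset_hours:.2f} h "
          f"of {sum(per_video.values()):.2f} h total")
```

Setting `n` to the total number of videos gives the full-set figure, so the same script covers both cases.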

sarulab-speech / jtubespeech, issue #13