orcasound / aifororcas-podcast

Tool for exploring and improving AI detection of SRKW calls to generate expert-labeled data -- prototype at https://ai4orcas.net/portfolio/pod-cast-annotation-system/
MIT License

Remove negative data from samples from PodCast rounds 1-10 #8

Open scottveirs opened 2 years ago

scottveirs commented 2 years ago

From Zoe via Orcasound leading up to 2022 Microsoft hackathon:

I found a few conflicting data points in the training annotation data over the weekend. The file is annotations.tsv from s3://acoustic-sandbox/labeled-data/detection/train/TrainDataLatest_PodCastAllRounds_123567910.tar.gz. There are 12 data files that are labeled as both positive samples (starting_time=0, duration_s>0) and negative samples (starting_time=0, duration_s=0). The negative entries should probably be removed. See the screenshots below. I don't think it would have a big impact on the training results (it's only 12 samples out of thousands), but it would be nice to clean it up with the upcoming hackathon. Alternatively, people who are aware of the issue can remove the entries manually while loading the data. If someone is working on generating or updating the labeled data this year, I can also forward these entries to them.

[Screenshot: training_overlap_neg]

[Screenshot: trainning_overlap_pos]
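
For anyone who wants to remove the conflicting negatives while loading the data, here is a minimal sketch of one way to do it. It is not the project's loader. The column names (`wav_filename`, `start_time_s`, `duration_s`) and the sample rows are assumptions for illustration, so adjust them to the actual header of annotations.tsv. A row counts as negative when both start time and duration are 0, and a file counts as conflicting when it also has at least one row with duration > 0.

```python
import csv
import io

# Assumed column names; change these to match the real annotations.tsv header.
FILE_COL = "wav_filename"
START_COL = "start_time_s"
DURATION_COL = "duration_s"


def is_negative(row):
    """A negative sample is encoded as start time 0 and duration 0."""
    return float(row[START_COL]) == 0.0 and float(row[DURATION_COL]) == 0.0


def drop_conflicting_negatives(rows):
    """Return (cleaned_rows, conflicting_files).

    A file is conflicting when it has at least one positive row
    (duration > 0) and at least one negative row. Only the negative
    rows of conflicting files are dropped; everything else is kept.
    """
    positive_files = set()
    negative_files = set()
    for row in rows:
        if is_negative(row):
            negative_files.add(row[FILE_COL])
        elif float(row[DURATION_COL]) > 0:
            positive_files.add(row[FILE_COL])

    conflicting = sorted(positive_files & negative_files)
    conflicting_set = set(conflicting)
    cleaned = [
        row for row in rows
        if not (row[FILE_COL] in conflicting_set and is_negative(row))
    ]
    return cleaned, conflicting


# Made-up sample data standing in for annotations.tsv.
SAMPLE_TSV = (
    "wav_filename\tstart_time_s\tduration_s\n"
    "a.wav\t0\t2.45\n"   # positive
    "a.wav\t0\t0\n"      # negative entry for the same file (conflict)
    "b.wav\t0\t0\n"      # negative only, kept
    "c.wav\t1.5\t2.45\n" # positive only, kept
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
cleaned, conflicting = drop_conflicting_negatives(rows)
print(conflicting)   # ['a.wav']
print(len(cleaned))  # 3
```

To run this against the real file, replace the `io.StringIO(SAMPLE_TSV)` source with an opened annotations.tsv. Printing `conflicting` first is a cheap way to confirm that the count matches the 12 files reported above before dropping anything.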

scottveirs commented 2 years ago

@liu-zoe @akashmjn I added this as part of PodCast related challenges for the 2022 Microsoft hackathon here, but if we don't get to it I'll re-visit with the HALLO folks who are also trying to improve and increase the labeled data for SRKW calls.