using aclewStarter dataset for evaluation

srvk / DiViMe

ACLEW Diarization Virtual Machine

Apache License 2.0

using aclewStarter dataset for evaluation #57

Closed jihopark closed 6 years ago

jihopark commented 6 years ago

Hi,

I tried to download the aclewStarter dataset using ./getAclewStarter.sh but only got 2 audio files (BER_0485_12_07_09123 & BER_0713_07_02_21041). I am guessing this is because of the private status of the audio files? I was peeking at issue #15 and it seems like you guys could get 4 files, but I cannot for some reason :(

Also, I listened to the two files, and both are from children who are too young to vocalize much (the audio contains only a very small portion of their speech, and a lot of crying). From what I read in LENA's report, their algorithms filter out non-speech sounds. For this reason, I feel that evaluating the functionality with these samples is too limited.

I have downloaded the HomeBank dataset (VanDam-5minutes), but do not have the exact labels for SAD.

Do you have a recommendation for any other publicly available dataset for evaluation purposes?

alecristia commented 6 years ago

On the ACLEW starter dataset, it seems there is something going on with the Databrary archive. We're not sure why that happens.

Actually, we do count crying as a vocalization. So ideally we should have VADs, not SADs. I'm sorry that we are not clear on that in the documentation! The LENA system also counts crying as part of children's vocalizations - here's an example from their native file format (.its): <Segment spkr="CHN" average_dB="-17.98" peak_dB="-8.57" conversationInfo="|EC|1|1|1|AICM|TIMR|FI|" childUttCnt="1" childUttLen="P0.73S" startUtt1="PT15.32S" endUtt1="PT16.05S" childCryVfxLen="P0.00S" startTime="PT15.32S" endTime="PT16.05S" />

The bit in bold indicates this child voc contained crying.
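As a rough illustration of reading such a segment, here is a minimal Python sketch using only the standard library. It assumes the crying-related attribute is `childCryVfxLen` (the only attribute in the example that mentions crying) and that time values look like `PT15.32S` or `P0.73S`, as in the segment above. The helper names `its_seconds` and `parse_segment` are hypothetical and not part of any LENA or HomeBank tool.

```python
import re
import xml.etree.ElementTree as ET

# Segment copied verbatim from the example above (LENA .its format).
SEGMENT = (
    '<Segment spkr="CHN" average_dB="-17.98" peak_dB="-8.57" '
    'conversationInfo="|EC|1|1|1|AICM|TIMR|FI|" childUttCnt="1" '
    'childUttLen="P0.73S" startUtt1="PT15.32S" endUtt1="PT16.05S" '
    'childCryVfxLen="P0.00S" startTime="PT15.32S" endTime="PT16.05S" />'
)

# Assumption: durations are written as "P" (optionally followed by "T"),
# a decimal number of seconds, then "S", as seen in the example.
_DURATION = re.compile(r"^PT?(\d+(?:\.\d+)?)S$")


def its_seconds(value):
    """Convert a value such as 'PT15.32S' or 'P0.73S' to seconds (float)."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"unexpected duration format: {value!r}")
    return float(match.group(1))


def parse_segment(xml_text):
    """Extract speaker, boundaries and cry length from one <Segment>."""
    elem = ET.fromstring(xml_text)
    return {
        "speaker": elem.get("spkr"),
        "start": its_seconds(elem.get("startTime")),
        "end": its_seconds(elem.get("endTime")),
        # Treat a missing cry attribute as zero seconds (an assumption).
        "cry_len": its_seconds(elem.get("childCryVfxLen", "P0S")),
    }


segment = parse_segment(SEGMENT)
print(segment)
```

For a whole .its file, the same helper could be applied to every `Segment` element found with `ET.parse(path).iter("Segment")`, passing `ET.tostring(elem)` or reading the attributes directly.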

Unfortunately, public datasets (other than ACLEW starter) don't have very precise boundary annotations, but you're right to look at VanDam's datasets. He actually also has a daylong one. If you know how to process .its or .cha files, you can also get those from HomeBank. We're working on easy-to-use tools to facilitate .its and .cha processing, but I don't think they're ready yet. You'll find a lot of useful code on homebankCode: https://github.com/homebankCode

jihopark commented 6 years ago

@alecristia Thanks for your reply!

So is there no dataset in your paper (The ACLEW DiViMe: An easy-to-use diarization tool) that I can use as a benchmark? I would like to test new SAD/VAD solutions and measure their performance as you did in Tables 1, 2, and 3.

alecristia commented 6 years ago

That would be terrific! You should use the DiHARD challenge data (last line of each table, if I remember correctly) as benchmarking input: https://coml.lscp.ens.fr/dihard/index.html

As of a month ago, the data were going to be released via LDC "very soon", but you should sign up to the mailing list to get the notice. The data will be free for LDC members, and probably pretty cheap otherwise, but let me know if that's a hurdle.

Please, do keep me posted if you can access it!

jihopark commented 6 years ago

@alecristia How can I get access to the development/evaluation data for the DiHARD challenge? I cannot find a link at https://coml.lscp.ens.fr/dihard/data.html. It seems the competition has already finished, so I cannot participate anymore. I know you are one of the organizers, so can you help me get access?