srvk / DiViMe

ACLEW Diarization Virtual Machine
Apache License 2.0

incorporate s4d #29

Open alecristia opened 6 years ago

alecristia commented 6 years ago
riebling commented 5 years ago

does this include already-trained models? otherwise there might be more steps:

  - decide on training data
  - human label the data
  - get training data into s4d format
  - perform training to create models

I know the LIUM system comes with pretrained models derived from - is it French or Estonian? - broadcast news
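For the "get training data into s4d format" step, the conversion could look something like the sketch below. This assumes s4d still reads the LIUM-style `.seg` format (one line per segment: show name, channel, start and length counted in 10 ms frames, gender, band, environment, speaker label); the `segments` input, the field defaults, and the file names are illustrative, not taken from the s4d docs:

```python
# Minimal sketch: write human-labeled segments as a LIUM/s4d-style .seg file.
# Assumes the classic LIUM line format:
#   show channel start length gender band environment speaker
# with start/length in 10 ms frames. The "U" (unknown) defaults are placeholders.

def write_seg(show, segments, path):
    """segments: list of (onset_sec, offset_sec, speaker_label) tuples."""
    with open(path, "w") as out:
        for onset, offset, speaker in segments:
            start = int(round(onset * 100))            # seconds -> 10 ms frames
            length = int(round((offset - onset) * 100))
            out.write(f"{show} 1 {start} {length} U U U {speaker}\n")

# Hypothetical example: three segments from a human-labeled clip
write_seg("VanDam_clip01",
          [(0.0, 2.5, "FAN"), (2.5, 4.1, "CHN"), (4.1, 6.0, "SIL")],
          "VanDam_clip01.seg")
```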

macw commented 5 years ago

Eric, I am not sure what "this" refers to. To improve DiViMe performance, we want new training data with good human labels. My suggestion is to improve the CHAT data found at https://homebank.talkbank.org/access/Public/VanDam-5minute.html

For purposes of developing a workflow, we could start without any actual improvements by making sure there is a smooth way to go from CHAT to s4d and then to training, etc. The only trick is that we need to make sure we preserve all the codes used by LENA. Specifically, we need the list in the desiderata.doc file we created over a year ago; it is in the diarization folder in the SpeechKitchen Google Drive folder. Here is what it lists:

  1. Adult male
  2. Adult female
  3. Child wearing recorder
  4. Other child in environment (not wearing recorder)
  5. Overlaps of the above 4
  6. TV/electronic/radio
  7. Noise
  8. Silence (or noise below some threshold, maybe 30 dB or something)
  9. Garbage/unknown/grab-bag (hopefully very small category)
  10. Possibly identify &=cries, &=yells etc.

Most of these are based on the speaker labels for each utterance, which are given in the CHAT headers. The @Participants header always includes:

  - SIL Silence LENA
  - MAN Male_Adult_Near Male
  - MAF Male_Adult_Far Male
  - FAN Female_Adult_Near Female
  - FAF Female_Adult_Far Female
  - CHN Key_Child_Clear Target_Child
  - CHF Key_Child_Unclear Target_Child
  - CXN Other_Child_Near Child
  - CXF Other_Child_Far Child
  - NON Noise_Near LENA
  - NOF Noise_Far LENA
  - OLN Overlap_Near LENA
  - OLF Overlap_Far LENA
  - TVN Electronic_Sound_Near Media
  - TVF Electronic_Sound_Far Media

However, we probably want to merge the Near and Far types, since they are distinguished only by dB level.
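As a concrete illustration of that merge, here is a minimal sketch of a mapping from the LENA tier codes above to merged categories; the category names are placeholders, not an agreed-upon labeling scheme:

```python
# Sketch: collapse LENA Near/Far tier codes into merged categories
# corresponding to the desiderata list. Category names are illustrative.
LENA_TO_CATEGORY = {
    "MAN": "adult_male",    "MAF": "adult_male",
    "FAN": "adult_female",  "FAF": "adult_female",
    "CHN": "key_child",     "CHF": "key_child",
    "CXN": "other_child",   "CXF": "other_child",
    "OLN": "overlap",       "OLF": "overlap",
    "TVN": "electronic",    "TVF": "electronic",
    "NON": "noise",         "NOF": "noise",
    "SIL": "silence",
}

def merged_label(lena_code):
    # Unknown codes fall into the garbage/grab-bag category (item 9).
    return LENA_TO_CATEGORY.get(lena_code, "garbage")
```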

The other trick is for #10, which is not in the headers but in codes such as &=cries inside each line. We need to check what the full list of these is; I know it includes crying, vocalization, yells, and vfx.
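Pulling those out should be straightforward; here is a minimal sketch, assuming the codes always take the form `&=` followed by a word on the main tier (the sample line is invented):

```python
import re

# Sketch: extract &=... annotation codes (e.g. &=cries, &=yells) from a
# CHAT main-tier line. Assumes codes are "&=" followed by letters.
CODE_RE = re.compile(r"&=([a-z]+)")

line = "*CHN:\t&=cries &=vocalization xxx ."  # invented example line
print(CODE_RE.findall(line))  # -> ['cries', 'vocalization']
```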

Eric, please confirm that this message is going to everyone in the group.

Thanks,

-- Brian


riebling commented 5 years ago

I looked through all the linked code and could find no pre-trained models, so I can only assume this requires training, and therefore human-labeled data. Isn't this a Python port of the LIUM diarization tool? If so, we could actually import a 'working' LIUM system, the same one that runs in the EESEN Transcriber VM, including pretrained models. It does speaker clustering (or can skip it, which is faster) and gender detection, and it works surprisingly well considering the models it ships with were trained on French broadcast news transcriptions.
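For reference, the off-the-shelf Java tool can be driven from a script along these lines. This is a minimal sketch assuming the jar name and flags from the LIUM quick-start (`--doCEClustering` turns on the slower final clustering stage, so omitting it gives the faster no-clustering run mentioned above); the show name and paths are invented:

```python
import subprocess

# Sketch: run the off-the-shelf LIUM speaker diarization jar on one recording.
# Jar name/flags follow the LIUM quick-start; file names are invented examples.
show = "daylong_clip01"
subprocess.run(
    [
        "java", "-Xmx2048m",
        "-jar", "LIUM_SpkDiarization-8.4.1.jar",
        f"--fInputMask={show}.wav",    # input audio
        f"--sOutputMask={show}.seg",   # diarization output (.seg format)
        "--doCEClustering",            # final clustering; omit for a faster run
        show,
    ],
    check=True,
)
```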

Of course it would be great if we could train a new system on our type of data, with Python code (more popular than Java these days), so this could take a bit more work. I just wanted to point out the "off the shelf" Java version, which used to even be installed by the DiViMe Vagrantfile but was removed because nobody used it :)