srvk / DiViMe

ACLEW Diarization Virtual Machine
Apache License 2.0

incorporate s4d #29

Open alecristia opened 6 years ago

alecristia commented 6 years ago
riebling commented 5 years ago

does this include already-trained models? otherwise there might be more steps:

  - decide on training data
  - human label the data
  - get training data into s4d format
  - perform training to create models

I know the LIUM system comes with pretrained models derived from - is it French or Estonian? - broadcast news
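For the "get training data into s4d format" step, the conversion could look something like the sketch below. This assumes s4d still reads the LIUM-style `.seg` format (one line per segment: show name, channel, start and length counted in 10 ms frames, gender, band, environment, speaker label); the `segments` input, the field defaults, and the file names are illustrative, not taken from the s4d docs:

```python
# Minimal sketch: write human-labeled segments as a LIUM/s4d-style .seg file.
# Assumes the classic LIUM line format:
#   show channel start length gender band environment speaker
# with start/length in 10 ms frames. The "U" (unknown) defaults are placeholders.

def write_seg(show, segments, path):
    """segments: list of (onset_sec, offset_sec, speaker_label) tuples."""
    with open(path, "w") as out:
        for onset, offset, speaker in segments:
            start = int(round(onset * 100))            # seconds -> 10 ms frames
            length = int(round((offset - onset) * 100))
            out.write(f"{show} 1 {start} {length} U U U {speaker}\n")

# Hypothetical example: three segments from a human-labeled clip
write_seg("VanDam_clip01",
          [(0.0, 2.5, "FAN"), (2.5, 4.1, "CHN"), (4.1, 6.0, "SIL")],
          "VanDam_clip01.seg")
```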

macw commented 5 years ago

Eric, I am not sure what "this" refers to. To improve DiViMe performance, we want new training data with good human labels. My suggestion is to improve the CHAT data found at https://homebank.talkbank.org/access/Public/VanDam-5minute.html

For purposes of developing a workflow, we could start without any actual improvements by making sure there is a smooth way to go from CHAT to s4d and then to training, etc. The only trick is that we need to make sure we preserve all the codes used by LENA. Specifically, we need the list in the desiderata.doc file we created over a year ago; it is in the diarization folder in the SpeechKitchen Google Drive folder. Here is what it lists:

  1. Adult male
  2. Adult female
  3. Child wearing recorder
  4. Other child in environment (not wearing recorder)
  5. Overlaps of the above 4
  6. TV/electronic/radio
  7. Noise
  8. Silence (or noise below some threshold, maybe 30 dB or something)
  9. Garbage/unknown/grab-bag (hopefully very small category)
  10. Possibly identify &=cries, &=yells etc.

Most of these are based on the speaker labels for each utterance, which are given in the CHAT headers. The @Participants header always includes:

  - SIL Silence LENA
  - MAN Male_Adult_Near Male
  - MAF Male_Adult_Far Male
  - FAN Female_Adult_Near Female
  - FAF Female_Adult_Far Female
  - CHN Key_Child_Clear Target_Child
  - CHF Key_Child_Unclear Target_Child
  - CXN Other_Child_Near Child
  - CXF Other_Child_Far Child
  - NON Noise_Near LENA
  - NOF Noise_Far LENA
  - OLN Overlap_Near LENA
  - OLF Overlap_Far LENA
  - TVN Electronic_Sound_Near Media
  - TVF Electronic_Sound_Far Media

However, we probably want to merge the Near and Far types, since they are distinguished only by dB level.
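As a concrete illustration of that merge, here is a minimal sketch of a mapping from the LENA tier codes above to merged categories; the category names are placeholders, not an agreed-upon labeling scheme:

```python
# Sketch: collapse LENA Near/Far tier codes into merged categories
# corresponding to the desiderata list. Category names are illustrative.
LENA_TO_CATEGORY = {
    "MAN": "adult_male",    "MAF": "adult_male",
    "FAN": "adult_female",  "FAF": "adult_female",
    "CHN": "key_child",     "CHF": "key_child",
    "CXN": "other_child",   "CXF": "other_child",
    "OLN": "overlap",       "OLF": "overlap",
    "TVN": "electronic",    "TVF": "electronic",
    "NON": "noise",         "NOF": "noise",
    "SIL": "silence",
}

def merged_label(lena_code):
    # Unknown codes fall into the garbage/grab-bag category (item 9).
    return LENA_TO_CATEGORY.get(lena_code, "garbage")
```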

The other trick is for #10, which is not in the headers but in codes such as &=cries inside each line. We need to check what the full list of these is; I know it includes crying, vocalization, yells, and vfx.
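Pulling those out should be straightforward; here is a minimal sketch, assuming the codes always take the form `&=` followed by a word on the main tier (the sample line is invented):

```python
import re

# Sketch: extract &=... annotation codes (e.g. &=cries, &=yells) from a
# CHAT main-tier line. Assumes codes are "&=" followed by letters.
CODE_RE = re.compile(r"&=([a-z]+)")

line = "*CHN:\t&=cries &=vocalization xxx ."  # invented example line
print(CODE_RE.findall(line))  # -> ['cries', 'vocalization']
```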

Eric, please confirm that this message is going to everyone in the group.

Thanks,

-- Brian


riebling commented 5 years ago

I looked through all the linked code and could find no pre-trained models, so I can only assume this requires training, and therefore human-labeled data. Isn't this a Python port of the LIUM diarization tool? If so, we could actually import a 'working' LIUM system, the same one that runs in the EESEN Transcriber VM, including pretrained models. It does speaker clustering (or can skip it, which is faster) and gender detection, and it works surprisingly well considering the models it ships with were trained on French broadcast news transcriptions.
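For reference, the off-the-shelf Java tool can be driven from a script along these lines. This is a minimal sketch assuming the jar name and flags from the LIUM quick-start (`--doCEClustering` turns on the slower final clustering stage, so omitting it gives the faster no-clustering run mentioned above); the show name and paths are invented:

```python
import subprocess

# Sketch: run the off-the-shelf LIUM speaker diarization jar on one recording.
# Jar name/flags follow the LIUM quick-start; file names are invented examples.
show = "daylong_clip01"
subprocess.run(
    [
        "java", "-Xmx2048m",
        "-jar", "LIUM_SpkDiarization-8.4.1.jar",
        f"--fInputMask={show}.wav",    # input audio
        f"--sOutputMask={show}.seg",   # diarization output (.seg format)
        "--doCEClustering",            # final clustering; omit for a faster run
        show,
    ],
    check=True,
)
```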

Of course it would be great if we could train a new system on our type of data, with Python code (more popular than Java these days), so this could take a bit more work. I just wanted to point out the "off the shelf" Java version, which used to even be installed by the DiViMe Vagrantfile but was removed because nobody used it :)