liyunlongaaa / NSD-MS2S

CHIME-7/8 diarization champion system: neural speaker diarization using memory-aware multi-speaker embedding with sequence-to-sequence architecture

diarization #5

Open dutchsing009 opened 6 months ago

dutchsing009 commented 6 months ago

Would this algorithm be efficient or even appropriate for diarizing a video like that one, or is it overkill, given that there should be no overlapping speakers at least 99% of the time? If it is a good fit, how should I start? A little guide would be amazing.

liyunlongaaa commented 6 months ago

I think diarizing this kind of animation is relatively easy, and you can do it with the code in this repo. You could also try multi-modal speaker diarization: animation like this comes with obvious lip movement that is easy to capture, so it may work better than audio alone. That is my suggestion.

dutchsing009 commented 6 months ago

Thank you so much, I will try your code and let you know. But do you have any suggestions for multi-modal speaker diarization? In your opinion, what is the best repo that would fit my case? Thanks in advance.

liyunlongaaa commented 6 months ago

Unfortunately, as far as I know, open-source multimodal diarization doesn't work very well right now. And the claim that multimodal diarization is relatively easy is premised on having the corresponding training data; without such data it is not easy either.

liyunlongaaa commented 6 months ago

But I can point you to the latest multimodal diarization SOTA: https://arxiv.org/html/2401.08052v2

liyunlongaaa commented 6 months ago

Here is our team's previous work: https://github.com/mispchallenge/misp2022_baseline/tree/main/track1_AVSD. Although it doesn't work that well, it's one of the few open-source multimodal diarization projects I know of.

dutchsing009 commented 6 months ago

That's okay, thanks for all this info. By the way, I talked to the author of https://arxiv.org/pdf/2312.05730.pdf and he said the most similar work to it is https://github.com/showlab/AVA-AVD. Anyway, I will use your code for starters. One last thing: is there anything I need to do to adapt your code to my use case, like modifying something here or there? What do you think?

liyunlongaaa commented 6 months ago

You should first prepare the training set according to the README: find some open-source English diarization data online, extract fbank features from the audio, and follow the training instructions in the README. If you do not have any background in audio signal processing you may run into some trouble; if so, let me know what you encounter.
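
For reference, here is a minimal sketch of fbank extraction using `torchaudio` (my own illustration, not this repo's official feature pipeline; the mel-bin count, frame length, and frame shift below are assumptions and should be matched to whatever the README specifies):

```python
# Minimal fbank extraction sketch (assumes torchaudio is installed; settings
# here are placeholders, not the repo's exact feature configuration).
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path, num_mel_bins=80):
    # Load audio and resample to 16 kHz if needed
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
        sr = 16000
    # Kaldi-style log-mel filterbank features, shape: (num_frames, num_mel_bins)
    feats = kaldi.fbank(
        waveform,
        num_mel_bins=num_mel_bins,
        frame_length=25.0,   # window length in ms
        frame_shift=10.0,    # hop in ms
        sample_frequency=sr,
    )
    return feats

# Example usage (hypothetical file name):
# feats = extract_fbank("session1.wav")
# print(feats.shape)
```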