juanmc2005 / diart

A python package to build AI-powered real-time audio applications
https://diart.readthedocs.io
MIT License
1.1k stars 90 forks

quality concerns #229

Closed: DmitriyG228 closed this issue 5 months ago

DmitriyG228 commented 10 months ago

It looks like the pipeline quickly forgets previous speakers and assigns wrong tags to new ones, so a conversation of 4-5 people is inferred as a conversation of 2.

I am testing it alongside whisperx, which seems to use the same set of default models but gives better results.

Before diving deeper into debugging, is there anything obvious I could be doing wrong? I tried a non-default embedding model and got the same result.

juanmc2005 commented 9 months ago

@DmitriyG228 you can check out related issues such as #4, #133 and #226, where this was already discussed.