It looks like the pipeline quickly forgets previous speakers and assigns wrong tags to new ones, so a conversation between 4-5 people is inferred as a conversation between 2.
I am testing alongside whisperx, which seems to use the same set of default models but gives better results.
Before diving deeper into debugging, is there anything obvious I could be doing wrong? I tried a non-default embedding model with the same result.
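
One way to put numbers on the difference between the two outputs is to count the distinct speaker labels and the total speaking time per label. This is only a minimal sketch: it assumes each tool's result has already been converted to a list of (start, end, speaker) tuples, and the function name and sample data are hypothetical, not real output from either tool.

```python
from collections import defaultdict


def summarize_speakers(segments):
    """Return total speaking time in seconds for each speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical sample data for illustration only.
segments = [
    (0.0, 4.0, "SPEAKER_00"),
    (4.0, 9.0, "SPEAKER_01"),
    (9.0, 12.0, "SPEAKER_00"),
]

summary = summarize_speakers(segments)
# Number of distinct labels, then seconds of speech per label.
print(len(summary), summary)
```

Running the same summary on both outputs for the same recording would show whether speakers are being merged into a few labels or dropped altogether.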