kaixxx / noScribe

Cutting-edge AI technology for automated audio transcription. A nice GUI for OpenAI's Whisper and pyannote (speaker identification)
GNU General Public License v3.0

Voice Differentiation Timing #10

Closed Quidam2k closed 1 year ago

Quidam2k commented 1 year ago

I've noticed that the software's ability to distinguish different voices is sometimes inconsistent. I'll notice ">>" in the text and can tell it's meant to indicate a change of speaker, but it appears in the middle of one speaker's chunk of text. Here's an example from the D&D game I was transcribing:

DAVIAN: And the history behind it is that for the lost clan of Dragonborn that Davian's a part of, they followed that same constellation and told stories, or at least his mom told stories, that in following it they would find the place where the raven was safe from both the sun and the rocks and the stars, referencing the fact of how close they were to the spellbook. So it's something that whether or not it's real, he doesn't know, but he sees it

DM: in there and wants to keep that story going. >> Nice. Have a luck point. And Michael, you're brushing up against your mic occasionally. It's kind of making a scuffy sound. If there's something you're doing, stop it. Trappy, what would your answer be for Lazarus? What constellation would Lazarus put up? >>

LAZARUS: Lazarus would probably create an octanary system, a system of eight pulsars that all revolve around each other. Because if you set them up so they have different speeds, no matter where you were in the galaxy, you'd be able to determine where you are and locate yourself. So this would be like the North Star, but the North Star for the entire galaxy, basically anywhere you were, as long as the star pattern existed in your time period.

DM: >> Nice. Kind of like how the moons of Jupiter can be used to tell time sort of a thing. Awesome. Have a luck point. Moving on to Nellie Tubsong. Michael, what constellation would Nellie

NELLY: put in the sky? >> Nellie would put a giant clawfoot tub in the sky. >> Oh. >> In honor of Mama Tubsong's traveling tubs. >> Nice. >> But legend has it that all Tubsongs were conceived in. Not born in, conceived in.

DM: >> Oh, sure, sure. Yes. Have a luck point. I think that you will find yourself not alone in your choice. Because I was playing with a fairy last night, and she only plays the one character. T-Fairy, what would Iree put in the sky? >> Oh, yes. No, in fact, it's probably the

IRYI: very same one that he sees in the sky is Iree's bathtub, which I can pose a picture of it in the thing, and I will in a moment.

kaixxx commented 1 year ago

Interesting, I've never seen ">>" in transcriptions. The Whisper AI is trained on many hours of transcribed audio data. For the most part, these are videos with subtitles. So Whisper sometimes produces weird artifacts that might have been present in some of its training data - like this ">>" notation. Anyhow: I have just released a new version of noScribe (0.3) that also tries to improve speaker separation, especially in cases like yours with quick changes. It's not perfect, but you might give it a try. The basic principle is to look at smaller chunks of audio to get a finer-grained speaker separation. You can reduce the chunk size even further by changing the "max-len" value in the advanced options (see the readme on how to change these options).
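To illustrate the principle (this is just a sketch of the general idea, not noScribe's actual code, and the function name is hypothetical): any span longer than "max-len" seconds gets cut into smaller chunks, and each chunk is matched to a speaker on its own, so a mid-span speaker change is less likely to be swallowed.

```python
def split_segment(start, end, max_len):
    """Split a [start, end] time span (seconds) into chunks of at most
    max_len seconds, so each chunk can be assigned a speaker separately."""
    chunks = []
    t = start
    while t < end:
        chunks.append((t, min(t + max_len, end)))
        t += max_len
    return chunks

# A 7-second span with max_len=3 becomes three chunks:
print(split_segment(10.0, 17.0, 3.0))
# [(10.0, 13.0), (13.0, 16.0), (16.0, 17.0)]
```

A smaller max-len gives more chunks and thus more chances to catch a quick speaker change, at the cost of more lookups and potentially choppier output.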

Quidam2k commented 1 year ago

I've been working with it more, and I'm seeing "-" being used in a similar fashion: Whisper clearly caught the change in speaker, but the speaker separator didn't until the next chunk began. I'll try further reducing the "max-len" value, but I'm also wondering if it might be possible to use those characters, when they're there, to help the speaker-ID code do a better job. Seems a shame to waste that data.

kaixxx commented 1 year ago

Interesting observations. I think, though, that this is not consistent enough to be used in the program, but we will see. The speaker separation needs more work, I agree with that. The problem is not so much pyannote (the "speaker separator") but the fact that Whisper does not produce very precise timestamps. This makes the synchronisation between pyannote and Whisper difficult, especially if speakers change quickly. If you are interested in the raw output of pyannote, see the log file for your transcript (last bullet point in Advanced options).
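For anyone curious what that synchronisation problem looks like: a common way to align the two outputs is to assign each Whisper segment to whichever pyannote speaker turn it overlaps the most. A minimal sketch (the helper names are made up for illustration, this is not noScribe's internal code):

```python
def overlap(a, b):
    """Length of the temporal overlap between two (start, end) spans."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_speaker(segment, turns):
    """Pick the speaker whose diarization turn overlaps the segment most."""
    best, best_ov = None, 0.0
    for speaker, turn in turns:
        ov = overlap(segment, turn)
        if ov > best_ov:
            best, best_ov = speaker, ov
    return best

# Two diarization turns; a Whisper segment straddling the boundary
# gets assigned to the speaker it overlaps more (1.8 s vs. 0.4 s):
turns = [("SPEAKER_00", (0.0, 4.2)), ("SPEAKER_01", (4.2, 9.0))]
print(assign_speaker((3.8, 6.0), turns))  # SPEAKER_01
```

When Whisper's segment boundaries drift by a second or so, the overlap comparison can tip the wrong way, which is exactly why quick speaker changes are the hardest case.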

Quidam2k commented 1 year ago

Yeah, I kind of figured it wouldn't be a simple matter. I'll keep playing with the max_length value and see if I can bring it closer to true.

It might be worthwhile instead to write a script that goes through a noScribe-generated text file looking for a supplied delimiter character and moves the indicated text to the next line break after the colon. If I give it a go and have any luck, I'll let you know.
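Something like this might work as a starting point (just a sketch; the delimiter and function name are placeholders, and it only splits lines, leaving the speaker re-labeling to a human pass):

```python
def split_on_delimiter(text, delim=">>"):
    """Break each line at the delimiter and put the trailing text on its
    own line, so it can be reattached to the correct speaker afterwards."""
    out = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(delim) if p.strip()]
        out.extend(parts)
    return "\n".join(out)

sample = "DM: in there. >> Nice. Have a luck point. >>"
print(split_on_delimiter(sample))
# DM: in there.
# Nice. Have a luck point.
```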

kaixxx commented 1 year ago

looking for a supplied delimiter character and move the indicated text to the next linebreak

You might be able to achieve this with a clever use of search & replace in Word (in multiple steps; you can also turn this into a macro).

kaixxx commented 1 year ago

I have just released vers. 0.4b with a much improved speaker separation. Give it a try. If you still have problems, please open a new issue.