Long recordings with an unknown number of speakers

NormanTUD commented 3 years ago

Hi, this is a great project that I've been waiting for for quite some time, and it works exceptionally well. So first of all, thanks for that.

But I want to achieve something quite complex, I guess.

I have long audio recordings in which several people speak, and I do not know in advance how many there are.

My problems:

I do not have a big GPU or much RAM, so I need to split the recordings into chunks. But then I lose information. Say 3 people speak: from 00:00:00 to 00:02:00 (the chunk size that works on my computer) only the first 2 speak (let's call them speaker 1 and speaker 2), and in the next chunk only speaker 2 and speaker 3 speak. Those two would then again be labeled just "Speaker 1" and "Speaker 2", since I cannot find a way to carry information about previous utterances of different speakers over to the new run. Is there any way to do that?
I do not know how many speakers there are or when exactly they spoke. I don't care about giving them names; I only want a list like "speaker 1 spoke from ... to ... and from ... to ... and so on, speaker 2 from ... to ... and so on, ..., and speaker n from ... to ... and so on".

Is this somehow possible with this tool? I'm by no means an expert and don't know how to tinker with the source code properly to achieve that, but it would be of immense help to have that ability, since Resemblyzer is the only diarization tool I've come across that really works as expected.

milind-soni commented 3 years ago

Hey! Did you find any solution to this problem?

NormanTUD commented 3 years ago

No, I have not yet found a solution. Sorry.
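
For what it's worth, here is one possible direction, offered as a rough sketch rather than anything confirmed in this thread or provided by Resemblyzer. The idea is to keep only a small per-speaker embedding (a running mean) between chunks, match each new window embedding against those by cosine similarity, and open a new speaker whenever nothing is similar enough. That way the number of speakers does not need to be known in advance and the audio itself never has to be held in memory all at once. Everything named below (`SpeakerRegistry`, `diarize_chunks`, the `0.75` threshold, the toy 2-D embeddings) is hypothetical. Real embeddings would have to come from elsewhere, for example from the partial embeddings that, as far as I know, `VoiceEncoder.embed_utterance(wav, return_partials=True)` returns, and the threshold would need tuning on real data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SpeakerRegistry:
    """Hypothetical helper: one running-mean embedding per speaker,
    kept alive across chunks so labels stay consistent."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold  # assumed value, needs tuning
        self.centroids = []
        self.counts = []

    def assign(self, emb):
        """Return the index of the closest known speaker, or add a new one."""
        best, best_sim = None, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(emb, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            self.centroids.append(list(emb))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Update the running mean of the matched speaker.
        n = self.counts[best]
        self.centroids[best] = [
            (c * n + e) / (n + 1) for c, e in zip(self.centroids[best], emb)
        ]
        self.counts[best] = n + 1
        return best


def diarize_chunks(chunks, registry):
    """chunks: iterable of lists of (start_s, end_s, embedding), in time order.
    Returns {speaker_index: [[start_s, end_s], ...]}."""
    segments = {}
    for windows in chunks:
        for start, end, emb in windows:
            spk = registry.assign(emb)
            spans = segments.setdefault(spk, [])
            # Merge with the speaker's previous span if adjacent or overlapping.
            if spans and start <= spans[-1][1] + 1e-6:
                spans[-1][1] = max(spans[-1][1], end)
            else:
                spans.append([start, end])
    return segments


# Toy data: two chunks, three speakers, made-up 2-D "embeddings".
chunk1 = [(0.0, 1.0, [1.0, 0.0]), (1.0, 2.0, [0.0, 1.0])]
chunk2 = [(2.0, 3.0, [0.1, 1.0]), (3.0, 4.0, [-1.0, 0.2])]

registry = SpeakerRegistry()
result = diarize_chunks([chunk1, chunk2], registry)
print(result)  # {0: [[0.0, 1.0]], 1: [[1.0, 3.0]], 2: [[3.0, 4.0]]}
```

In the toy run, the second speaker of chunk 1 is recognized again at the start of chunk 2 and keeps the same index, while the unseen voice gets a new one. A greedy scheme like this is sensitive to the threshold and to the order of the windows, so clustering all stored window embeddings in a second pass may give cleaner results.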

resemble-ai / Resemblyzer

Long recordings with an unknown number of speakers #62