taylorlu / Speaker-Diarization

speaker diarization by uis-rnn and speaker embedding by vgg-speaker-recognition
Apache License 2.0
464 stars · 121 forks

Slow performance? #20

Open chrisspen opened 4 years ago

chrisspen commented 4 years ago

How long should speakerDiarization.py take to run on a typical system with a GPU?

I ran the speakerDiarization.py example and it segmented the file correctly. However, it's very slow. It takes about twice the length of the wav file to run, which makes it impractical to run on large files. For example, the sample file is about 2 minutes long, and it took 4 minutes to process. I also tested it with a custom wav file that was 3 minutes long, and it took 6 minutes to run.

My system is a 2.80GHz quad-core with 32GB of memory.

Is this the typical processing time? Is there any way to speed up processing?

taylorlu commented 4 years ago

I'm afraid you'll need to reimplement uis-rnn yourself if you need a speedup. The shortcomings of the original uis-rnn are also obvious, as you figured out in #16.

chrisspen commented 4 years ago

I guess I'm more curious what typical performance is like on a system with a GPU, since I don't have one and the code is heavily dependent on Torch and Tensorflow, which are both optimized for GPUs. I don't have any systems on hand built for high-performance computing, so I can't test it myself. Do you have any reference numbers to share? How long does it take to process the rmdmy.wav file on a system with a GPU?

I'd like to do more testing, and I'm trying to decide whether I should invest in a GPU, but I don't want to waste my time if it's only a minor speedup.

taylorlu commented 4 years ago

Sorry, I haven't tested it on a GPU. However, you can adjust the parameters in ghostvlad, such as hop_length, to reduce the number of pieces the whole wav file is split into.
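(Editor's note: a rough sketch of why hop_length drives the piece count. The frame-count formula and the 16 kHz sample rate are standard STFT conventions, not values taken from the ghostvlad code, so treat the numbers as illustrative only.)

```python
# Illustrative only: how the STFT hop_length controls how many frames
# (and therefore how many embedding segments) a wav file produces.
# Doubling hop_length roughly halves the downstream work.

def num_frames(num_samples, hop_length):
    """Approximate STFT frame count for a signal of num_samples."""
    return 1 + num_samples // hop_length

sr = 16000                  # assumed sample rate
duration_s = 120            # a 2-minute wav, like the demo file
samples = sr * duration_s

print(num_frames(samples, hop_length=160))  # 12001 frames
print(num_frames(samples, hop_length=320))  # 6001 frames -- about half
```

The trade-off, of course, is that a larger hop gives coarser time resolution per frame.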

chrisspen commented 4 years ago

Wouldn't increasing the hop_length effectively reduce the resolution at which a unique speaker can be detected, and result in more errors?

I just rented a GPU EC2 instance with 15GB of GPU memory and ran a test there, and found the code runs about 3.4x faster than on a CPU-only system.

chrisspen commented 4 years ago

Do you have any thoughts on how to translate the speaker labels from one run to another? I'm thinking that instead of running uisrnn on the whole file, I'll split it into parts and run it on each part in parallel. That would drastically speed it up. You'd then stitch the speaker labels together to get the complete speaker segmentation. The only problem is correlating the different speaker labels, since each run would potentially refer to completely different speakers.
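(Editor's note: one hedged way to sketch this stitching idea is to represent each per-chunk speaker by the mean of its embeddings, then greedily merge labels across chunks by cosine similarity. The function names and the 0.8 threshold are made up for illustration; nothing here comes from the repo.)

```python
# Hypothetical label-stitching sketch: map each chunk's local speaker
# labels onto shared global labels by comparing mean embeddings.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def stitch(chunk_speakers, threshold=0.8):
    """chunk_speakers: list of dicts {local_label: mean_embedding}.
    Returns one dict per chunk mapping local labels to global labels."""
    global_embs = {}              # global_label -> representative embedding
    mappings = []
    for speakers in chunk_speakers:
        mapping = {}
        for local, emb in speakers.items():
            # find the most similar existing global speaker, if any
            best, best_sim = None, threshold
            for g, g_emb in global_embs.items():
                sim = cosine(emb, g_emb)
                if sim > best_sim:
                    best, best_sim = g, sim
            if best is None:      # no match: allocate a new global speaker
                best = len(global_embs)
                global_embs[best] = emb
            mapping[local] = best
        mappings.append(mapping)
    return mappings

# Toy usage: chunk 2's local speaker 0 lines up with chunk 1's speaker 1.
chunk1 = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
chunk2 = {0: np.array([0.0, 0.9]), 1: np.array([-1.0, 0.1])}
print(stitch([chunk1, chunk2]))   # [{0: 0, 1: 1}, {0: 1, 1: 2}]
```

A greedy pass like this can drift on long files; a more careful version would do a global assignment (e.g. Hungarian matching) over all chunk pairs.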

taylorlu commented 4 years ago

I think that adds more complexity, since you'd have to compute the similarity between the speaker segmentations produced by the different parallel threads. Also, uis-rnn seems to require storing the speaker IDs it has processed before, since it uses a ddCRP (distance-dependent Chinese restaurant process) model.
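(Editor's note: to see why the model depends on previously seen speaker IDs, here is a toy of the plain Chinese restaurant process prior — a simpler cousin of the ddCRP the comment mentions, not uis-rnn's actual code. The probability that a new segment joins an existing speaker grows with that speaker's segment count, so splitting the file breaks the shared counts.)

```python
# Toy CRP prior (illustrative; uis-rnn uses a distance-dependent variant).
# counts[i] = number of segments already assigned to speaker i;
# alpha controls how readily a brand-new speaker is created.

def crp_probs(counts, alpha=1.0):
    total = sum(counts) + alpha
    existing = [c / total for c in counts]   # P(join existing speaker i)
    new = alpha / total                      # P(create a new speaker)
    return existing, new

existing, new = crp_probs([5, 3], alpha=1.0)
# speaker A: 5/9, speaker B: 3/9, new speaker: 1/9
```

Two parallel chunks would each start from empty counts, which is exactly why their labels can't be compared without an extra similarity step.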