huggingface / speechbox

Apache License 2.0
ValueError: attempt to get argmin of an empty sequence #28

Open utility-aagrawal opened 11 months ago

utility-aagrawal commented 11 months ago

I am getting the following error when I use the ASR+Diarization pipeline:

image

I understand that it's specific to the input data, so I am uploading my input video as well:

https://drive.google.com/file/d/1icLyL6H1Kx4NQLJoz4gnEJraalK6uqms/view?usp=sharing

It's happening when you are aligning the diarizer timestamps and the ASR timestamps: image

I was able to avoid it using try except but would like to know your opinion on this. Thanks!
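
For reference, NumPy raises this exact message whenever `argmin` is called on a zero-length array, so the traceback suggests that the array of ASR end timestamps was already empty at that point. A minimal reproduction, independent of speechbox (the variable names and values below are illustrative assumptions, not the library's code):

```python
import numpy as np

# Hypothetical stand-in for the ASR end-timestamp array after it has been emptied.
end_timestamps = np.array([])
end_time = 12.5  # arbitrary diarizer segment end, in seconds

try:
    np.argmin(np.abs(end_timestamps - end_time))
except ValueError as err:
    message = str(err)
    print(message)
```

Catching the exception hides the crash, but it does not explain why no ASR chunks were left to match.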

Venkatesh3132003 commented 10 months ago

Facing same issue.

oscarorti commented 8 months ago

I am dealing with that too. Is there any solution or workaround? How does the try/except solution work?

utility-aagrawal commented 8 months ago

@oscarorti Including the following try-except in the diarize.py file should fix it temporarily:

image

oscarorti commented 8 months ago

Great! Thank you, it worked :)

Pikauba commented 7 months ago

Doing so is wrong. The error comes from these lines:

line 167

This phenomenon will be emphasized for longer audio.

Indeed, the speaker assignment algorithm starting at line 148 removes chunks from the Whisper predictions as it goes. As more predictions are removed, each subsequent diarized segment has to pick a Whisper segment that lies further and further from its true position. The discrepancy grows as the process goes on, leading to wrong speaker assignments. The error you faced is just a symptom of this bad assignment.

I believe the alignment between the speaker detection and the Whisper timestamps must be refactored for the result to be accurate.

If you want to verify what I am describing, run something along these lines:

print(np.min(np.abs(end_timestamps - end_time)), end_time, upto_idx)

inside the for loop on a long audio file, and you will see that np.min(np.abs(end_timestamps - end_time)) grows as the loop progresses (in my case, I got time differences of up to 200 seconds).
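
The effect can be illustrated with a toy simulation. This is a simplified sketch of a greedy "match the closest end timestamp, then discard everything up to it" loop, not the code from diarize.py; the function name `greedy_align` and the sample timestamps are made up for illustration.

```python
import numpy as np

def greedy_align(asr_end_times, diarizer_end_times):
    """Return the gap (in seconds) between each diarizer end and its matched ASR end.

    None marks a diarizer segment for which no ASR chunk is left, which is
    the situation where np.argmin would raise on an empty array.
    """
    end_timestamps = np.asarray(asr_end_times, dtype=float)
    gaps = []
    for end_time in diarizer_end_times:
        if end_timestamps.size == 0:
            gaps.append(None)
            continue
        diffs = np.abs(end_timestamps - end_time)
        upto_idx = int(np.argmin(diffs))
        gaps.append(float(diffs[upto_idx]))
        # Discard the matched chunk and everything before it.
        end_timestamps = end_timestamps[upto_idx + 1:]
    return gaps

# Four ASR chunks, five diarizer segments (all values invented).
gaps = greedy_align([5, 10, 15, 20], [4, 6, 9, 30, 31])
print(gaps)  # [1.0, 4.0, 6.0, 10.0, None]
```

In this toy input the gap widens at every step and the last diarizer segment finds nothing left to match, which mirrors both the drift and the empty-sequence error described above.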

2010b9 commented 2 months ago

Hey! Hope I'm not too late for the discussion 😁

I tried your fix and, indeed, the error no longer appears, but I'm seeing some strange behavior. Essentially, the transcription is okay, but the timestamps stop making sense: instead of increasing monotonically, text spoken after the 29 s mark was given timestamps in the (0.0, 28.2) and (0.0, 22.6) ranges (screenshot below).

Screenshot 2024-05-20 at 4 17 59 PM

What causes this? Is it related to @Pikauba's answer?

Do you have a simple way to align the outputs of the ASR pipeline and the diarization pipeline properly once both are available? Also, are there plans to fix this in the source code?

Thanks a lot! 🙂

Pikauba commented 2 months ago

Hello @2010b9, I am not sure I am following here. What have you tried specifically? I have an open pull request with a fix using IoU and optimization. However, the "fix" mentioned in this thread is not the same.

Is your described behaviour coming from the fix here or from my pull request code?

2010b9 commented 2 months ago

Hello! 👋

Sorry, my explanation was confusing. First, I tried the try-except fix mentioned in this thread. The error disappeared, but the timestamps were not correct (you can see that in the image I shared in my previous comment). Then I saw your PR (https://github.com/huggingface/speechbox/pull/35) and tried its code, but I got the same problem: the timestamps were not right.

mantrakp04 commented 2 months ago

lol facing the same issue

Pikauba commented 2 months ago

@2010b9, is the result (image) coming from my fix? If so, I would guess that there is a relative time offset that has to be added to the timestamps.

Pikauba commented 2 months ago

Ok, so my implementation should not be related to your timestamp error, as its only purpose is to match the diarization model's speaker label predictions to the ASR model's predictions. Apart from reading the Whisper timestamps to select the best speaker label (from the diarization model) for each Whisper timestamp interval, it does not interact with the timestamp predictions.
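
To make the idea of a pure matching step concrete, here is a minimal sketch that labels each ASR chunk with the speaker of the diarization segment it overlaps most, measured by intersection over union (IoU). It is not the code from the pull request, and it leaves out the optimization step mentioned earlier. The function names and the dictionary keys (`timestamp`, `speaker`, `start`, `end`) are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def assign_speakers(asr_chunks, diar_segments):
    """Attach a speaker label to each ASR chunk without touching its timestamps."""
    labelled = []
    for chunk in asr_chunks:
        best_label, best_score = None, 0.0
        for seg in diar_segments:
            score = iou(chunk["timestamp"], (seg["start"], seg["end"]))
            if score > best_score:
                best_label, best_score = seg["speaker"], score
        # Chunks that overlap no diarization segment keep speaker=None.
        labelled.append({**chunk, "speaker": best_label})
    return labelled

asr_chunks = [
    {"text": "hello", "timestamp": (0.0, 2.0)},
    {"text": "hi there", "timestamp": (2.0, 5.0)},
    {"text": "bye", "timestamp": (9.0, 10.0)},
]
diar_segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.2},
    {"speaker": "SPEAKER_01", "start": 2.2, "end": 5.0},
]
speakers = [c["speaker"] for c in assign_speakers(asr_chunks, diar_segments)]
print(speakers)  # ['SPEAKER_00', 'SPEAKER_01', None]
```

Because every chunk is scored against every segment, nothing is consumed along the way, so the matching cannot run out of candidates or drift over long audio.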

I don't know where your timestamp error is coming from, but I just tested it and it works well on my side.

Maybe this is related to the model predictions themselves (diarization model or Whisper)? But I am pretty sure that, since what I implemented is only a matching function, it should not create a timestamp bug.

Also, Whisper predicts in chunks of 30 seconds, and it seems like your predictions all fall within that range. Maybe this is just a relative time problem (the absolute values could be inferred from the relative timestamps).
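
If the timestamps really are relative to each 30-second window, converting them could look roughly like the sketch below. It assumes back-to-back, non-overlapping windows and that you know which window each timestamp came from; neither assumption is confirmed for this pipeline, and chunked inference with overlapping strides would need a different offset. The function name `to_absolute` is hypothetical.

```python
def to_absolute(windows, window_s=30.0):
    """Shift window-relative (start, end) pairs onto a single absolute timeline.

    `windows` is a list with one entry per window, each entry being a list
    of (start, end) pairs measured from the start of that window.
    """
    absolute = []
    for index, window in enumerate(windows):
        offset = index * window_s  # assumed: windows are contiguous and equal in length
        for start, end in window:
            absolute.append((offset + start, offset + end))
    return absolute

# The two ranges reported earlier in this thread, treated as consecutive windows.
absolute = to_absolute([[(0.0, 28.2)], [(0.0, 22.6)]])
print(absolute)
```

With these inputs the second range starts at 30 seconds instead of 0, so the timestamps increase again.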

2010b9 commented 2 months ago

@Pikauba No, that exact image is not coming from your fix. But the same thing happened when I used your fix; I just didn't share the image because the problem was the same.

Thanks for your answer! 🙂 I have to check the problem further. This is the first time I'm using this and I haven't looked into the code thoroughly.