Infinite loop in Remote Transcription.

rafaelcsch commented 1 month ago

I am using IPED 4.1.6. In some rare cases, we found that a problematic audio file caused the remote transcription node to crash (using Wav2Vec). The node is then automatically restarted, but when it crashes, the audio file goes back into the queue for a new transcription attempt and causes another node to crash. This turned the entire process into an infinite loop. Is there a max-retries parameter that, if exceeded, would cause the file to be skipped? If not, it might be worth implementing something like this.
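To make the request concrete, here is a minimal sketch of a per-file retry limit. It is written in Python for brevity (the IPED client itself is Java), and every name in it (`MAX_RETRIES_PER_AUDIO`, `ProcessCrashed`, `drain_queue`, `fake_transcribe`) is hypothetical; none of them exist in IPED.

```python
from collections import Counter, deque

MAX_RETRIES_PER_AUDIO = 3  # hypothetical parameter, not an existing IPED option


class ProcessCrashed(Exception):
    """Stand-in for the remote transcription process crashing."""


def drain_queue(paths, transcribe, max_retries=MAX_RETRIES_PER_AUDIO):
    """Transcribe every path, skipping any that fails more than max_retries times."""
    queue = deque(paths)
    attempts = Counter()          # failed attempts per audio file
    results, skipped = {}, []
    while queue:
        path = queue.popleft()
        try:
            results[path] = transcribe(path)
        except ProcessCrashed:
            attempts[path] += 1
            if attempts[path] > max_retries:
                skipped.append(path)   # give up, but keep the path for investigation
            else:
                queue.append(path)     # back in the queue for another attempt
    return results, skipped


def fake_transcribe(path):
    # Simulated node: one "poison" file always crashes the process.
    if path == "bad.opus":
        raise ProcessCrashed(path)
    return "text of " + path


results, skipped = drain_queue(["a.opus", "bad.opus", "b.opus"], fake_transcribe)
```

With a cap like this, the poison file is tried a bounded number of times and then lands in `skipped`, while the remaining files are still transcribed.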

wladimirleite commented 1 month ago

Regardless of the retry limit, it would be nice to have these problematic audio files so that the root cause can be investigated.

lfcnassif commented 1 month ago

Hi @rafaelcsch, could you provide a problematic audio sample so we can try to reproduce?

lfcnassif commented 1 month ago

Answering the question, we have a MAX_CONNECT_ERRORS limit, so the client aborts when there are too many connection errors, possibly because the cluster is down.

The transcription algorithm receives raw audio features, and I thought passing them through a neural network shouldn't cause errors; it would just return crazy results. But I don't know the implementation details, and if the input vectors have feature values outside the expected ranges, it may indeed cause issues...

Anyway, we would need a triggering sample to test a possible implementation of the new retry limit.

lfcnassif commented 1 month ago

@rafaelcsch, I've just remembered this old issue we faced more than 1.5 years ago, related to sudden reboots while transcribing some audios: https://github.com/pytorch/pytorch/issues/3022

It seems to be a hardware issue, not a software one. Instead of replacing power supply units, as many comments suggested, our solution was this: https://github.com/pytorch/pytorch/issues/3022#issuecomment-1249795228

I guess you are experiencing a similar issue. Unfortunately, without a sample to try to reproduce, there is nothing we can do, so I'm closing this as can't reproduce. If you can share a triggering audio sample, please reopen.

lfcnassif commented 1 month ago

PS: It used to happen in our nodes with valid audios from a very specific small data set, while it didn't happen with our large test set.

lfcnassif commented 1 month ago

@rafaelcsch just to confirm, by crash you mean a machine reboot, right?

rafaelcsch commented 1 month ago

Sorry, the audio file is from a sensitive case, so I need to verify if it's possible to share it. Regarding the crash, it's not a reboot; only the process crashes and automatically restarts the listening connections. In the client, the log file contains:

2024-05-03 00:07:36 [ERROR] [task.transcript.RemoteWav2Vec2TranscriptTask] Error 1 in communication with 172.21.43.130:5005: Exception while transcribing: iped.engine.task.transcript.ProcessCrashedException: External transcription process crashed.. The audio will be retried.
2024-05-03 00:07:36 [WARN] [task.transcript.RemoteWav2Vec2TranscriptTask] Network error communicating to server: 172.21.43.130:5005, retrying audio: FAV1772663-SAMSUNG G780G_Relatório.ufdr/samsung_SM-G780G.zip/data/media/0/Android/media/com.whatsapp/WhatsApp/Media/WhatsApp Voice Notes/202247/PTT-20221116-WA0037.opus
java.net.SocketException: Exception while transcribing: iped.engine.task.transcript.ProcessCrashedException: External transcription process crashed.
    at iped.engine.task.transcript.RemoteWav2Vec2TranscriptTask.transcribeAudio(RemoteWav2Vec2TranscriptTask.java:272) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.transcript.AbstractTranscriptTask.process(AbstractTranscriptTask.java:399) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.transcript.AudioTranscriptTask.process(AudioTranscriptTask.java:41) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:277) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:192) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.sendToNextTask(AbstractTask.java:225) [iped-engine-4.1.6.jar:?]
    at iped.engine.task.AbstractTask.processAndSendToNextTask(AbstractTask.java:205) [iped-engine-4.1.6.jar:?]
    at iped.engine.core.Worker.process(Worker.java:177) [iped-engine-4.1.6.jar:?]
    at iped.engine.core.Worker.run(Worker.java:265) [iped-engine-4.1.6.jar:?]

We are currently using the faster-whisper algorithm instead of wav2vec2 (similar to the workaround described in #1335, but with GPU). After some tests, we found that the issue does not occur with the original wav2vec2, indicating a potential error in faster-whisper that requires investigation.

Regarding MAX_CONNECT_ERRORS, I was concerned that it might not be respected, as the log contains more than 800 errors (the same crash error) related to a single file, 'PTT-20221116-WA0037.opus'.
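For anyone who wants to check their own logs for the same pattern, the per-file count can be obtained with a short script. This is a rough sketch in Python that assumes each log entry sits on its own line and follows the WARN format shown in the excerpt above; the sample paths are shortened, made-up stand-ins, and `retries_per_audio` is a hypothetical helper.

```python
import re
from collections import Counter

# Captures everything after "retrying audio: " up to the end of the line.
RETRY_RE = re.compile(r"retrying audio: (.*)$")


def retries_per_audio(log_lines):
    """Count how many times each audio path appears in a 'retrying audio' line."""
    counts = Counter()
    for line in log_lines:
        match = RETRY_RE.search(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


PREFIX = ("2024-05-03 00:07:36 [WARN] [task.transcript.RemoteWav2Vec2TranscriptTask] "
          "Network error communicating to server: 172.21.43.130:5005, retrying audio: ")

# Illustrative log lines only; the paths are invented for the example.
sample = [
    PREFIX + "case/WhatsApp Voice Notes/PTT-20221116-WA0037.opus",
    "java.net.SocketException: Exception while transcribing: ...",
    PREFIX + "case/WhatsApp Voice Notes/PTT-20221116-WA0037.opus",
    PREFIX + "case/WhatsApp Voice Notes/PTT-20221117-WA0001.opus",
]

counts = retries_per_audio(sample)
worst_path, worst_count = counts.most_common(1)[0]
```

`counts.most_common()` then lists the files with the most retries first, which makes a single file looping hundreds of times easy to spot.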

lfcnassif commented 1 month ago

> Sorry, the audio file is from a sensitive case, so I need to verify if it's possible to share it.

You can share it privately with me if that is acceptable; my email is in my profile. If not, maybe you can try to create an artificial audio file with similar characteristics that triggers the issue and contains no sensitive info, so that it could be shared.

> So, we are currently using the faster-whisper algorithm instead of wav2vec2 (similar to the workaround described in #1335, but with GPU). After some tests, we found that the issue does not occur with the original wav2vec2, indicating a potential error in faster-whisper that requires investigation.

So could you share your code? How much memory does your GPU have? Maybe it is being caused by OOM... Does it happen when using the CPU? If not, maybe it is a version mismatch among your CUDA driver, CUDA Toolkit and PyTorch with CUDA support.

> Regarding MAX_CONNECT_ERRORS, I was concerned that it might not be respected, as the log contains more than 800 errors (same error about crash) related to a single file, 'PTT-20221116-WA0037.opus'

MAX_CONNECT_ERRORS is used to check whether the whole cluster is down or unreachable. Whenever a request is accepted, the counter is reset. It aims to tolerate temporary network issues.
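The reset-on-accept behaviour described above can be modelled in a few lines to show why such a counter would never trip on a single crashing file. This is a simplified Python model, not the actual Java client code: the limit value, the event names and the `client_aborts` function are all assumptions made for illustration.

```python
MAX_CONNECT_ERRORS = 5  # illustrative value; the real default is not stated in this thread


def client_aborts(events, limit=MAX_CONNECT_ERRORS):
    """Return True once `limit` errors occur with no accepted request in between."""
    errors = 0
    for event in events:
        if event == "error":
            errors += 1
            if errors >= limit:       # assumed comparison; the real check may differ
                return True
        elif event == "accepted":
            errors = 0                # cluster is reachable, so the counter starts over
    return False


# Cluster down: every attempt fails before any request is accepted.
cluster_down = ["error"] * MAX_CONNECT_ERRORS

# Poison file: each request is accepted, then the transcription process crashes.
poison_file = ["accepted", "error"] * 800

aborts_when_down = client_aborts(cluster_down)
aborts_on_poison = client_aborts(poison_file)
```

In this model the counter never rises above 1 for the poison file, which would be consistent with the "Error 1 in communication" message in the log, so a separate per-file limit would be needed to break the loop.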

sepinf-inc / IPED

Infinite loop in Remote Transcription. #2201