huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0
15.05k stars 878 forks source link

tracking: support silero-vad v5 #2319

Closed shua closed 2 weeks ago

shua commented 2 months ago

This ticket tracks my efforts to get silero-vad v5 running with candle. I have a proof-of-concept branch working, but I'm trying to merge the changes in meaningful bits.

features:

bugs:

shua commented 2 months ago

@LaurentMazare you asked in https://github.com/huggingface/candle/pull/2251#issuecomment-2153366138 whether I had a vad example. I've gotten it working locally after fixing the above bugs or implementing the above features. #2321 would be the final culmination of these tickets, which I can open for review once the other tickets have been merged.

shua commented 2 months ago

I don't know if you're interested, but as part of this work I have a local test harness which will run an onnx model in candle-onnx and ort and then if they differ in output, it will find specifically which node they differ. That's how I was able to find the bugs listed above, and it was a lot quicker than trying to make pytorch examples.

I can try to PR that if you'd like but not sure if it should be in its own package or some feature-gate on candle-onnx to avoid depending on ort in general.

shua commented 1 month ago

ping @LaurentMazare is there anything I can do to make these PRs easier to review?

LaurentMazare commented 3 weeks ago

Sorry for the delay, the LSTM part looked good so I just merged it. Re the two bug fixes, I'm a bit less enclined to do so as they are currently conscious restrictions of the pow and select operators. Are these actually needed for this model?

shua commented 3 weeks ago

The specific use in silero models is index by -1 and raising tensor (which contains negative numbers) to a scalar power of 2. I could special case fixes in the cpu backend or onnx simple eval to catch these cases and do something equivalent (eg repeated mul instead of pow), but I suspect that would lead to more bugs filed like "why can I index_select with -1 but not -2?"

I also suspect (but cannot confirm) that this behaviour is already correct in the gpu backends but only broken in the cpu backend.

shua commented 3 weeks ago

Also, if I knew why you are avoiding using the system's pow builtins, I could try to accommodate that?

LaurentMazare commented 3 weeks ago

The goal is mainly to reduce the number of operations that each backend must implement, hence the binary pow implemented with exp/log at the moment. Note that there is a unary pow operation powf that takes as input the same exponent for all the elements of the tensor and this one is the most likely to be used with negative inputs though I agree it doesn't cover all possible use cases.

shua commented 3 weeks ago

huh, I think that is this usecase, but it looks like onnx simpl_eval simply calls broadcast_pow when evaluating the "Pow" node type. I could instead add a check for scalar power and branch on that whether it calls powf vs broadcast_pow in the onnx code, or I could add a branch in broadcast_pow to fallback to powf if conditions match. I'm leaning towards keeping hacks in the edges at simple_eval rather than deeper.

LaurentMazare commented 3 weeks ago

Yeah that sounds right to me, optimizing nodes that are Pow with the second arg being a Constant using a single value is likely to cover most use cases and would keep the hacks in simple_eval or around.

shua commented 2 weeks ago

Okay, found a similarly focused fix for the index_select issue. Once #2440 is merged, I can open #2321 to add a vad example.

LaurentMazare commented 2 weeks ago

I can open #2321 to add a vad example.

Yes that seems like a great example to include, especially as we don't have much audio examples yet.

louis030195 commented 2 weeks ago

@shua great addition, we use web rtc vad atm and will try soon https://github.com/mediar-ai/screenpipe/issues/241

wonder if the silero model itself is better than whisper? we have lot of issues with whisper (distil large v3) "thank you" and stuff like this when nobody speaks (but prob the silero vad will solve this) because we record all audio devices 24/7

shua commented 5 days ago

I can't speak to the quality of the model. I came upon silero-vad because it's an optional part of https://github.com/rhasspy/wyoming-satellite to put in front of openwakeword or whisper. More generally, voice activity detection (vad) is solving different problems from speech to text like whisper, so I don't know if it makes sense to compare the two.