Making the chunk size not constant for real-time audio purposes

Unbreathable commented 7 months ago

Hello, sorry for giving you more work again and creating a new issue since I already made a PR today, but there is a feature missing in your crate that I would like to add by PR, it would be a pretty huge breaking change probably because it would require some huge changes in the code base.

I would like to suggest making the chunk_size a value in the VoiceActivityDetector struct instead of a generic value since it is otherwise impossible to write code like this:

let sample_rate: u32 = default_config.sample_rate().0;
let buffer_size = sample_rate / 16;
let _work_channels = 1; // Stereo doesn't work at the moment (will fix in the future or never)
let mic_channels = default_config.channels();
let config: StreamConfig = StreamConfig {
    channels: mic_channels,
    sample_rate: cpal::SampleRate(sample_rate),
    buffer_size: cpal::BufferSize::Fixed(buffer_size),
};

let mut vad = VoiceActivityDetector::<buffer_size>::try_with_sample_rate(
    default_config.sample_rate().0,
)
.expect("how dare you");

The above code will error because buffer_size isn't constant, and there is no way to make it constant since it has to be computed every time this code runs. Since this code always gets a different sample rate, I would like to be able to compute the chunk_size based on the sample rate. If I just set a constant, it will error with different sample rates as well, just like when going with something as low as 256 as a constant for the chunk_size the code errors when passing in a sample_rate above 44100 (that's what I tested with). It would be cool to be able to compute this on the fly using something like I have shown in the example here. The only way to currently work around this is to have different recorders for different sample rates, meaning you write the same code like 5 times just to get this value to change. If there is another way that fixes this issue, please let me know, I'm kind of new to Rust, so I'm sorry if there's something basic I'm overlooking here.

Thanks for your work and I hope we can get this resolved, Julian.

Unbreathable commented 7 months ago

Just for reference, something like this would also error, because it doesn't have the same types:

let mut vad = if default_config.sample_rate().0 > 32000 {
    VoiceActivityDetector::<512>::try_with_sample_rate(default_config.sample_rate().0)
        .expect("how dare you")
} else {
    VoiceActivityDetector::<256>::try_with_sample_rate(default_config.sample_rate().0)
        .expect("how dare you")
};

Unbreathable commented 7 months ago

I've made a potential draft for this at https://github.com/Liphium/voice_activity_detector_fixed. Adding a DynamicVoiceActivityDetector to fix this particular use case. I just couldn't get the normal API to work with the iterators due to issues with the constant usize for the LabeledAudio struct. Probably just a skill issue, but I don't know. The thing is that this is probably just a temporary fix for something that needs better addressing in the future. I also added a test called wave_file_dynamic_detector to test the thing. Let me know if you want me to make a pull request with this, because this is a pretty big modification and I don't know what vision you have for this library.

nkeenan38 commented 7 months ago

Yeah this is definitely something we can add. Let me take a closer look later today.

The original design of this went far on the spectrum of compile time guarantees. That meant both chunk size and sample rate were const generics and type states ensured that the VAD's were always using a valid configuration. The ergonomics were horrible though when you didn't know your sample rate until runtime. I assumed the chunk size would be known, but it seems not. Since then, I've also relaxed inputs to use iterators instead of arrays.

That being said. Two separate structs may be the best way to go. It would let us optimize the case where we do know all this information up front. But I'd also like to just consider the implications of removing the const generic on the original.

Arrays will become Vecs, but with a known capacity at creation so we won't have a situation where the vec will reallocate. Ultimately, the only real difference is whether these sample chunks are stored on the stack or the heap. And whether they can be Copied. The performance implications are probably minimal and without any benchmarks, I'm not going to assume anything.

So lets see what just removing those const generics would look like. Would you like to take a stab at it and create a PR? If not, I can get around to it tonight.

Unbreathable commented 7 months ago

I've already implemented an extra struct called DynamicVoiceActivityDetector at https://github.com/Liphium/voice_activity_detector_fixed that has a chunk_size value in it. I've not implemented it for iterators and all the other stuff cause I didn't know how far we should go and I think that just the VAD is enough for testing for now. I managed to make this thing work and confirmed that with a test case I added. I'm going to make a PR with this thing now.

nkeenan38 commented 7 months ago

After evaluating, I think one struct without any const generics would be best. We may be sacrificing some possible optimizations with the compile-time size guarantees, but those differences would be incredibly minor. I'd rather keep the surface of this library simple and only expose a single VoiceActivityDetector. Two structs would also require multiple implementations of the iterator and stream extensions (or incomplete features).

In the future, if we wanted a very optimized VAD with these compile time guarantees, it could be added back. I'd rather the default one be the more flexible and user friendly option.

Also, breaking changes in this library are okay for now until the 1.0.0 release is created.

If you'd like to take on the work of removing the const generic and replacing the arrays with Vecs throughout the crate, I'm happy to accept a PR. I'll get around to making the changes tonight otherwise.

Unbreathable commented 7 months ago

Well that's this issue closed then, thanks for your work, you were really fast to implement this!

nkeenan38 / voice_activity_detector

Making the chunk size not constant for real-time audio purposes #9