juanmc2005 / diart

A python package to build AI-powered real-time audio applications
https://diart.readthedocs.io
MIT License
903 stars 76 forks

Add support for powerset segmentation models #198

Closed hbredin closed 8 months ago

hbredin commented 8 months ago

Addresses #186.

Note that this is a first (working) attempt that still needs some love. Hence the draft status...

As a bonus, you get the first (?) walrus operator of diart, yay!

juanmc2005 commented 8 months ago

@hbredin as you mentioned in #186, I would also prefer to have a single instantiation of Powerset that runs in the same device as SegmentationModel.

I think we have 2 options here:

  1. We implement it as a kind of middleware or adapter. Essentially we could have a class PowersetAdapter so the PyannoteLoader can do something like return PowersetAdapter(Model.from_pretrained(model_info))
  2. We implement it as a block. Here we could add a PowersetToMultilabel block that simply expects a powerset input and does the conversion. For this, we'd have to know from the model whether it is powerset or not, for example by adding a @property abstract method to SegmentationModel. This could simply default to False so that it isn't a concern for most users

I would prefer the first one for now because it's automatic and has minimal impact, but we may have to move to the second one if someone else (other than pyannote) releases a powerset model.
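For comparison, here is a rough, dependency-free sketch of what option (2) could look like. Everything in it is an assumption made for illustration: the `is_powerset` property name, the `SegmentationModelSketch` class, the `powerset_to_multilabel` helper and the class ordering (subsets enumerated by increasing size) are hypothetical and are not taken from diart or pyannote.audio. It only shows the hard conversion: pick the best powerset class per frame, then mark every speaker in that class as active.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


class SegmentationModelSketch:
    """Hypothetical stand-in for a segmentation model base class."""

    @property
    def is_powerset(self) -> bool:
        # Defaults to False so that most users never have to think about it.
        return False


def powerset_classes(num_speakers: int, max_set_size: int) -> List[Tuple[int, ...]]:
    """Enumerate speaker subsets of size 0..max_set_size, smallest first."""
    classes: List[Tuple[int, ...]] = []
    for size in range(max_set_size + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def powerset_to_multilabel(
    frame_scores: Sequence[Sequence[float]],
    num_speakers: int,
    max_set_size: int,
) -> List[List[int]]:
    """Hard conversion from per-frame powerset scores to 0/1 speaker activity."""
    classes = powerset_classes(num_speakers, max_set_size)
    multilabel: List[List[int]] = []
    for scores in frame_scores:
        if len(scores) != len(classes):
            raise ValueError("expected one score per powerset class")
        # Index of the highest-scoring powerset class for this frame.
        best = max(range(len(scores)), key=lambda i: scores[i])
        active = classes[best]
        multilabel.append([1 if spk in active else 0 for spk in range(num_speakers)])
    return multilabel


# 3 speakers, at most 2 active at once -> 7 classes:
# (), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)
frames = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],  # best class: nobody speaking
    [0.0, 0.1, 0.0, 0.0, 0.8, 0.1, 0.0],  # best class: speakers 0 and 1
]
activity = powerset_to_multilabel(frames, num_speakers=3, max_set_size=2)
```

A block built this way would only run the conversion when the model reports `is_powerset` as true, which is the extra bookkeeping that option (1) avoids.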

Example of (1)

import torch
import torch.nn as nn
from pyannote.audio.utils.powerset import Powerset

class PowersetAdapter(nn.Module):
    def __init__(self, segmentation_model: nn.Module):
        super().__init__()  # must run before submodules are assigned
        self.model = segmentation_model
        self.powerset = Powerset(...)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Convert the powerset output into multilabel speaker activations
        return self.powerset.to_multilabel(self.model(waveform), soft=False)

class PyannoteLoader:
    ...
    def __call__(self) -> nn.Module:
        model = pyannote_loader.get_model(self.model_info, self.hf_token)
        # Wrap the model only if it declares a powerset output
        specs = getattr(model, "specifications", None)
        if specs is not None and specs.powerset:
            model = PowersetAdapter(model)
        return model

hbredin commented 8 months ago

Trying this but now diart.stream complains that AttributeError: 'PyannoteSegmentationModel' object has no attribute 'duration' even though I added the following properties to PowersetAdapter:

    @property
    def sample_rate(self) -> int:
        return self.model.hparams.sample_rate

    @property
    def duration(self) -> float:
        return self.model.specifications.duration

A bit lost here but it's late :-) Sleep will most likely help!

juanmc2005 commented 8 months ago

@hbredin that's weird, can you push the code so I can take a look? You probably need to forward specifications from PowersetAdapter to Model:

class PowersetAdapter(nn.Module):
    def __init__(self, segmentation_model: nn.Module):
        super().__init__()  # must run before submodules are assigned
        self.model = segmentation_model
        self.powerset = Powerset(...)

    @property
    def specifications(self):
        # Forward the wrapped model's specifications (None if it has none)
        return getattr(self.model, "specifications", None)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return self.powerset.to_multilabel(self.model(waveform), soft=False)

This is needed because PyannoteSegmentationModel expects the loaded model to have model.specifications.duration and model.specifications.sample_rate. Again, this will disappear when I move the config to a YAML file. That way we won't need a default duration or sample rate: both will be expected in the config or CLI args.

juanmc2005 commented 8 months ago

Thanks! I'll try to debug after work today or tomorrow and get back to you if it's not solved by then 😄

hbredin commented 8 months ago

Adding the specifications properties does not help.

juanmc2005 commented 8 months ago

@hbredin can you post the stacktrace?

hbredin commented 8 months ago

diart.stream --segmentation pyannote/segmentation-3.0 audio.wav
Traceback (most recent call last):
  File "REDACTED/bin/diart.stream", line 8, in <module>
    sys.exit(run())
  File "REDACTED/diart/src/diart/console/stream.py", line 107, in run
    pipeline = pipeline_class(config)
  File "REDACTED/diart/src/diart/blocks/diarization.py", line 97, in __init__
    msg = f"Latency should be in the range [{self._config.step}, {self._config.duration}]"
  File "REDACTED/diart/src/diart/blocks/diarization.py", line 74, in duration
    self._duration = self.segmentation.duration
  File "REDACTED/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'PyannoteSegmentationModel' object has no attribute 'duration'

juanmc2005 commented 8 months ago

@hbredin while we figure this out, you can override the duration with --duration 5. For the sample rate, which I imagine will be a similar problem, you can temporarily hard-code it in SpeakerDiarizationConfig. That should unblock you for now to try out the new model
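One possible explanation for the confusing message, which is not confirmed anywhere in this thread: when a class defines `__getattr__` (as torch.nn.Module does) and a `@property` getter raises AttributeError internally, Python falls back to `__getattr__` and reports the property itself as missing, hiding the real error. The sketch below reproduces that behavior without torch; `ModuleLike`, `Wrapper` and `ModelWithoutSpecs` are hypothetical stand-ins, not diart or pyannote classes.

```python
class ModuleLike:
    """Stand-in for torch.nn.Module: defines __getattr__ as a fallback lookup."""

    def __getattr__(self, name: str):
        # Only called when the normal attribute lookup has already failed.
        raise AttributeError(
            f"'{type(self).__name__}' object has no attribute '{name}'"
        )


class ModelWithoutSpecs:
    """A wrapped model that lacks the attribute the property relies on."""


class Wrapper(ModuleLike):
    def __init__(self, model):
        self.model = model

    @property
    def duration(self) -> float:
        # This raises AttributeError because `specifications` is missing.
        # Python then calls __getattr__("duration"), masking the real cause.
        return self.model.specifications.duration


def describe_failure() -> str:
    """Return the message a caller sees when reading `duration`."""
    try:
        Wrapper(ModelWithoutSpecs()).duration
    except AttributeError as error:
        return str(error)
    return "no error"


message = describe_failure()
# The message blames 'duration' on Wrapper, not 'specifications' on the model.
```

If this is what is happening, temporarily catching the AttributeError inside the property and re-raising it as a RuntimeError would reveal the attribute that is actually missing.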

hbredin commented 8 months ago

Ah! --duration=10 solves this first issue but it now complains about missing sample_rate :-) And there does not seem to be a --sample-rate option :-/

hbredin commented 8 months ago

Ok. Misread your previous comment. You were already aware of it :)

juanmc2005 commented 8 months ago

@hbredin I don't know if you branched from main but I highly recommend rebasing on top of develop now that #188 is merged

juanmc2005 commented 8 months ago

@hbredin we just broke a record here, performance on AMI using duration=10, step=0.5 and latency=5 (same as the paper except for the 10s context) gives DER=26.7. Previous best on AMI for that config was 27.3

This is without tuning rho_update and delta_new, which should squeeze a bit more performance. I would like to run the tuning myself but I fear my laptop will catch fire 😅 I'd really like to have a caching feature for that

hbredin commented 8 months ago

> @hbredin we just broke a record here, performance on AMI using duration=10, step=0.5 and latency=5 (same as the paper except for the 10s context) gives DER=26.7. Previous best on AMI for that config was 27.3
>
> This is without tuning rho_update and delta_new, which should squeeze a bit more performance. I would like to run the tuning myself but I fear my laptop will catch fire 😅 I'd really like to have a caching feature for that

Wait until I try with pyannote.premium ;-) What's the command line I should run?

hbredin commented 8 months ago

All checks are failing but I don't think they are related to this PR.

juanmc2005 commented 8 months ago

yeah don't worry about the "Quick Runs" CI fails, it's unrelated. It needs a huggingface token to run, and it can't find it in your fork's secrets. This is actually why I want to host a pair of freely available ONNX models somewhere to run the CI, probably even quantized models.

However, please format with black so the lint passes.

You can run the following command for the AMI eval:

diart.benchmark /ami/audio/test --reference /ami/rttms/test --segmentation pyannote/segmentation-3.0 --duration 10 --latency 5 --step 0.5 --tau-active 0.507 --rho-update 0.006 --delta-new 1.057 --batch-size 32 --num-workers 3

Now you start to see why I want to put configs in a yml file 😅
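To illustrate the idea only (this is not how diart implements or plans to implement its configuration), the options from the command above could live in a plain mapping, the kind of structure a YAML file would load into, and be expanded into flags. The `to_cli_args` helper and the `ami_options` name are hypothetical; the flag names and values are the ones from the command above.

```python
from typing import Dict, List, Union

Value = Union[str, int, float]


def to_cli_args(options: Dict[str, Value]) -> List[str]:
    """Turn {"tau_active": 0.507} into ["--tau-active", "0.507"]."""
    args: List[str] = []
    for name, value in options.items():
        args.extend([f"--{name.replace('_', '-')}", str(value)])
    return args


# Same settings as the AMI evaluation command, kept in one place.
ami_options: Dict[str, Value] = {
    "reference": "/ami/rttms/test",
    "segmentation": "pyannote/segmentation-3.0",
    "duration": 10,
    "latency": 5,
    "step": 0.5,
    "tau_active": 0.507,
    "rho_update": 0.006,
    "delta_new": 1.057,
    "batch_size": 32,
    "num_workers": 3,
}

command = ["diart.benchmark", "/ami/audio/test", *to_cli_args(ami_options)]
command_line = " ".join(command)
```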

juanmc2005 commented 8 months ago

@hbredin looks like something went wrong with your rebase. I'm missing changes from #188

sorgfresser commented 8 months ago

If I am not mistaken, @hbredin, we should not need to tune Rho (e.g. the SpeakerThreshold) for the Powerset model. As such, it might be worth subclassing the SpeakerDiarization pipeline with a custom hyper_parameters() function?

Edit: Rho should be tuned, see below, I confused rho and tau.

juanmc2005 commented 8 months ago

@sorgfresser you may still want to tune rho_update, the powerset model doesn't relieve you of this parameter. It's removing the embedding model that would help you with that.

Keep in mind that rho_update can be interpreted as "what percentage of the chunk must this embedding represent in order to update the clustering centroids?"
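A minimal sketch of that interpretation, assuming the check is "mean per-frame activation of the speaker over the chunk is at least rho_update". The function names are hypothetical and the real clustering code may differ in its details.

```python
from typing import Sequence


def active_fraction(activations: Sequence[float]) -> float:
    """Mean per-frame activation of one speaker over a chunk (0.0 to 1.0)."""
    if not activations:
        raise ValueError("a chunk must contain at least one frame")
    return sum(activations) / len(activations)


def may_update_centroid(activations: Sequence[float], rho_update: float) -> bool:
    """True if the speaker occupies enough of the chunk to update its centroid."""
    return active_fraction(activations) >= rho_update


# A speaker who is active in 1 frame out of 10 covers 10% of the chunk.
short_turn = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

permissive = may_update_centroid(short_turn, rho_update=0.006)  # tiny threshold
strict = may_update_centroid(short_turn, rho_update=0.3)        # needs 30% of the chunk
```

With a threshold as low as the 0.006 used in the benchmark command, almost any detected speech is allowed to update the centroids; a higher value restricts updates to speakers who dominate the chunk.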

sorgfresser commented 8 months ago

Sorry, I was referring to Tau, it's getting late...

juanmc2005 commented 8 months ago

Ok apart from the linting and the Inference import that we should remove, this is good to go from my side. I'll wait for those changes and merge

hbredin commented 8 months ago

Quick pyannote.premium run without any hyperparameters tuning:

diart.benchmark /ami/audio/test --reference /ami/rttms/test \
                                --segmentation ... \
                                --latency ... --step 0.5 \
                                --tau-active 0.507 --rho-update 0.006 --delta-new 1.057

| Segmentation | Embedding | Latency | FA | MD | SC |
|---|---|---|---|---|---|
| pyannote/segmentation-3.0 | pyannote/embedding | 5s | 3.7 | 10.1 | 12.6 |
| pyannote/premium 👀 | pyannote/embedding | 1s 👀 | 3.8 | 7.6 🎉 | 16.4 😠 |

FA = false alarm rate / MD = missed detection rate / SC = speaker confusion rate

Looks like the default clustering hyper-parameters used with pyannote/embedding (degraded SC 😠) are not adapted to the pyannote/premium segmentation (improved FA+MD 🎉).
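To make the trade-off concrete, the three rates can be added up, assuming the usual definition of diarization error rate as DER = FA + MD + SC with all three rates in percent of the same reference speech duration. The `der` helper is only illustrative and the totals are derived from the figures reported above, not from a separate run.

```python
def der(false_alarm: float, missed_detection: float, confusion: float) -> float:
    """Diarization error rate as the sum of its three components, in percent."""
    return round(false_alarm + missed_detection + confusion, 1)


# Figures reported above (FA, MD, SC).
der_segmentation_3_0 = der(3.7, 10.1, 12.6)  # 5s latency
der_premium = der(3.8, 7.6, 16.4)            # 1s latency

# Change in speaker confusion versus change in detection errors (FA + MD).
confusion_increase = round(16.4 - 12.6, 1)
detection_gain = round((3.7 + 10.1) - (3.8 + 7.6), 1)
```

Under that assumption, the detection errors improve by less than the speaker confusion degrades, which is consistent with the clustering hyper-parameters needing to be re-tuned.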

Still needs a bit of hparams tuning but very promising!

juanmc2005 commented 8 months ago

@hbredin nice! I see you've been having fun with diart.benchmark then 😄

sorgfresser commented 8 months ago

Sidenote: this currently requires the develop version of pyannote, since pyannote/pyannote-audio#1516 is needed.

hbredin commented 8 months ago

Not sure when I'll release that, so it would be safer to remove the use of soft=False, which is the default behavior in pyannote.audio 3.0.1 anyway.