kkoutini / PaSST

Efficient Training of Audio Transformers with Patchout
Apache License 2.0

Binarizing linear predictions #4

Open anarsultani97 opened 2 years ago

anarsultani97 commented 2 years ago

Dear authors,

Thank you for the great work! I would like to know the best way to binarize the predicted probabilities so that:

0 = the audio label is absent
1 = the audio label is present

If you have any suggestions on this binarization issue, it would be great to hear them.

One more question: as I understood from the paper, the probability value for each label indicates the presence of that label in the input audio, and this value does not depend on how long the label is active, i.e. whether it occurs for a very short or a long duration. Am I right?

Another question: is there any difference between feeding in audio that is typically 20-90 seconds long (and not monophonic) versus slicing it into chunks or running second-by-second predictions? Is it a good idea to run second-by-second predictions with PaSST?

I would appreciate your answers to the questions above.

Anar Sultani

kkoutini commented 2 years ago

Hi Anar, thank you for your interest.

The model returns logits. You can apply a sigmoid function to these to get values between 0 and 1, and then set a threshold (e.g. 0.5) to binarize the outputs.
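A minimal sketch of this thresholding step, assuming `logits` is a `(batch, num_labels)` tensor returned by the model; the 0.5 threshold is just an illustrative default and can be tuned per label:

```python
import torch

def binarize_logits(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Turn raw model logits into 0/1 label indicators.

    logits: tensor of shape (batch, num_labels) as returned by the model.
    threshold: probability above which a label is considered present.
    """
    probs = torch.sigmoid(logits)      # map logits to (0, 1)
    return (probs >= threshold).int()  # 1 = label present, 0 = label absent
```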

I don't believe the probability depends on the length of the event in the transformer model, because no pooling is applied. However, this may be the case for CNN models with global average pooling, for example.

I think you can run second-by-second predictions; we had competitive results in the HEAR challenge on short events. The wrapper is published as a pip package here. You can use it to get embeddings or logits for shorter audio clips. You can also change the window length (see the example).
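A rough sketch of second-by-second prediction under some assumptions: `model` is a loaded PaSST model treated as a callable that maps a `(batch, samples)` waveform tensor to `(batch, num_labels)` logits (the exact loading call depends on the pip wrapper and is not shown), and `waveform` is a 1-D mono tensor at 32 kHz:

```python
import torch

SAMPLE_RATE = 32_000  # assumed input sample rate for the AudioSet models
WINDOW_SECONDS = 1.0  # second-by-second prediction

def second_by_second_logits(model, waveform: torch.Tensor) -> torch.Tensor:
    """Run the model on consecutive 1-second windows of a long recording.

    model: callable mapping a (batch, samples) waveform to (batch, num_labels) logits
           (assumed interface; adapt to the wrapper you use).
    waveform: 1-D mono tensor sampled at SAMPLE_RATE.
    Returns a (num_windows, num_labels) tensor of logits, one row per second.
    """
    win = int(WINDOW_SECONDS * SAMPLE_RATE)
    # Drop the trailing partial window; pad instead if you need full coverage.
    n_windows = waveform.shape[0] // win
    chunks = waveform[: n_windows * win].reshape(n_windows, win)
    with torch.no_grad():
        logits = model(chunks)  # one forward pass over the batch of windows
    return logits
```

Combining this with the thresholding helper above gives per-second 0/1 label indicators; in practice you may prefer overlapping windows or a longer window closer to the 10-second clips the AudioSet models were trained on.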