kkoutini / PaSST

Efficient Training of Audio Transformers with Patchout
Apache License 2.0

time_new_pos_embed #42

Closed Antoine101 closed 6 months ago

Antoine101 commented 7 months ago

Hi Khaled,

I am playing with your code a bit and I struggle to understand these few lines below:

        # Adding Time/Freq information
        if first_RUN: print(" self.time_new_pos_embed.shape", self.time_new_pos_embed.shape)
        time_new_pos_embed = self.time_new_pos_embed
        if x.shape[-1] < time_new_pos_embed.shape[-1]:
            if self.training:
                toffset = torch.randint(1 + time_new_pos_embed.shape[-1] - x.shape[-1], (1,)).item()
                if first_RUN: print(f" CUT with randomoffset={toffset} time_new_pos_embed.shape",
                                    time_new_pos_embed.shape)
                time_new_pos_embed = time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]
            else:
                time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]
            if first_RUN: print(" CUT time_new_pos_embed.shape", time_new_pos_embed.shape)
        else:
            warnings.warn(
                f"the patches shape:{x.shape} are larger than the expected time encodings {time_new_pos_embed.shape}, x will be cut")
            x = x[:, :, :, :time_new_pos_embed.shape[-1]]
        x = x + time_new_pos_embed

Especially the slicing of time_new_pos_embed with toffset. I understand the slicing in the first else and in the second else, but I don't get why the slicing is randomized during training. If it's a position embedding, surely it shouldn't be random, right?

Many thanks in advance.

Antoine

kkoutini commented 7 months ago

Hi Antoine, you're right. The randomized slicing during training takes a sub-window of the time position embedding, so that position embeddings for clips longer than the training clips can still be learned. For example, the models passt-s-f128-20sec-p16-s10-ap.474-swa.pt and passt-s-f128-30sec-p16-s10-ap.473-swa.pt can accept audio clips of 20 or 30 seconds as input, while having been trained only on 10-second Audioset clips.

Antoine101 commented 7 months ago

Thank you for your swift reply!

Hmm... not sure I get it!

What I understood is that you have different models, each of which can take clips up to a different maximum length (10s, 20s, 30s). Their input size varies accordingly (128x998, 128x2000, 128x3000, ...).

If I build a model with the configuration associated with passt-s-f128-20sec-p16-s10-ap.474-swa.pt, do we agree that I will only be able to fine-tune or infer on clips that are AS long as or SHORTER than 20sec (but not more)?

In the first else, you do time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]], which makes sense to me: if I pass a 10sec clip, for example (while working with the model that can process clips up to 20sec), i.e. half the length the model was trained for, I want to associate the time position embeddings from 0 to 10sec worth of patches.

In the second else, you do x = x[:, :, :, :time_new_pos_embed.shape[-1]], which handles the case where the input clip is longer than what the model was trained for. So it makes sense here to trim x to time_new_pos_embed.shape[-1], since x is longer.

What I struggle to understand is the use of randomization at training time.

toffset = torch.randint(1 + time_new_pos_embed.shape[-1] - x.shape[-1], (1,)).item()
time_new_pos_embed = time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]

Why are you using a random offset here? Shouldn't it work like time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]? I would expect to always pass embeddings starting from 0. It seems that here you could associate embeddings meant for later patches with earlier patches. Or doesn't it work like this?

Let's say we have x as: x1 x2 x3 x4 x5
And our model is able to take in up to 10 patches: x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
Our time_new_pos_embed is initialized as: e1 e2 e3 e4 e5 e6 e7 e8 e9 e10
Here I would associate like this: x1+e1 x2+e2 x3+e3 x4+e4 x5+e5
But the code seems to suggest that during training, this can also happen, on a random basis: x1+e4 x2+e5 x3+e6 x4+e7 x5+e8, or x1+e2 x2+e3 x3+e4 x4+e5 x5+e6, etc.
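
A tiny toy reproduction of that slicing (with made-up shapes, not the real PaSST tensor dimensions) shows the pairing concretely:

import torch

# toy patches "x1..x5" and time encodings "e1..e10", shaped (1, 1, 1, T)
x = torch.arange(1, 6).float().view(1, 1, 1, 5)
time_new_pos_embed = torch.arange(1, 11).float().view(1, 1, 1, 10)

# training branch: pick a random start, then take a slice as long as x
toffset = torch.randint(1 + time_new_pos_embed.shape[-1] - x.shape[-1], (1,)).item()
cropped = time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]
print(toffset, cropped.squeeze().tolist())  # e.g. 3 -> [4.0, 5.0, 6.0, 7.0, 8.0]

# so x1..x5 are paired with e(toffset+1)..e(toffset+5), not necessarily e1..e5
out = x + cropped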

Obviously the above illustration doesn't reflect the right tensors dimensions but I tried to lay down my thinking as best as I could.

Thanks a lot again in advance for your help.

Antoine

kkoutini commented 7 months ago

Hi Antoine, I think you're completely right. The unclear part, I believe, is the randomization. Let's assume that the training data consists only of 10-second audio clips, but we need the model to give predictions on 20-second clips. During training, if we cut the trainable time position encoding by taking only the part that corresponds to the first 10 seconds, then the remaining time position encodings would never be trained, and the model could not predict on longer audio. The simple approach implemented here is to sample a sub-window of the time encoding that always corresponds to 10 seconds (to match the actual training audio length). However, the start of this 10-second window is randomly chosen within the 20-second encodings. So during training the model always sees only 10 seconds of audio and positional encoding, but these encodings shift randomly within all the possible encodings. Let me know if this makes sense. There are probably better and more suitable ways to accomplish this, but I chose this simple sampling method. Best, Khaled
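
A minimal sketch of that sampling scheme (assumed shapes and a hypothetical helper, not the actual PaSST module) could look like this:

import torch
import torch.nn as nn

# assumed toy sizes: the learnable time embedding covers 20 s of frames,
# while each training clip only covers 10 s
embed_dim, full_frames, clip_frames = 768, 200, 100
time_new_pos_embed = nn.Parameter(torch.zeros(1, embed_dim, 1, full_frames))

def add_time_pos_embed(x, training):
    # x: (batch, embed_dim, freq_patches, time_frames), time_frames <= full_frames
    t = x.shape[-1]
    if training:
        # random 10-second window: over many steps, every part of the
        # 20-second embedding receives gradient updates
        toffset = torch.randint(1 + full_frames - t, (1,)).item()
        pe = time_new_pos_embed[:, :, :, toffset:toffset + t]
    else:
        # at inference, a longer clip simply uses more of the embedding
        pe = time_new_pos_embed[:, :, :, :t]
    return x + pe

# e.g. a 10-second training batch and a 20-second inference clip
train_x = torch.randn(4, embed_dim, 8, clip_frames)
test_x = torch.randn(1, embed_dim, 8, full_frames)
_ = add_time_pos_embed(train_x, training=True)
_ = add_time_pos_embed(test_x, training=False)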

Antoine101 commented 7 months ago

Ok I think I understand your logic and why you chose to do this!

I thought that when you said your models accept inference on clips up to 20s or 30s, it meant they were respectively trained on strictly 20s or 30s clips. But you are saying that those models were trained on variable-length clips below these limits.

Regardless, it is a bit counterintuitive for me to randomize the time position encodings, as I would tend to think that if you randomly associate the encoding of a given index with a different patch time each time during training, the model is not going to be able to learn any positioning relationships. Or is it? For me, e1 should always be associated with x1, e2 with x2, and so on and so forth. It may not be a problem for stationary sounds, for which the mel-spectrogram will be similar from 0 to 10s and from 10 to 20s, for example, but what about acoustic signatures like a plane taking off, where you'll see a distinctive evolution of harmonics over the 20s (meaning the last 10 seconds are complementary to the first 10 seconds for classifying this sound)? I am trying to think about cases where this approach may prove problematic. Sorry if it's a bit fuzzy...

Have you tried with AND without randomization? The results you mentioned in your paper are really good so I guess it must work as is.

Cheers

Antoine

kkoutini commented 6 months ago

Regardless, it is a bit counterintuitive for me to randomize the time position encodings, as I would tend to think that if you randomly associate the encoding of a given index with a different patch time each time during training, the model is not going to be able to learn any positioning relationships. Or is it?

The encodings always cover 10 consecutive seconds, corresponding to 10 seconds of audio. Of course, you are right that it won't be as good as training on 20-second input. But given the limitation of having only 10-second training clips, this way each 10-second crop of the encodings gets trained to represent relative position. Keep in mind that Audioset's 10-second clips are often cut from longer recordings.

Have you tried with AND without randomization?

I did not try without randomization, because the remaining encodings (e11-e20) would not be learned during training.
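
A rough way to see that coverage argument (toy sizes only: 20 encoding frames, 10-frame crops):

import torch

full_frames, clip_frames, steps = 20, 10, 1000
hits = torch.zeros(full_frames)
for _ in range(steps):
    # same random-offset sampling as during training
    toffset = torch.randint(1 + full_frames - clip_frames, (1,)).item()
    hits[toffset:toffset + clip_frames] += 1
print((hits > 0).all().item())  # True: every encoding index (e1..e20) gets updated
# with toffset fixed at 0, hits[10:] would stay zero, i.e. e11..e20 untrained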

Antoine101 commented 6 months ago

Yeah, sorry, I thought about it multiple times and you're right. So the 10-second encoding is our fixed context, so to speak. We would then not perform that well for sound events that are longer than this, or that need to be heard for longer than 10s to be "recognized". But that would likely never be the case, as only a few seconds are sufficient in most cases (e.g. a few dog barks). And training all encodings randomly makes sense to get a model capable of inferring on longer audio.

Thank you for your reply, as always!