pzinemanas / APNet

Audio Prototype Network (APNet)

Melspectrogram Normalization #3

Open · ChrisNick92 opened this issue 1 month ago

ChrisNick92 commented 1 month ago

Hello Pablo and everyone, very nice job on developing an interpretable model for sound event classification!

I am currently studying your approach and trying to follow your architecture, and I have one question regarding the input mel-spectrogram. As I see in your paper, the last layer uses a tanh activation, which maps values to the [-1, 1] range. However, the input mel-spectrogram does not have values in [-1, 1], so I assume you apply some normalization to bring it into the same range before computing the MSE loss.

May I ask what kind of normalization you apply to the input mel-spectrogram?

Thanks!

pzinemanas commented 1 month ago

Hi @ChrisNick92 ,

Glad to see you are trying APNet out. We use min-max normalization to bring the mel-spectrograms into the [-1, 1] range. See the configuration file and the place where the scaler is defined before training. Let me know if you need anything else.
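In case it helps, the idea is something along these lines (just a minimal sketch to show the mapping to [-1, 1], not the actual scaler code from the repo; function names are placeholders):

```python
import numpy as np

def fit_minmax(train_mels):
    # Compute the scaling statistics on the training set only.
    return float(train_mels.min()), float(train_mels.max())

def scale_to_range(mel, mel_min, mel_max):
    # Map values from [mel_min, mel_max] to [-1, 1].
    return 2.0 * (mel - mel_min) / (mel_max - mel_min) - 1.0
```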

Cheers!

ChrisNick92 commented 1 month ago

Thank you Pablo for your quick response. Actually, I have one more question that came up in the meantime. There might be a bug in my code, but I want to rule out that I am misunderstanding something or missing a detail that is not written in the paper.

I am trying to implement the architecture in PyTorch and evaluate it on the UrbanSound8K dataset. While both the MSE and the prototype loss decrease with each backprop step, the cross-entropy loss stays the same. I have checked the gradients of the linear layer (before the softmax) and they do get updated, but the loss remains high and the accuracy on the validation set is always around 10%.

Did you face any problems minimizing the cross-entropy? Is there any detail that I am missing or that is not written in the paper? I use a simple linear layer (with no bias) on the weighted similarity matrix $\hat{S}$, which has dimensions $N\times M$, and map it to output logits of shape $N\times 10$, where $10$ is the number of target classes.
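For reference, the classification head in my implementation looks roughly like this (a simplified sketch with placeholder dimensions, not my exact code):

```python
import torch
import torch.nn as nn

n_prototypes = 15    # M, placeholder value
n_classes = 10       # UrbanSound8K classes

# Linear layer with no bias, mapping the weighted similarity matrix
# S_hat of shape (N, M) to logits of shape (N, 10).
classifier = nn.Linear(n_prototypes, n_classes, bias=False)

s_hat = torch.rand(32, n_prototypes)      # dummy batch of similarities
logits = classifier(s_hat)                # shape (32, 10)
targets = torch.randint(0, n_classes, (32,))
loss = nn.CrossEntropyLoss()(logits, targets)
```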

pzinemanas commented 1 month ago

I can't remember whether I faced problems minimizing the cross-entropy. But I was reviewing the code and noticed that the fully-connected layer is initialized so that each output is connected to some of the prototypes before training.

https://github.com/pzinemanas/APNet/blob/master/experiments/UrbanSound8k/APNet/config.json#L36 https://github.com/pzinemanas/APNet/blob/master/apnet/model.py#L215

I think I did this to try to force the model to use a similar number of prototypes for each class. In any case, I'm not sure this will help you.
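The idea is roughly the following (a hypothetical sketch of this kind of initialization, not the actual Keras code from model.py; the helper name and block layout are made up):

```python
import numpy as np

def init_class_prototype_weights(n_prototypes, n_classes):
    # Hypothetical illustration: at initialization, each class output is
    # connected (weight 1) only to its own block of prototypes, and all
    # other weights are zero. Assumes n_prototypes is a multiple of n_classes.
    w = np.zeros((n_prototypes, n_classes))
    per_class = n_prototypes // n_classes
    for c in range(n_classes):
        w[c * per_class:(c + 1) * per_class, c] = 1.0
    return w
```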

ChrisNick92 commented 1 month ago

> I can't remember whether I faced problems minimizing the cross-entropy. But I was reviewing the code and noticed that the fully-connected layer is initialized so that each output is connected to some of the prototypes before training.
>
> https://github.com/pzinemanas/APNet/blob/master/experiments/UrbanSound8k/APNet/config.json#L36 https://github.com/pzinemanas/APNet/blob/master/apnet/model.py#L215
>
> I think I did this to try to force the model to use a similar number of prototypes for each class. In any case, I'm not sure this will help you.

Oh that might actually help. I'll try it, thanks again!

ChrisNick92 commented 1 month ago

Hello again! I tried the initialization you mentioned for the last layer, but it didn't change the results; the cross-entropy loss still does not decrease.

I was reviewing the paper again and noticed that the prototype loss, which involves the matrix $D$ of euclidean distances between the prototypes $P$ and the batch input samples $Z$, is minimized when $D$ contains zero values. One way this can happen is when both the prototypes $P_j$ and the samples $Z_i$ are near the origin (i.e., contain very small values).

I printed both the prototypes and the $Z_i$'s at each update during training and saw that their values are indeed on the order of $10^{-4}$. These small values hinder the cross-entropy minimization, since the matrix $S$, defined as

$$S_{ij}[f] = \exp\left(-\sum_{t=1}^T\sum_{c=1}^C\left(Z_i[t,f,c]-P_j[t,f,c]\right)^2\right)$$

remains almost constant, with most values close to 1, because the euclidean distance between the $Z_i$'s and the $P_j$'s is nearly zero. As a result, the gradients coming from the cross-entropy are essentially zero as well.
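To illustrate, here is a toy check (made-up dimensions, illustration only) showing that collapsed $Z$'s and $P$'s make $S$ saturate near 1:

```python
import torch

T, F, C, M = 4, 8, 2, 15               # toy dimensions (placeholders)
z = 1e-4 * torch.rand(T, F, C)         # one encoded sample, collapsed near the origin
p = 1e-4 * torch.rand(M, T, F, C)      # prototypes, also collapsed near the origin

# Squared euclidean distance summed over time and channels, kept per frequency band.
d = ((z.unsqueeze(0) - p) ** 2).sum(dim=(1, 3))   # shape (M, F)
s = torch.exp(-d)                                  # similarity matrix

print(s.min().item(), s.max().item())   # both ~1.0, so S is nearly constant
```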

I ran an experiment removing the prototype loss and minimizing only the MSE and the cross-entropy, just to check that my code is correct. I trained on the first 6 folds of UrbanSound8K, used the 7th and 8th folds as a validation set for early stopping, and evaluated on the last two folds. Here are the results:

                  precision    recall  f1-score   support

 air_conditioner       0.54      0.56      0.55       200
        car_horn       0.92      0.88      0.90        65
children_playing       0.81      0.80      0.80       200
        dog_bark       0.74      0.74      0.74       200
        drilling       0.63      0.68      0.66       200
   engine_idling       0.79      0.56      0.66       182
        gun_shot       0.98      0.87      0.92        63
      jackhammer       0.70      0.81      0.75       178
           siren       0.87      0.76      0.81       165
    street_music       0.78      0.92      0.85       200

        accuracy                           0.74      1653
       macro avg       0.78      0.76      0.76      1653
    weighted avg       0.75      0.74      0.74      1653

The results seem close to what you report (I guess you report 10-fold cross-validation, hence the small difference), so the overall architecture seems correct. May I ask whether you did anything to prevent this problem when adding the prototype loss, i.e., to keep the $Z_i$'s and the $P_j$'s from collapsing toward the origin?

Thanks again and sorry for the long post :)

pzinemanas commented 1 month ago

Hi @ChrisNick92

As you mentioned, it seems there is an issue when learning the prototypes. I'd check:

  1. The initialization of the prototypes (we are using uniform initialization)
  2. The initialization of the weighted-sum layer (we are using constant initialization)
  3. That the prototype loss is correctly calculated over the distance matrix (https://github.com/pzinemanas/APNet/blob/master/apnet/losses.py); a generic sketch follows below.
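For point 3, one common formulation of a prototype-style loss over a distance matrix $D$ of shape $N \times M$ looks like this (only an illustration of the general idea; the exact loss used in APNet is in losses.py):

```python
import torch

def prototype_loss(dist):
    # dist: (N, M) matrix of (squared) distances between the N batch
    # embeddings and the M prototypes.
    sample_term = dist.min(dim=1).values.mean()      # each sample -> closest prototype
    prototype_term = dist.min(dim=0).values.mean()   # each prototype -> closest sample
    return sample_term + prototype_term
```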