Can you share the code you're running and how you're running it?
The released code has only been tested in inference mode and I suspect that the code we use for generating a bunch of framed examples from a clip will need to be a little different if you want to train the model.
https://colab.research.google.com/drive/1PT5Qiu8buPMNQM6jd32DUpsCU5DbFj_9
I also modified the params.py file. I changed: NUM_CLASSES = 4 and CLASSIFIER_ACTIVATION = 'softmax'.
@plakal I don't know if you've noticed but I accidentally closed and reopened my issue. I'm still waiting for your feedback though :+1:
Thanks for sharing the code.
There are several issues:
Don't change NUM_CLASSES or CLASSIFIER_ACTIVATION, because those changes are incompatible with the pre-trained YAMNet checkpoint that we provide. You can still focus on the subset of classes that you care about while letting the model produce independent scores for all 521 classes. In that params file, the only parameter you can realistically change is the hop between successive frames of the same clip.
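For reference, a sketch of the relevant constants, assuming the constant names used in the released params.py at the time (newer versions of the repo may package these differently):

```python
# params.py (sketch): values fixed by the pre-trained YAMNet checkpoint,
# except the patch hop, which only controls how densely a clip is framed.
SAMPLE_RATE = 16000                 # keep: the checkpoint expects 16 kHz mono input
PATCH_WINDOW_SECONDS = 0.96         # keep: each example covers 0.96 s
PATCH_HOP_SECONDS = 0.48            # safe to change: smaller hop = more overlapping examples
NUM_CLASSES = 521                   # keep: must match the checkpoint's classifier
CLASSIFIER_ACTIVATION = 'sigmoid'   # keep: YAMNet emits independent per-class scores
```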
The code as we have released it, and in particular yamnet_frames_model(), is designed to be used for inference only. The input is expected to be a single waveform (hence the requirement that the first dimension be 1, which is why you're seeing an error when you try to pass in a batch of 32). We generate a batch of examples from that single waveform, run that batch through the model to produce a batch of scores, and then aggregate those scores outside the model to produce scores for the whole clip. This leads to the next two issues that prevent you from simply applying model.fit().
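For concreteness, a minimal sketch of that single-waveform inference path, assuming the repo layout at the time (yamnet.py, params.py, the yamnet.h5 checkpoint) and that the model returns per-example scores plus the spectrogram; exact outputs may differ in newer versions:

```python
import numpy as np
import params
import yamnet as yamnet_model

model = yamnet_model.yamnet_frames_model(params)
model.load_weights('yamnet.h5')

# The model expects exactly ONE waveform with a leading dimension of 1;
# it frames that waveform into 0.96 s examples internally. Passing a batch
# of 32 waveforms is what triggers the "expected a dimension of 1, got 32" error.
waveform = np.random.uniform(-1.0, 1.0, size=3 * params.SAMPLE_RATE).astype(np.float32)
scores, spectrogram = model.predict(np.reshape(waveform, [1, -1]), steps=1)
print(scores.shape)  # (num_examples, 521): one score vector per 0.96 s example
```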
The frontend of the model does not currently deal with a batch of waveforms. But even if you made it work, it's not good practice to take a batch of waveforms, convert them into a bigger batch of examples, and then run those examples in order through the model, because you'd bias training by letting the model see a bunch of examples from the same clip at the same time. What you want to do is randomize the training examples. Since the core model accepts individual examples, you'll need to convert all your waveforms to examples ahead of time, shuffle those examples, and then feed the shuffled examples to model.fit() or whatever training loop you want to use.
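A rough, hypothetical sketch of that preparation step (the clip_to_examples helper and the dataset variable are placeholders, not part of the released code):

```python
import numpy as np

def clip_to_examples(waveform, sr=16000, win_s=0.96, hop_s=0.48):
    """Hypothetical helper: slice one waveform into fixed-length example windows."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return np.stack([waveform[i:i + win]
                     for i in range(0, len(waveform) - win + 1, hop)])

# Turn every clip into (example, label) pairs first, then shuffle globally so
# training never sees all examples of one clip in the same batch.
examples, labels = [], []
for waveform, label in dataset:   # dataset: your own list of (waveform, label) pairs
    ex = clip_to_examples(waveform)
    examples.append(ex)
    labels.append(np.full(len(ex), label))
X = np.concatenate(examples)
y = np.concatenate(labels)
perm = np.random.permutation(len(X))
X, y = X[perm], y[perm]           # shuffled examples ready for a training loop
```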
On the backend, the model produces a 521-wide vector of scores for each 0.96 s example. To get a prediction for a whole clip, you need to aggregate (e.g., average) the scores over all examples produced from that clip. This is done outside the model (see the code in inference.py that processes the model predictions), so it can't participate in a model.fit() (or other training loop) execution without some extra work.
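A small sketch of that aggregation step (the [num_examples, 521] shape is what the released model produces; the helper itself is just an illustration):

```python
import numpy as np

def clip_prediction(example_scores):
    """Average per-example scores of shape [num_examples, 521] into one clip-level vector."""
    clip_scores = np.mean(example_scores, axis=0)
    return clip_scores  # e.g. np.argmax(clip_scores) gives the top class for the clip

# With the inference sketch above: clip_scores = clip_prediction(scores)
```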
So you'll need to do a bunch of work to make YAMNet trainable in Keras. We're not Keras experts, so I can only offer some suggestions that you can take as a rough guide.
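One possible direction, sketched under some assumptions rather than tested: keep the pre-trained network as a frozen per-example feature extractor and train a small Keras head on its per-example outputs. The 1024-d feature size and the head architecture below are assumptions, not part of the released code:

```python
import tensorflow as tf

NUM_MY_CLASSES = 4  # e.g. the 4 classes in the question above

# Hypothetical per-example classifier head; it would be trained on features
# extracted ahead of time (one feature vector + label per 0.96 s example).
inputs = tf.keras.Input(shape=(1024,))                       # assumed feature size
x = tf.keras.layers.Dense(256, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(NUM_MY_CLASSES, activation='softmax')(x)
head = tf.keras.Model(inputs, outputs)
head.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

# head.fit(example_features, example_labels, batch_size=32, epochs=10)
# At prediction time, average the head's per-example outputs over a clip,
# as described above, to get a clip-level prediction.
```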
I'm sorry that I can't provide much more support than this right now. We might make a fine-tuneable version of YAMNet at some point but we have no plans for that at the moment so we currently only officially support YAMNet for inference.
Thank you for your time and for the detailed explanation.
What about VGGish? Do you think I can fine-tune it for my use case?
VGGish can be fine-tuned, and we provide a small demo of training as well. There are still a few issues, though.
I'm going to close this issue for now since this is about as much help as we can provide right now, but we are now aware that there is some demand for models that are easier to use at the clip level (so you don't have to deal with the example framing) and which can be fine-tuned, so we'll keep that in mind for future model releases and updates (but nothing is planned yet).
Thank you very much for your great answer to this question. With your help, I was able to fine-tune YAMNet on my dataset. But I have one more question. If I understand it correctly, each audio clip is divided into frames of length patch_window_seconds with a hop length of patch_hop_seconds, and the input of the model is a batch of these frames. What if a frame in a clip contains only silence and we still label it as our object of interest? Isn't that problematic? Of course, we can change the patch_window_seconds and patch_hop_seconds parameters in the parameter file, but how can we be sure that each frame ends up containing the audio of the object of interest? Maybe I have misunderstood the model.
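For reference, a small sketch of the framing arithmetic being described (the 0.96 s window and 0.48 s hop are the released defaults; the helper itself is just an illustration):

```python
# Number of examples produced from one clip, given the patch window/hop.
def num_examples(clip_seconds, patch_window_seconds=0.96, patch_hop_seconds=0.48):
    if clip_seconds < patch_window_seconds:
        return 0
    return 1 + int((clip_seconds - patch_window_seconds) // patch_hop_seconds)

print(num_examples(3.0))   # a 3 s clip yields 5 overlapping 0.96 s examples
```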
@falibabaei can you share your code please? I am also trying to fine-tune this model.
@SaminYaser-work give me your email and I will send it to you.
@falibabaei saminyaserwork@gmail.com tysm 😊😊
Done
@falibabaei can you also share your code with me, please? I'm trying to fine-tune the model to recognize 4 different speakers, but I'm having a lot of trouble. ximevzquez@gmail.com
Done
@falibabaei can you share your code with me, please? I am having trouble fine-tuning. akshita7603@gmail.com
@akshit7603 You can find it here https://github.com/falibabaei/yamnet_finetun
@falibabaei hi, could you send me a data sample for yamnet_finetun, please? qq815117718@gmail.com
I'm trying to fine-tune the YAMNet model for another audio classification problem that has 4 classes, but I keep getting this error:
Can not squeeze dim[0], expected a dimension of 1, got 32 [[node model_3/tf_op_layer_Squeeze/Squeeze (defined at:2) ]] [Op:__inference_train_function_51754]
My input shapes are X = (432,80000) and y = (432,)
I've tried both integer and one-hot encoding for the labels and still got the same error. @plakal @dpwe