Can you share the code you're running and how you're running it?
The released code has only been tested in inference mode and I suspect that the code we use for generating a bunch of framed examples from a clip will need to be a little different if you want to train the model.
https://colab.research.google.com/drive/1PT5Qiu8buPMNQM6jd32DUpsCU5DbFj_9
I also modified the params.py file. I changed: NUM_CLASSES = 4 and CLASSIFIER_ACTIVATION = 'softmax'.
@plakal I don't know if you've noticed but I accidentally closed and reopened my issue. I'm still waiting for your feedback though :+1:
Thanks for sharing the code.
There are several issues:
Don't change NUM_CLASSES or CLASSIFIER_ACTIVATION, because those changes are incompatible with the pre-trained YAMNet checkpoint that we provide. You can still focus on the subset of classes that you care about while letting the model produce independent scores for all 521 classes. In that params file, the only parameter you can realistically change is the hop between successive frames of the same clip.
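For reference, a sketch of the relevant constants, assuming the constant names used in the released params.py at the time (newer versions of the repo may package these differently):

```python
# params.py (sketch): values fixed by the pre-trained YAMNet checkpoint,
# except the patch hop, which only controls how densely a clip is framed.
SAMPLE_RATE = 16000                 # keep: the checkpoint expects 16 kHz mono input
PATCH_WINDOW_SECONDS = 0.96         # keep: each example covers 0.96 s
PATCH_HOP_SECONDS = 0.48            # safe to change: smaller hop = more overlapping examples
NUM_CLASSES = 521                   # keep: must match the checkpoint's classifier
CLASSIFIER_ACTIVATION = 'sigmoid'   # keep: YAMNet emits independent per-class scores
```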
The code as we have released it, and in particular yamnet_frames_model(), is designed to be used for inference only. The input is expected to be a single waveform (hence the requirement that the first dimension be 1, which is why you're seeing an error when you try to pass in a batch of 32). We generate a batch of examples from that single waveform, run that batch through the model to produce a batch of scores, and then aggregate those scores outside the model to produce scores for the whole clip. This leads to the next two issues that prevent you from simply applying model.fit().
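For concreteness, a minimal sketch of that single-waveform inference path, assuming the repo layout at the time (yamnet.py, params.py, the yamnet.h5 checkpoint) and that the model returns per-example scores plus the spectrogram; exact outputs may differ in newer versions:

```python
import numpy as np
import params
import yamnet as yamnet_model

model = yamnet_model.yamnet_frames_model(params)
model.load_weights('yamnet.h5')

# The model expects exactly ONE waveform with a leading dimension of 1;
# it frames that waveform into 0.96 s examples internally. Passing a batch
# of 32 waveforms is what triggers the "expected a dimension of 1, got 32" error.
waveform = np.random.uniform(-1.0, 1.0, size=3 * params.SAMPLE_RATE).astype(np.float32)
scores, spectrogram = model.predict(np.reshape(waveform, [1, -1]), steps=1)
print(scores.shape)  # (num_examples, 521): one score vector per 0.96 s example
```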
The frontend of the model does not currently deal with a batch of waveforms. But even if you made it work, it's not good practice to take a batch of waveforms, convert them into a bigger batch of examples, and then run those examples in order through the model, because you'd bias training by letting the model see a bunch of examples from the same clip at the same time. What you want to do is randomize the training examples. Since the core model accepts individual examples, you'll need to convert all your waveforms to examples ahead of time, shuffle those examples, and then feed the shuffled examples to model.fit() or whatever training loop you want to use.
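A rough, hypothetical sketch of that preparation step (the clip_to_examples helper and the dataset variable are placeholders, not part of the released code):

```python
import numpy as np

def clip_to_examples(waveform, sr=16000, win_s=0.96, hop_s=0.48):
    """Hypothetical helper: slice one waveform into fixed-length example windows."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return np.stack([waveform[i:i + win]
                     for i in range(0, len(waveform) - win + 1, hop)])

# Turn every clip into (example, label) pairs first, then shuffle globally so
# training never sees all examples of one clip in the same batch.
examples, labels = [], []
for waveform, label in dataset:   # dataset: your own list of (waveform, label) pairs
    ex = clip_to_examples(waveform)
    examples.append(ex)
    labels.append(np.full(len(ex), label))
X = np.concatenate(examples)
y = np.concatenate(labels)
perm = np.random.permutation(len(X))
X, y = X[perm], y[perm]           # shuffled examples ready for a training loop
```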
On the backend, the model produces a 521-wide vector of scores for each 0.96 s example. To get a prediction for a whole clip, you need to aggregate (e.g., average) the scores over all examples produced from that clip. This is done outside the model (see the code in inference.py that processes the model predictions), so it can't participate in a model.fit() (or other training loop) execution without some extra work.
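A small sketch of that aggregation step (the [num_examples, 521] shape is what the released model produces; the helper itself is just an illustration):

```python
import numpy as np

def clip_prediction(example_scores):
    """Average per-example scores of shape [num_examples, 521] into one clip-level vector."""
    clip_scores = np.mean(example_scores, axis=0)
    return clip_scores  # e.g. np.argmax(clip_scores) gives the top class for the clip

# With the inference sketch above: clip_scores = clip_prediction(scores)
```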
So you'll need to do a bunch of work to make YAMNet trainable in Keras. We're not Keras experts, so I can only offer some suggestions that you can take as a rough guide.
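One possible direction, sketched under some assumptions rather than tested: keep the pre-trained network as a frozen per-example feature extractor and train a small Keras head on its per-example outputs. The 1024-d feature size and the head architecture below are assumptions, not part of the released code:

```python
import tensorflow as tf

NUM_MY_CLASSES = 4  # e.g. the 4 classes in the question above

# Hypothetical per-example classifier head; it would be trained on features
# extracted ahead of time (one feature vector + label per 0.96 s example).
inputs = tf.keras.Input(shape=(1024,))                       # assumed feature size
x = tf.keras.layers.Dense(256, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(NUM_MY_CLASSES, activation='softmax')(x)
head = tf.keras.Model(inputs, outputs)
head.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

# head.fit(example_features, example_labels, batch_size=32, epochs=10)
# At prediction time, average the head's per-example outputs over a clip,
# as described above, to get a clip-level prediction.
```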
I'm sorry that I can't provide much more support than this right now. We might make a fine-tuneable version of YAMNet at some point but we have no plans for that at the moment so we currently only officially support YAMNet for inference.
Thank you for your time and for the detailed explanation.
What about VGGish? Do you think I can fine-tune it for my use case?
VGGish can be fine-tuned, and we provide a small demo of training as well. There are still a few issues, though.
I'm going to close this issue for now since this is about as much help as we can provide right now, but we are now aware that there is some demand for models that are easier to use at the clip level (so you don't have to deal with the example framing) and which can be fine-tuned, so we'll keep that in mind for future model releases and updates (but nothing is planned yet).
Thank you very much for your great answer to this question. With your help, I was able to fine-tune YAMNet on my dataset. But I have one more question. If I understand it correctly, each audio clip is divided into frames of length patch_window_seconds with a hop length of patch_hop_seconds, and the input of the model is a batch of these frames. What if a frame in a clip contains only silence and we still label it as our object of interest? Isn't that problematic? Of course, we can change the patch_window_seconds and patch_hop_seconds parameters in the parameter file, but how can we be sure that each frame ends up containing the audio of the object of interest? Maybe I have misunderstood the model.
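For reference, a small sketch of the framing arithmetic being described (the 0.96 s window and 0.48 s hop are the released defaults; the helper itself is just an illustration):

```python
# Number of examples produced from one clip, given the patch window/hop.
def num_examples(clip_seconds, patch_window_seconds=0.96, patch_hop_seconds=0.48):
    if clip_seconds < patch_window_seconds:
        return 0
    return 1 + int((clip_seconds - patch_window_seconds) // patch_hop_seconds)

print(num_examples(3.0))   # a 3 s clip yields 5 overlapping 0.96 s examples
```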
@falibabaei can you share your code please? I am also trying to fine-tune this model.
@SaminYaser-work give me your email and I will send it to you.
@falibabaei saminyaserwork@gmail.com tysm 😊😊
Done
@falibabaei can you also share your code with me, please? I'm trying to fine-tune the model to recognize 4 different speakers, but I'm having a lot of trouble. ximevzquez@gmail.com
Done
@falibabaei can you share your code with me, please? I am having trouble fine-tuning. akshita7603@gmail.com
@akshit7603 You can find it here https://github.com/falibabaei/yamnet_finetun
@falibabaei hi, could you send me a data sample for yamnet_finetun, please? qq815117718@gmail.com
I'm trying to fine-tune the YAMNet model for another audio classification problem that has 4 classes, but I keep getting this error:
Can not squeeze dim[0], expected a dimension of 1, got 32 [[node model_3/tf_op_layer_Squeeze/Squeeze (defined at:2) ]] [Op:__inference_train_function_51754]
My input shapes are X = (432,80000) and y = (432,)
I've tried both integer and one-hot encoding for the labels and still got the same error. @plakal @dpwe