kyuyeonpooh / objects-that-sound

An unofficial implementation of the paper "Objects that Sound" (ECCV 2018).
BSD 3-Clause "New" or "Revised" License

Detailed dataset preprocessing and format for the AVE-Dataset #7

Open JackHenry1992 opened 3 years ago

JackHenry1992 commented 3 years ago

Thanks for sharing your great work. Can you provide the detailed process for preprocessing the AVE-Dataset?

JackHenry1992 commented 3 years ago

I have processed the AVE-Dataset using preprocess.py and generated the trainset, but the loss did not decrease during the training phase: epoch: 32, step: 79, train_loss: 0.8976, train_acc: 0.4969, lr: 0.000010

kyuyeonpooh commented 3 years ago

Hi,

Thank you for your interest in my code and project.

Data preprocessing

In my case, I first downloaded the videos directly from YouTube using youtube_dl and saved each one as [YouTube ID of video].mp4
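For illustration, a minimal sketch of this download step might look like the following. The output template and video URL here are placeholders, not necessarily the exact settings used in this repository.

import youtube_dl  # pip install youtube_dl

# Save each video as "<YouTube ID>.mp4", matching the naming convention above.
ydl_opts = {
    "format": "mp4",
    "outtmpl": "%(id)s.%(ext)s",  # e.g. dQw4w9WgXcQ.mp4
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    # Hypothetical video ID; replace with IDs taken from the dataset annotations.
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])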

With the above naming convention, once you configure the path settings in the config.ini file, you can run preprocess.py to generate the training data.

For more details, refer to utils/extractor.py. You can also adjust the behavior by changing the parameters of the Extractor class methods.
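As a rough illustration of what such an extraction step typically does (this is a generic sketch, not the actual code in utils/extractor.py; the file names, frame rate, and sample rate are assumptions), one could pull frames and audio from each clip with ffmpeg:

import subprocess

def extract_frames_and_audio(video_path, frame_dir, wav_path):
    # Sample one frame per second from the video (assumed rate).
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1", f"{frame_dir}/%03d.jpg"],
        check=True,
    )
    # Extract mono audio at 48 kHz (assumed rate) for spectrogram computation.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", "48000", wav_path],
        check=True,
    )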


Loss not decreasing

I also faced this issue. It seems to occur because the last fully connected layer is so tiny, and therefore so vulnerable to noisy data, compared to the other layers. Once that layer is misguided, it may never recover to the expected state.

Here are several tips that might help you. However, please keep in mind that training does not always succeed even if you apply all of the solutions below.

1. Learning rate. I found that a learning rate less than or equal to 5e-5 was helpful for successful training, while a learning rate greater than 1e-4 tends to fail (see the sketch after this list).

2. Use a larger batch. A larger batch size usually seems to help training, as the data in AudioSet is quite noisy. In my case, I used a batch size of 64, as shown in the sketch below.

3. In case of training AVE-Net: tweak the parameters of the last fully connected layer. As you can see in models/avenet.py, self.fc3 in AVE-Net has only 4 parameters (a 2x1 weight and a 2-element bias). Because this tiny layer is very vulnerable to gradient noise, I initialized it with fixed values to make it more robust to the noisy data.

Please change this part as shown below. https://github.com/kyuyeonpooh/objects-that-sound/blob/d19f971021a9219aa0987dadeaf7942ec7e4f31a/model/avenet.py#L24-L25

# fc3 maps the scalar distance between the image and audio embeddings
# to two class logits, so it has only 4 parameters: a 2x1 weight and a
# 2-element bias.
self.fc3 = nn.Linear(1, 2)
# Fixed initialization instead of a random one: one logit decreases with
# the distance while the other increases, so the layer starts in a
# sensible state and is more robust to noisy data.
self.fc3.weight.data[0] = -0.7
self.fc3.weight.data[1] = 0.7
self.fc3.bias.data[0] = 1.2
self.fc3.bias.data[1] = -1.2

4. One more tip. In my case, once the loss dropped below 0.69 (about ln 2, i.e., the loss of random guessing on the two-way correspondence task), training had gone successfully.
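Here is a minimal sketch of tips 1 and 2 put together. The optimizer choice, function name, and the assumption that the model takes an (image, audio) pair are illustrative, not necessarily the repository's exact setup.

import torch
from torch.utils.data import DataLoader, Dataset

def train_one_epoch(model: torch.nn.Module, train_dataset: Dataset, device="cpu"):
    # Tip 2: a larger batch (64) helps smooth out the noisy data.
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    # Tip 1: keep the learning rate at or below 5e-5.
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for image, audio, label in loader:
        image, audio, label = image.to(device), audio.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(image, audio), label)  # chance level is ln 2 ~= 0.693 (tip 4)
        loss.backward()
        optimizer.step()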


Comment: Pretrained models are available! Please use them if you need them.

If you have any questions or run into any more issues, feel free to contact me. You can also open issues in the repository, which I can check right away.

Sincerely, Kyuyeon.