Hi @ChenyangLEI,
Thank you for your interest.
First, I think your attempt reproduced what I can do locally. As you can find in the FSD50K pre-training results in Table VI, the average is 70.1 %, and the SPCV2 result is ... about 86 % with 512-d embeddings (according to my local notes).
The difference in the results comes from two possible differences in settings:
I hope this answers your question.
I see. I will try to use 500 epochs first.
Also, how should I choose the 1/10 samples from AudioSet? I cannot find it in https://github.com/nttcslab/byol-a/blob/master/utils/preprocess_ds.py.
Thank you very much!
Hi @ChenyangLEI,
AudioSet pre-training is not supported, unfortunately, due to several difficulties. We learned how to make AudioSet samples available from the OpenL3 paper and the PANNs implementation.
I hope you can find a way to download and use it...
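For reference, a typical pattern for getting the clips (just an assumption on my side, not necessarily what the OpenL3 or PANNs authors do) is to download each video's audio and cut out the labeled 10-second segment listed in the AudioSet CSV, for example:
# hypothetical example: YTID and the 30-second start offset are placeholders from one AudioSet CSV row;
# 16 kHz mono is a common choice for audio pre-training
yt-dlp -x --audio-format wav -o 'raw_%(id)s.%(ext)s' "https://www.youtube.com/watch?v=YTID"
ffmpeg -i raw_YTID.wav -ss 30 -t 10 -ar 16000 -ac 1 audioset/YTID_30.wav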
Hi,
Thank you very much for your detailed explanation.
After training on 1/10 of AudioSet, I obtain similar performance on the SPCV2 dataset:
| Augmentation | Dataset | SPCV2 (%) |
|---|---|---|
| Mixup + RRC | AudioSet 1/10 | 87.4 |
| RRC | AudioSet 1/10 | 77.4 |
| Mixup + RRC | AudioSet 1/10 * | 87.7 |
| RRC | AudioSet 1/10 * | 78.7 |

Table 1. Reproduction of the official BYOL-A setting. *: since the full AudioSet is unavailable, we chose the same number of samples (210,315) from the latest AudioSet.
python -m utils.preprocess_ds nsynth ~/data/audio/nsynth/
python evaluate.py lightning_logs/version_$VERSION_NUM/checkpoints/epoch\=99-step\=82199.ckpt spcv2
python evaluate.py lightning_logs/version_$VERSION_NUM/checkpoints/epoch\=99-step\=82199.ckpt nsynth
I obtain the score as follows:
spcv2: epoch=99-step=82199,0.8768741481144934,0.0,0.8768741481144934
nsynth: epoch=99-step=82199,0.560302734375,0.0,0.560302734375
i.e., 56.0 % for NSynth and 87.7 % for SPCV2, which is much lower than Table II.
I am not sure what the problem is.
I believe I might have made some mistakes; could you help me with that? Thank you very much for your help!
Hi @ChenyangLEI,
Thanks for sharing your reproduction results.
Regarding the NSynth result, it's strange -- I need to check. The code in this repository doesn't reproduce the result even with the provided pre-trained weights.
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d512.pth nsynth
One piece of information so far: both the old code and my current local code reproduce the result with the weight AudioNTT2020-BYOLA-64x96d512.pth.
So there should be a problem with the code in this repository (this repository is just a clean-up and should not have fundamental differences from the versions that work).
Regarding the NSynth results in Table I (512-d) and Table II of the paper, I just noticed the difference, as you mention. It could be caused by a difference in the sweet spot of the maturity of the representations. Some tasks want representations that preserve more of the original audio, while others need representations that are more invariant to differences between sounds; NSynth could be the former. As you can see from the performance gap between 2048-d and 512-d, more representation capacity is basically better for NSynth. There are various things that could cause the differences in evaluation performance.
I will try to spot the reason for the NSynth reproducibility issue... Please hold on then.
Hi @ChenyangLEI,
I found the reason.
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth nsynth 4.0 10
This is as I wrote in the README:
# usage: python evaluate.py <your weight> <downstream task> <unit duration sec.> <# of iteration>
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth us8k 4.0 10
Please try it; the NSynth sample duration is 4.0 seconds. The default is 1.0 second, and it has to be changed to 4.0.
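The same arguments should also apply to the checkpoint you trained yourself (the path below is just the one from your earlier comment, with 10 iterations as in the README example):
# hypothetical rerun of your earlier NSynth evaluation with the 4.0 s unit duration
python evaluate.py lightning_logs/version_$VERSION_NUM/checkpoints/epoch\=99-step\=82199.ckpt nsynth 4.0 10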
P.S. This repository supports SPCV1/2 and NSynth reproduction so far, because the other downstream tasks depended on tensorflow-datasets implementations. I'm planning to open my next-generation evaluation code--which doesn't depend on any of those--in the near future, hopefully within this year, though I cannot guarantee it. FYI.
Hi @daisukelab
After using the correct sample duration, the results are close to the reported results in Table II.
Thank you very much for your help!
> Results on SPCV2 are reproduced
>
> After training on 1/10 of AudioSet, I obtain similar performance on the SPCV2 dataset:
@ChenyangLEI, could you please share how you sampled 1/10 of AudioSet?
Hi @WhiteTeaDragon ,
Thank you for your interest. Let me follow up on this topic. The 1/10 of AudioSet was originally created in the following way:
Please let me know if you need any other info.
Do I understand correctly that the resulting dataset might be unbalanced, since you do not care about the class balance in the second step?
Also, why would random sampling be better? With the current approach, your experiments can be reproduced (if the AudioSet videos are still available), but with random sampling there would be no chance of reproducing them.
That's correct. We didn't care about labels simply because we don't use them.
The reason why I think random sampling is better is simply based on the basics of statistical sampling: it avoids biasing the subset toward any specific trait of the data.
However, for the following reasons, I think the current choice should be no problem.
Please note that, in any case, YouTube samples are disappearing every day, so using AudioSet is never a perfectly reproducible experiment, though we can get almost the same results when testing supervised fine-tuning.
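If it helps, a rough sketch of the random 1/10 sampling could be as simple as the following (just an illustration, not the script we actually used; the directory and list file names are assumptions):
# list the downloaded AudioSet wav files, then randomly pick 210,315 of them
find ~/data/audio/audioset -name '*.wav' | sort > audioset_all_files.txt
shuf -n 210315 audioset_all_files.txt > audioset_1_10_files.txt
Publishing the resulting list file would be one way to keep the subset itself reproducible even while the underlying YouTube clips keep disappearing.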
Ok, thank you for your answer!
> @ChenyangLEI, could you please share how you sampled 1/10 of AudioSet?
I think I just randomly sampled AudioSet.
> I think I just randomly sampled AudioSet.
Cool, thanks!
Thanks @ChenyangLEI and @WhiteTeaDragon
Hi,
Thanks for sharing this great work! I tried to reproduce the results using the official guidance but I failed.
After processing the data, I ran the following commands:
However, the results are far from the reported ones.
Did I miss something important? Thank you very much.