Hi @ChenyangLEI,
Thank you for your interest.
First, I think your attempt reproduced what I can do locally. As you can find in the FSD50K pre-training results in Table VI, the average is 70.1 %, and the SPCV2 result is ... about 86 % with 512-d embeddings (according to my local notes).
The difference in the results comes from two possible differences in settings:
I hope this answers your question.
I see. I will try to use 500 epochs first.
Also, how should I choose the 1/10 samples from AudioSet? I cannot find it in https://github.com/nttcslab/byol-a/blob/master/utils/preprocess_ds.py.
Thank you very much!
Hi @ChenyangLEI,
AudioSet pre-training is not supported, unfortunately, due to several difficulties. We learned how to make AudioSet samples available from the OpenL3 paper and the PANNs implementation.
I hope you can find a way to download and use it...
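For reference, a typical pattern for getting the clips (just an assumption on my side, not necessarily what the OpenL3 or PANNs authors do) is to download each video's audio and cut out the labeled 10-second segment listed in the AudioSet CSV, for example:
# hypothetical example: YTID and the 30-second start offset are placeholders from one AudioSet CSV row;
# 16 kHz mono is a common choice for audio pre-training
yt-dlp -x --audio-format wav -o 'raw_%(id)s.%(ext)s' "https://www.youtube.com/watch?v=YTID"
ffmpeg -i raw_YTID.wav -ss 30 -t 10 -ar 16000 -ac 1 audioset/YTID_30.wav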
Hi,
Thank you very much for your detailed explanation.
After training on 1/10 of AudioSet, I obtain similar performance on the SPCV2 dataset:
| Augmentation | Dataset | SPCV2 (%) |
|---|---|---|
| Mixup + RRC | AudioSet 1/10 | 87.4 |
| RRC | AudioSet 1/10 | 77.4 |
| Mixup + RRC | AudioSet 1/10 * | 87.7 |
| RRC | AudioSet 1/10 * | 78.7 |

Table 1. Reproduction of the official BYOL-A setting. *: since the full AudioSet is unavailable, we chose the same number of samples (210,315) from the latest AudioSet.
python -m utils.preprocess_ds nsynth ~/data/audio/nsynth/
python evaluate.py lightning_logs/version_$VERSION_NUM/checkpoints/epoch\=99-step\=82199.ckpt spcv2
python evaluate.py lightning_logs/version_$VERSION_NUM/checkpoints/epoch\=99-step\=82199.ckpt nsynth
I obtain the score as follows:
spcv2: epoch=99-step=82199,0.8768741481144934,0.0,0.8768741481144934
nsynth: epoch=99-step=82199,0.560302734375,0.0,0.560302734375
i.e., 56.0 % for NSynth and 87.7 % for SPCV2, which is much lower than Table II.
I am not sure what the problem is.
I believe I might have made some mistakes; could you help me with that? Thank you very much for your help!
Hi @ChenyangLEI,
Thanks for sharing your reproduction results.
Regarding the NSynth result, it's strange -- I need to check. The code in this repository doesn't reproduce the result even with the provided pre-trained weights.
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d512.pth nsynth
One piece of information so far: both the old code and my current local code reproduce the result with the weight AudioNTT2020-BYOLA-64x96d512.pth.
So there should be a problem with the code in this repository (this repository is just a clean-up and should not have fundamental differences from the versions that work).
Regarding the NSynth results in Table I (512-d) and Table II of the paper, I just noticed the difference, as you mention. It could be caused by a difference in the sweet spot of the maturity of the representations. Some tasks want representations that preserve more of the original audio, while others need representations that are more invariant to differences between sounds; NSynth could be the former. As you can see from the performance gap between 2048-d and 512-d, more representation capacity is basically better for NSynth. There are various things that could cause the differences in evaluation performance.
I will try to spot the reason for the NSynth reproducibility issue... Please hold on then.
Hi @ChenyangLEI,
I found the reason.
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth nsynth 4.0 10
This is as I wrote in the README:
# usage: python evaluate.py <your weight> <downstream task> <unit duration sec.> <# of iteration>
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth us8k 4.0 10
Please try it; the NSynth sample duration is 4.0 seconds. The default is 1.0 second, and it has to be changed to 4.0.
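The same arguments should also apply to the checkpoint you trained yourself (the path below is just the one from your earlier comment, with 10 iterations as in the README example):
# hypothetical rerun of your earlier NSynth evaluation with the 4.0 s unit duration
python evaluate.py lightning_logs/version_$VERSION_NUM/checkpoints/epoch\=99-step\=82199.ckpt nsynth 4.0 10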
P.S. This repository supports SPCV1/2 and NSynth reproduction so far, because the other downstream tasks depended on tensorflow-datasets implementations. I'm planning to open my next-generation evaluation code--which doesn't depend on any of those--in the near future, hopefully within this year, though I cannot guarantee it. FYI.
Hi @daisukelab
After using the correct sample duration, the results are close to the reported results in Table II.
Thank you very much for your help!
> Results on SPCV2 are reproduced
>
> After training on 1/10 of AudioSet, I obtain similar performance on the SPCV2 dataset:
@ChenyangLEI, could you please share how you sampled 1/10 of AudioSet?
Hi @WhiteTeaDragon ,
Thank you for your interest. Let me follow up on this topic. The 1/10 of AudioSet was originally created in the following way:
Please let me know if you need any other info.
Do I understand correctly that the resulting dataset might be unbalanced, since you do not care about the class balance in the second step?
Also, why would random sampling be better? With the current approach, your experiments can be reproduced (if the AudioSet videos are still available), but with random sampling there would be no chance of reproducing them.
That's correct. We didn't care about labels simply because we don't use them.
The reason why I think random sampling is better is simply based on the basics of statistical sampling: it avoids biasing the subset toward any specific trait of the data.
However, for the following reasons, I think the current choice should be no problem.
Please note that, in any case, YouTube samples are disappearing every day, so using AudioSet is never a perfectly reproducible experiment, though we can get almost the same results when testing supervised fine-tuning.
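If it helps, a rough sketch of the random 1/10 sampling could be as simple as the following (just an illustration, not the script we actually used; the directory and list file names are assumptions):
# list the downloaded AudioSet wav files, then randomly pick 210,315 of them
find ~/data/audio/audioset -name '*.wav' | sort > audioset_all_files.txt
shuf -n 210315 audioset_all_files.txt > audioset_1_10_files.txt
Publishing the resulting list file would be one way to keep the subset itself reproducible even while the underlying YouTube clips keep disappearing.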
Ok, thank you for your answer!
> @ChenyangLEI, could you please share how you sampled 1/10 of AudioSet?
I think I just randomly sampled AudioSet.
> I think I just randomly sampled AudioSet.
Cool, thanks!
Thanks @ChenyangLEI and @WhiteTeaDragon
Hi,
Thanks for sharing this great work! I tried to reproduce the results using the official guidance but I failed.
After processing the data, I ran the following commands:
However, the results are far from the reported ones.
Did I miss something important? Thank you very much.