nttcslab / byol-a

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
https://arxiv.org/abs/2103.06695

Evaluation on voxforge #15

Closed Huiimin5 closed 1 year ago

Huiimin5 commented 1 year ago

Hi,

Thank you so much for your contribution. This work is very interesting, and your code is easy to follow. However, one of the downstream datasets, VoxForge, is missing from preprocess_ds.py. Could you please release the code for that dataset, too?

Thank you again for your time.

Best regards

daisukelab commented 1 year ago

Hi @Huiimin5, Thank you for your comments and attention.

I'm afraid we have moved on to the new evaluation package, EVAR. Could you please refer to it? It should provide complete information for evaluating various audio representations.

You might first visit the BYOL-A v2 page for a quick start.

We are sorry that this is not a quick solution for you, but I'm sure it is an answer that provides in-depth materials.

Huiimin5 commented 1 year ago

Hi, thank you so much for your quick response. It is a very nice platform for evaluating audio representations. I notice that the metadata of the VoxForge dataset is provided here: https://raw.githubusercontent.com/nttcslab/eval-audio-repr/main/evar/metadata/voxforge.csv. Does this mean that, with this metadata and using convert_wav() to preprocess the dataset downloaded with the script here: https://github.com/nttcslab/eval-audio-repr/blob/main/Preparing-datasets.md#voxforge, I should be able to reproduce the results reported in the first version of BYOL-A?
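For illustration, a minimal sketch of the preprocessing flow asked about above might look like the following. This is not EVAR's actual convert_wav(); the CSV column name (`file_name`), the folder layout, and the 16 kHz target rate are all assumptions made for the example.

```python
# Hypothetical sketch, NOT EVAR's convert_wav(): resample each wav listed
# in the metadata CSV to 16 kHz mono. Paths and column name are assumptions.
from pathlib import Path

import pandas as pd
import torchaudio

SRC = Path('downloads/voxforge')   # where the download script put the wavs
DST = Path('work/16k/voxforge')    # destination for resampled copies
SR = 16000

df = pd.read_csv('evar/metadata/voxforge.csv')
for rel_path in df.file_name:                        # assumed column name
    wav, org_sr = torchaudio.load(str(SRC / rel_path))
    wav = wav.mean(0, keepdim=True)                  # downmix to mono
    wav = torchaudio.transforms.Resample(org_sr, SR)(wav)
    out = DST / rel_path
    out.parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(str(out), wav, SR)
```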

daisukelab commented 1 year ago

Aha, I had forgotten about that. My quick answer is: it should also work for reproducing the first version. However, I need time to check whether it reproduces the results precisely. Please keep this open... hopefully until the end of this month. (ICASSP...) (But I am 90% sure that I reused a copy of the first version's materials to make it reproducible.)

Huiimin5 commented 1 year ago

OK, many thanks. I will also try to reproduce it and let you know the results. By the way, could you please also release the code for SPCV2/12? Thank you again for your time.

daisukelab commented 1 year ago

For SPCV2/12, we used TFDS: https://www.tensorflow.org/datasets/catalog/speech_commands (a loading sketch follows below). We are sorry, but we evaluated with old code that is not compatible with the current codebase. Honestly, we would need to rewrite it if we were going to make it open.

We stopped using the 12-class version of Speech Commands because its splits are not formally defined (or, I should say, they are unclear) and thus less reproducible.
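A minimal sketch of what "we used TFDS" could look like; note that the TFDS `speech_commands` dataset already ships the 12-class setup (10 command words plus `_silence_` and `_unknown_`) and generates the 'silence' clips itself, which matches the special handling mentioned later in this thread. The split choice here is only an example.

```python
# Sketch: load the 12-class Speech Commands via TFDS (split is arbitrary here).
import tensorflow_datasets as tfds

ds, info = tfds.load('speech_commands', split='validation', with_info=True)
print(info.features['label'].names)       # the 12 class names
for example in ds.take(1):
    audio = example['audio']              # 16 kHz PCM samples (int64)
    label = example['label']              # class index 0..11
```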

Huiimin5 commented 1 year ago

Got it. Could you please share the metadata you used in your experiment? Thank you for your help.

daisukelab commented 1 year ago

That's the problem: we relied entirely on the TFDS metadata. No CSV... I am sorry that we cannot help with it. P.S. As far as I recall, it was not only a matter of metadata; TFDS may have generated samples or applied special handling for the 'silence' class. That's why we needed to rely on TFDS.

daisukelab commented 1 year ago

@Huiimin5 Thank you for waiting. (Though I guess it's too late for your ICASSP submission.)

Quick answers

As far as using EVAR goes, we can reproduce results close to those in the newer BYOL-A paper (it was accepted and will be published soon in a journal).

What I could confirm so far

I confirmed results close to, but slightly better than, the first BYOL-A paper:

python lineareval.py config/byola.yaml voxforge -> resulted in 0.91817, 0.91874, 0.91867, ...
python lineareval.py config/byola.yaml vc1 -> resulted in about 0.424

It's the same situation with VoxCeleb1:

I confirmed that we also used TFDS for VoxCeleb1.

I hope these answer your questions. If you have anything else to ask, please feel free.

daisukelab commented 1 year ago

@Huiimin5 Hi, I'm closing this issue, but you can re-open it whenever you have more to ask. Thanks for your questions.