Hi @Huiimin5, Thank you for your comments and attention.
I'm afraid we have moved on to the new evaluation package, EVAR. Could you refer to it, please? It should provide complete information for evaluating various audio representations.
You might first visit the BYOL-A V2 page for a quick start.
We are sorry that this is not a quick solution for you, but I am sure it is the answer that provides the most in-depth materials.
Hi, Thank you so much for your quick response. It is a very nice platform for evaluating audio representations. I notice that the metadata of the VoxForge dataset is provided here: https://raw.githubusercontent.com/nttcslab/eval-audio-repr/main/evar/metadata/voxforge.csv. Does this mean that, with this metadata and using convert_wav() to preprocess the dataset downloaded with the script here: https://github.com/nttcslab/eval-audio-repr/blob/main/Preparing-datasets.md#voxforge, I should be able to reproduce the results reported in the first version of BYOL-A?
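For concreteness, this is roughly the pipeline I have in mind (a sketch only; the metadata column name, directory layout, and 16 kHz mono format are my assumptions, not confirmed EVAR defaults):

```python
# Illustrative stand-in for EVAR's convert_wav(): resample every file listed
# in the metadata CSV and write the converted copies to a work directory.
# Assumed: the CSV has a 'file_name' column; the paths and 16 kHz rate are guesses.
from pathlib import Path

import librosa
import pandas as pd
import soundfile as sf

meta = pd.read_csv('evar/metadata/voxforge.csv')
src_root = Path('downloads/voxforge')  # where the download script left the files
dst_root = Path('work/16k/voxforge')   # converted copies used for evaluation

for file_name in meta.file_name:
    wav, sr = librosa.load(src_root / file_name, sr=16000, mono=True)  # resamples
    dst = dst_root / file_name
    dst.parent.mkdir(parents=True, exist_ok=True)
    sf.write(dst, wav, sr)
```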
Aha, I had forgotten about that. My quick answer is: it should reproduce the first version as well. However, I need time to check whether it reproduces the results precisely. Please keep this open until, hopefully, the end of this month (ICASSP...). (But I am 90% sure that I reused a copy of the first version's materials to make it reproducible.)
OK, many thanks. I will also try to reproduce the results and let you know. By the way, could you please also release the code for SPCV2/12? Thank you again for your time.
For SPCV2/12, we used TFDS: https://www.tensorflow.org/datasets/catalog/speech_commands. We are sorry that we evaluated it with old code that is not compatible with the current codebase. Honestly, we would need to rewrite it if we were going to make it open.
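For reference, loading it through TFDS looks roughly like this (a sketch, not our original evaluation code; only the dataset name and the standard TFDS calls come from the catalog page above):

```python
# Minimal sketch of pulling Speech Commands V2 via TFDS.
import tensorflow_datasets as tfds

# TFDS defines train/validation/test splits for 'speech_commands'.
ds, info = tfds.load('speech_commands', split='test', with_info=True)
print(info.features['label'].names)  # 12 classes, including '_silence_' and '_unknown_'

for example in ds.take(1):
    audio = example['audio']  # 16 kHz waveform
    label = example['label']
```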
We stopped using the 12-class version of Speech Commands because the splits are not formally defined (or, I should say, they are unclear) and thus less reproducible.
Got it. Could you please share the metadata you used in your experiment? Thank you for your help.
That's the problem. We relied entirely on the TFDS metadata. No CSV... I am sorry that we cannot help with it. P.S. As far as I recall, it was not only a matter of metadata. TFDS might generate samples or do some special handling for 'silence'. That's why we needed to rely on TFDS.
@Huiimin5 Thank you for waiting (though I guess it's too late for your ICASSP submission).
Using EVAR, we can reproduce results close to those in the newer BYOL-A paper (it was accepted and will be published in a journal soon).
I confirmed results close to, but better than, the first BYOL-A paper:
python lineareval.py config/byola.yaml voxforge -> 0.91817, 0.91874, 0.91867, ...
python lineareval.py config/byola.yaml vc1 -> about 0.424
I confirmed that we also used TFDS for VoxCeleb1.
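If it helps, the TFDS catalog also has a 'voxceleb' entry, and loading it follows the same pattern (a sketch; note that, as far as I remember, TFDS cannot download VoxCeleb automatically, so the source data must be fetched and placed manually first):

```python
# Sketch: VoxCeleb through TFDS, same pattern as speech_commands above.
# The source archives must be downloaded manually before tfds.load() will work.
import tensorflow_datasets as tfds

ds = tfds.load('voxceleb', split='train')
```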
I hope this answers your questions. If you need anything else, please feel free to ask.
@Huiimin5 Hi, I'm closing this issue, but you can re-open it whenever you have something to ask. Thanks for your questions.
Hi,
Thank you so much for your contribution. This work is very interesting, and your code is easy to follow. However, one of the downstream datasets, VoxForge, is missing from preprocess_ds.py. Could you please release the code for that dataset, too?
Thank you again for your time.
Best regards