mimbres / neural-audio-fp

https://mimbres.github.io/neural-audio-fp
MIT License

Reported train, val, test split sizes vs actual FMA size #40

Closed: raraz15 closed this issue 6 months ago

raraz15 commented 10 months ago

First of all, thank you very much for this work; having access to the repository gave me a jump start in the field.

After reading the paper, downloading the dataset from IEEE DataPort, and running some experiments, I noticed a discrepancy between the reported train/val/test split sizes and the actual size of FMA.

Could you clarify this, please?

I think this could be the reason why the metrics reported in the paper are inconsistent with the metrics we obtain by training an Adam N=120 model or by evaluating the provided 640_lamb model. See the related issue.

I ran some experiments to find these 6,542 tracks; I will report the results in the corresponding issue.
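
Roughly, the check boils down to a set difference over track IDs. A minimal sketch of what I mean (the directory names here are placeholders for illustration, not the actual layout of the IEEE DataPort release):

```python
from pathlib import Path

def track_ids(root: str) -> set[str]:
    """Collect track IDs from .wav filenames under `root` (recursive)."""
    return {p.stem for p in Path(root).rglob("*.wav")}

# Placeholder directory names -- substitute the real dataset directories.
full_fma = track_ids("fma_full")              # all FMA tracks
dummy_db = track_ids("test-dummy-db-100k")    # the ~100K dummy DB
holdouts = track_ids("test-val-holdout")      # test/val hold-out tracks

missing = full_fma - (dummy_db | holdouts)
print(f"{len(missing)} FMA tracks are not covered by any split")
```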

mimbres commented 10 months ago

Hi @raraz15, thank you for your thorough inspection. Firstly, the name test-dummy-db-100k refers to roughly 100K tracks, not exactly 100K. As you mentioned, FMA is composed of about 100K tracks, and the dummy DB falls short of 100K once the test/validation hold-out is excluded. Honestly, I just called it 100k for ease of reference, sorry for the confusion!!

Also, I haven't been able to identify the main cause of the performance improvement. This repo is a reconstruction of the code I used at the time of writing the paper, and there may have been bugs in the data configuration back then. Generally, increasing the DB by 6K songs would not make much of a performance difference, but if those songs happened to partially overlap with the test set, performance would drop. So I think your inference is plausible.
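
As an illustration of the overlap I mean, a quick check would be to intersect the track IDs of the extra songs with those of the test hold-out; a non-empty intersection would indicate contamination. The paths below are placeholders only:

```python
from pathlib import Path

def track_ids(root: str) -> set[str]:
    """Collect track IDs from .wav filenames under `root` (recursive)."""
    return {p.stem for p in Path(root).rglob("*.wav")}

# Placeholder paths -- substitute the real dataset directories.
extra_tracks = track_ids("extra-6k-tracks")   # the ~6K additional songs
test_tracks  = track_ids("test-holdout")      # the test hold-out

overlap = extra_tracks & test_tracks
print(f"{len(overlap)} of the extra tracks also appear in the test hold-out")
```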

EnthusiasticcitizenYe commented 9 months ago

Hello, author. Thank you for your remarkable contributions. I would like to ask whether you have figured out why the results of the code you provide are better than the ones reported in your paper.

mimbres commented 9 months ago

@EnthusiasticcitizenYe Unfortunately, as time has passed, it is now difficult to reproduce the paper results identically, and it is difficult to pin down the cause of the performance improvement. So if you cite this work for benchmarking, I recommend reporting the official re-implementation results alongside the result table in the paper. Sorry for the inconvenience.