Closed: zxf-icpc closed this issue 2 months ago
Hello! Thank you for trying it out.
The `both` models are trained on both datasets, which differs from the setup in the paper (they score lower on each dataset than models trained on a single dataset, but perform better for the demo). You should try `audiocaps/large` to get the result in Table 1.
Additionally, if you run evaluation directly on raw waveforms, the scores will differ slightly from the reported ones (they are higher in the large case). You can get exactly the same scores by running inference on the preprocessed data (except for `clotho/base`, whose checkpoint we lost and had to reproduce ourselves). The discrepancy is likely due to the different audio resampling process used when inferring from raw waveforms, which we modified for the Gradio demo.
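To illustrate why the raw-waveform path can shift scores: two resampling algorithms targeting the same rate produce slightly different waveforms, and those small differences propagate into the extracted features and metrics. This is a hypothetical sketch (not the repo's actual preprocessing code) comparing FFT-based and polyphase resampling from SciPy on a test tone:

```python
import numpy as np
from scipy.signal import resample, resample_poly

# Hypothetical illustration: resampling 44.1 kHz -> 16 kHz with two
# different algorithms yields slightly different waveforms, which can
# nudge evaluation scores computed from raw audio.
sr_in, sr_out = 44100, 16000
t = np.arange(sr_in) / sr_in                      # 1 second of audio
wave = np.sin(2 * np.pi * 440.0 * t)              # 440 Hz test tone

fft_based = resample(wave, sr_out)                # FFT-domain resampling
poly_based = resample_poly(wave, sr_out, sr_in)   # polyphase FIR resampling

# The two outputs have the same length but are not sample-identical.
diff = np.max(np.abs(fft_based - poly_based))
print(f"max abs difference: {diff:.2e}")
```

This is why inference on the preprocessed data reproduces the reported numbers exactly: it bypasses the (modified) resampling step entirely.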
I will also share the predicted captions for the AudioCaps large model in a few days (I am in the middle of midterms right now).
I will try `audiocaps/large`. Thank you very much!
@zxf-icpc Here are the predictions for our original result.
Thank you very much for your help and response! The information you provided is very helpful to me.
Hi, I would like to express my admiration for the excellent work presented in your paper. After downloading the repository and attempting to reproduce the results from `both/large/pytorch_model.bin`, I noticed that my outcomes are slightly lower than those reported in Table 1 of the paper. Could this discrepancy be attributed to sampling in BART's decoding?
To help verify my reproduction process, would it be possible for you to share the predicted captions obtained in your study? Your assistance would be greatly appreciated.
Thank you for your time and consideration.
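A note on the sampling question raised above: if the decoder samples from the token distribution, scores vary from run to run, whereas greedy (or beam) decoding is deterministic and reproduces identically. This toy sketch (hypothetical distributions, not the actual BART decoder) shows the difference:

```python
import numpy as np

# Hypothetical next-token distributions for a 3-step generation
# over a vocabulary of 4 tokens (each row sums to 1).
probs = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.30, 0.30, 0.30, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

def greedy_decode(p):
    # Deterministic: pick the argmax token at every step.
    return [int(np.argmax(step)) for step in p]

def sample_decode(p, rng):
    # Stochastic: draw each token from that step's distribution.
    return [int(rng.choice(len(step), p=step)) for step in p]

greedy = greedy_decode(probs)        # identical on every run
rng = np.random.default_rng(0)
sampled = sample_decode(probs, rng)  # depends on the seed
print(greedy)                        # -> [1, 0, 3]
```

If the repo's evaluation uses greedy or beam search (e.g. `do_sample=False` in Hugging Face's `generate`), sampling cannot explain the gap; checking the decoding configuration is a quick way to rule it out.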