Hello. I used the CoDi demo code (https://github.com/microsoft/i-Code/tree/main/i-Code-V3) to try to reproduce the results on the AudioCaps dataset. However, I could not match the numbers reported in the paper for the audio captioning and text-to-audio (TTA) tasks. The gap in performance is significant:
Frechet Audio Distance: 12.3379363
Kullback-Leibler Divergence (Sigmoid): 9.3400078
Kullback-Leibler Divergence (Softmax): 3.8197691
Inception Score Mean: 2.9589245
Inception Score Std: 0.2177440
Frechet Distance: 54.1079137

Bleu-1: 0.2448
Bleu-2: 0.0918
Bleu-3: 0.0287
Bleu-4: 0.0097
Rouge: 0.1928
CIDEr: 0.0689
METEOR: 0.0877
SPICE: 0.0504
SPIDEr: 0.0596

Here is my code:
The dataset_json is the one provided by AudioLDM.

Could you tell me what might be causing this discrepancy?