MinnieUoL opened this issue 10 months ago
Thank you for sharing this valuable work. However, I encountered some peculiarities when testing the model. It appears that the performance significantly deviates from what is demonstrated on the demo page. I'm curious if others have observed similar discrepancies, or perhaps I've made an error. If that's the case, I would appreciate any corrections from the authors.
I followed the instructions in the README to set up the gradio environment and used the M2UGEN-MusicGEN-Medium model for inference, which performs best in the paper. Below are screenshots of the results.
- For music understanding: While the generated audio seems acceptable, the description is entirely inaccurate.
- For music editing: The edited music retains its original piano sound.
- For image-to-music: I tested three common instruments with simple images, but the recognition was completely off, and the generated music bore no relevance to the image.
- I tried to replicate a result from the demo page. However, despite running the experiments three times, I was unable to achieve the perfect results showcased in the demo.
- I tried to generate music that would suit the mood of an image without any instrument constraints. The image's mood should be perceived as exciting and fast-paced by common standards. However, the system returned a slow and calm piece of music.
All of the above experiments were conducted without cherry-picking and were generated on the first attempt. Please do not hesitate to correct me if I have made any mistakes.
Have you tested other hyperparameters on the gradio demo page, such as topP and temperature?
Previously, I did not modify the topP and temperature hyperparameters, opting to maintain their default values, as I observed that these were the exact settings used on your demo page. However, I have now tested with various other values and found that both the text and audio feedback still appear to be incorrect. The feedback seems to be generated randomly. Please find the results below.
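For reference, this is roughly the sweep I ran. Note that `m2ugen_infer` is a stand-in for whatever inference callback the gradio demo wires up; only the slider names `temperature` and `top_p` come from the actual UI:

```python
from itertools import product

def m2ugen_infer(prompt: str, audio_path: str, temperature: float, top_p: float):
    # Placeholder for the gradio demo's inference function; in the real demo
    # these values come from the UI sliders and feed the model's generate call.
    return "(model response here)", "output.wav"

# Sweep a grid of sampling settings and eyeball the text responses.
for temperature, top_p in product((0.2, 0.6, 0.9), (0.5, 0.8, 0.95)):
    text, audio = m2ugen_infer(
        prompt="Describe this piece of music.",
        audio_path="input.wav",
        temperature=temperature,
        top_p=top_p,
    )
    print(f"temperature={temperature} top_p={top_p} -> {text}")
```

Across all of these settings the text descriptions were still unrelated to the input audio.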
I also encountered the same problem as you. When I set the temperature to 0, the output of music understanding with different audio inputs and the same prompt is exactly the same. I am very confused about this.
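Part of that is expected behavior: at temperature 0, sampling collapses to greedy argmax decoding, so for a fixed prompt the only thing that can change the output is the conditioning signal. A minimal sketch of what the sampler reduces to (illustrative PyTorch, not the repo's actual code):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float, top_p: float) -> int:
    """Illustrative top-p sampler, not M2UGen's actual implementation."""
    if temperature <= 0:
        # Temperature 0 degenerates to greedy argmax decoding:
        # the same prompt always yields the same continuation.
        return int(torch.argmax(logits))
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose mass reaches top_p
    # (the shift by sorted_probs guarantees at least one survivor).
    keep = cumulative - sorted_probs <= top_p
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered = filtered / filtered.sum()
    choice = torch.multinomial(filtered, num_samples=1)
    return int(sorted_idx[choice])
```

If different audio clips still produce identical text at temperature 0, that would suggest the audio features are barely moving the logits, which would be consistent with the random-looking feedback reported above.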
Thank you for your hard work. It seems the issue persists regardless of whether the gradio defaults are kept or changed. Will there be an update soon?
We're working hard to update our model with new strategies to minimize hallucinations. Stay tuned.
What inference speed did you get, and what hardware did you use to run M2UGen? @MinnieUoL