h-y1heng / StableMoFusion

MIT License

Why can your model's performance exceed the real? #6

Open Silverster98 opened 1 month ago

Silverster98 commented 1 month ago

Why can your model's performance exceed the real (about 4% in R-Precision)? By the way, I noticed that you don't report the MM-Dist like other works. How about the MM-Dist score in your model?
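For reference, MM-Dist is usually reported as the average distance between each text feature and the feature of the motion generated from that text, both extracted with the pretrained evaluator. A minimal sketch of that definition, with illustrative variable names rather than code from this repository:

```python
# Minimal sketch of MM-Dist (multimodal distance), assuming text_emb and
# motion_emb are (N, D) feature arrays from the pretrained evaluator.
# Variable names are illustrative, not from the StableMoFusion code.
import numpy as np

def mm_dist(text_emb: np.ndarray, motion_emb: np.ndarray) -> float:
    """Mean Euclidean distance between each text feature and the feature
    of the motion generated from that text."""
    dists = np.linalg.norm(text_emb - motion_emb, axis=1)  # (N,)
    return float(dists.mean())
```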

Dai-Wenxun commented 2 weeks ago

I think that's because of the random seed. When I was testing my MotionLCM, I found that the performance of the real data can exceed the officially reported results by a small margin. The "real" number actually fluctuates.

Another important reason is that the evaluator is too weak, hahaha! The community needs new and more robust metrics.
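A rough way to see this fluctuation is to re-run the evaluation with several random seeds and look at the spread of the "real" scores. The sketch below assumes a hypothetical `evaluate_r_precision(seed)` wrapper around the evaluation routine; it is not code from this repository.

```python
# Minimal sketch: estimate how much the "real" R-Precision moves with the seed.
# `evaluate_r_precision` is a hypothetical stand-in for the actual evaluation
# routine; only the shuffling/sampling seed changes between runs.
import numpy as np

def seed_spread(evaluate_r_precision, seeds=(0, 1, 2, 3, 4)):
    scores = np.array([evaluate_r_precision(seed=s) for s in seeds])
    # With only a handful of runs this is just a rough indication of the
    # fluctuation described above, not a rigorous confidence interval.
    return scores.mean(), scores.std(ddof=1)
```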

h-y1heng commented 1 week ago

> Why can your model's performance exceed the real (about 4% in R-Precision)? By the way, I noticed that you don't report the MM-Dist like other works. How about the MM-Dist score in your model?

I'm sorry that I accidentally overlooked your question for so long; I've been busy with other coursework during this period. I'm very, very sorry!

I believe there are two main reasons for this observation. First, as @Dai-Wenxun mentioned, the R-Precision metric may not be robust enough. The metric only performs similarity matching within batches of 32 samples. However, the HumanML3D dataset contains many similar or even identical texts, some detailed and some rough, so how samples are allocated to batches becomes a critical factor: different batch allocations can lead to different results. As the motion generation field has evolved over the past few years, I personally think this metric no longer precisely assesses the fine-grained matching between motion and text, so this result does not necessarily mean that our generated data is semantically superior to the real data.
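To make the batching point concrete, here is a minimal sketch of how a batched R-Precision Top-k is commonly computed for HumanML3D-style evaluation: samples are shuffled, split into groups of 32, and each text is matched against the 32 motions in its group, so which texts land in the same group directly affects the score. Function and variable names are illustrative, not taken from this repository.

```python
# Minimal sketch of batched R-Precision Top-k, assuming pre-extracted (N, D)
# text and motion features from the evaluator. Names are illustrative only.
import numpy as np

def r_precision_topk(text_emb, motion_emb, top_k=3, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(text_emb))  # batch allocation depends on this shuffle
    correct, total = 0, 0
    for start in range(0, len(order) - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        t, m = text_emb[idx], motion_emb[idx]                          # (32, D) each
        dist = np.linalg.norm(t[:, None, :] - m[None, :, :], axis=-1)  # (32, 32)
        ranks = dist.argsort(axis=1)  # motions sorted by distance for each text
        # The matched motion shares the row index; count it if it ranks in the top k.
        correct += sum(i in ranks[i, :top_k] for i in range(batch_size))
        total += batch_size
    return correct / total
```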

Second, the HumanML3D dataset itself contains noisy data. For instance, some motions are not suited to the mirroring augmentation used during data preprocessing, which leads to anomalous annotations.

These are my current views.

LinghaoChan commented 1 week ago

@h-y1heng Hi yiheng. In my experience, the R-P is evaluated by a pre-trained model. We cannot prevent the bias caused by deep models. BTW, I discussed similar issues and other questions with you via email. Could you please have a look~ (^_^)

h-y1heng commented 1 week ago

> @h-y1heng Hi yiheng. In my experience, the R-P is evaluated by a pre-trained model. We cannot prevent the bias caused by deep models. BTW, I discussed similar issues and other questions with you via email. Could you please have a look~ (^_^)

@LinghaoChan, thank you for your insights regarding the R-Precision evaluation. However, I haven't received your email regarding this and other questions. Could you please resend it to hyh654@bupt.edu.cn? I'm looking forward to our discussion. Thank you!

LinghaoChan commented 1 week ago

> @h-y1heng Hi yiheng. In my experience, the R-P is evaluated by a pre-trained model. We cannot prevent the bias caused by deep models. BTW, I discussed similar issues and other questions with you via email. Could you please have a look~ (^_^)

> @LinghaoChan, thank you for your insights regarding the R-Precision evaluation. However, I haven't received your email regarding this and other questions. Could you please resend it to hyh654@bupt.edu.cn? I'm looking forward to our discussion. Thank you!

@h-y1heng sent~