Open bring-nirachornkul opened 18 hours ago
Dear Phongsiri,
Thank you for your thoughtful review and for raising concerns about the similarity of the dataset prompts, specifically for the action "dancing". Your observations are insightful.
To address your concerns:
The dataset intentionally includes prompts with minimal variations to test the model's sensitivity to subtle linguistic cues that might influence the generated motions. For instance, terms like "gracefully" or "breakdancing" indicate different styles and energies of dancing, which are meant to prompt slight variations in the generated motions. This is part of our effort to refine the model's ability to discern and react to nuanced differences in human interaction descriptions.
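To make this concrete, the effect of such cue words can be checked with a generic CLIP-style text encoder. The sketch below is a simplified illustration rather than our exact pipeline (the model choice and pooling are placeholders):

```python
# Minimal sketch: prompts differing only in a stylistic cue already map
# to distinguishable text embeddings. Model name and the use of
# pooler_output are illustrative assumptions, not the exact pipeline.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "two people are dancing gracefully",
    "two people are breakdancing",
]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = encoder(**inputs).pooler_output  # one embedding per prompt

sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")  # below 1.0: the cue words shift the conditioning
```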
Although the textual annotations are similar (since the semantic category of these motions is "dance"), each captured motion is unique, so the similar wording does not limit diversity; if anything, it enhances it. Moreover, a diffusion model is inherently capable of modeling such diversity effectively. Hence, similar annotations in this context are not a problem but an opportunity to refine the model's ability to generate nuanced variations of similar actions.
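As a minimal sketch of that last point: a conditional diffusion sampler draws fresh noise for every sample, so the same text condition yields distinct motions. The denoiser, step count, and motion shape below are hypothetical placeholders, not our model:

```python
import torch

def sample_motion(denoiser, text_emb, steps=50, motion_shape=(196, 262)):
    """One reverse-diffusion pass; shapes and schedule are placeholders."""
    x = torch.randn(motion_shape)      # fresh Gaussian noise on every call
    for t in reversed(range(steps)):
        x = denoiser(x, t, text_emb)   # one (schematic) denoising step
    return x

# Stand-in for a trained network, just to make the sketch runnable.
def dummy(x, t, emb):
    return x - 0.01 * x

emb = torch.zeros(512)
m1, m2 = sample_motion(dummy, emb), sample_motion(dummy, emb)
print(torch.allclose(m1, m2))  # False: same condition, different noise, different motion
```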
Thank you once again for your interest in our work and the detailed review.
Best regards, Han
Dear Han,
Thank you for your earlier response. While I appreciate the augmentation methods mentioned in the paper, they appear to be applied primarily during evaluation rather than addressing redundancy in the raw dataset.
While training for over 20,000 epochs, I noticed some concerning patterns in the annotations (the scan I used to surface these is sketched after the list):
4193 - transition
4385 - transition
4434 - transition
6028 - transition
6940 - transition
7220 - pass
7221 - pass
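For reference, the scan below assumes one annots/<id>.txt file per sequence, which may not match everyone's local layout:

```python
# Quick scan for sequences annotated only with generic labels.
# Directory layout (annots/<id>.txt, one file per sequence) is my guess.
from pathlib import Path

for txt in sorted(Path("annots").glob("*.txt")):
    lines = [l.strip().lower() for l in txt.read_text().splitlines() if l.strip()]
    if any(l in ("transition", "pass") for l in lines):
        print(txt.stem, "-", lines[0])
```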
This raises two key questions:
1. Do sequences that share an identical annotation, such as "transition", represent distinct motions, or are they duplicates of one another?
2. Are the "transition" sequences blank or otherwise unusable, and should they be filtered out before training?
These issues affect the diversity and usability of the dataset during real-world training. I'd greatly appreciate clarification on how these potential redundancies and anomalies are handled in the dataset preparation and evaluation stages.
Best regards,
Phongsiri
Hi, to address your questions:
1. No, each sequence represents a unique motion. The descriptions may appear similar because all sequences fall under a broad category and are sub-segments of recorded long-term motions within this category. Consequently, the annotators may use similar language to describe them due to their semantic similarities.
2. The sequences you mentioned are transition motions, not blank ones. In our experiments, we retained these samples.
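If you would rather exclude them in your own runs, a simple filter over the annotation files is enough. The sketch below assumes an annots/<id>.txt layout for a local copy of the data and is only illustrative:

```python
# Build a keep-list that drops sequences whose only annotation is
# "transition". We retained these samples in our experiments; this is
# only for users who want a stricter subset. Layout is an assumption.
from pathlib import Path

keep = []
for txt in sorted(Path("annots").glob("*.txt")):
    labels = {l.strip().lower() for l in txt.read_text().splitlines() if l.strip()}
    if labels != {"transition"}:
        keep.append(txt.stem)
print(f"kept {len(keep)} sequences")
```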
Dear InterGen team,
I have been reviewing the dataset for the InterGen project and noticed that many prompts for specific actions, such as "dancing", are highly similar, with minimal variations in wording. Below are per-action counts of sequences whose prompts are nearly identical (the grouping script I used follows the list):
dancing: 50 sequences
taichi: 17 sequences
sparring: 28 sequences
rock-paper-scissors: 4 sequences
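The grouping is a rough keyword match, so treat the counts as approximate; the annots/<id>.txt layout and the keyword list below are assumptions about my local copy of the data:

```python
# Rough keyword grouping used to produce the counts above; the annots/
# layout and the keyword list are assumptions about my local copy.
from collections import Counter
from pathlib import Path

keywords = ["dancing", "taichi", "sparring", "rock-paper-scissors"]
counts = Counter()
for txt in Path("annots").glob("*.txt"):
    text = txt.read_text().lower()
    for kw in keywords:
        if kw in text:
            counts[kw] += 1
for kw, n in counts.most_common():
    print(f"{kw}: {n} sequences")
```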
Given this level of similarity, could you clarify how the model is expected to generate distinct and meaningful actions based on such closely related prompts? Additionally, do these similar tokenized inputs limit the diversity of generated actions, and if so, how does the system address this?
I appreciate your time in clarifying this matter.
Best regards,
Phongsiri