tr3e / InterGen

[IJCV 2024] InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions
https://tr3e.github.io/intergen-page/

[duplicate prompts found] Clarification on Prompt Diversity and Action Generation #45

Open bring-nirachornkul opened 18 hours ago

bring-nirachornkul commented 18 hours ago

Dear InterGen team,

I have been reviewing the dataset for the InterGen project and noticed that many prompts for specific actions, such as "dancing," are nearly identical, differing only slightly in wording. Below are 51 examples related to dancing from the dataset, followed by similar groups for other actions:

Dancing : 51 sequences

5088 - two people are dancing together.
5320 - two people practice dancing together.
5326 - two individuals are dancing together.
5382 - two individuals are dancing together.
5416 - two people are dancing together.
5504 - two people are dancing together.
5559 - two people are dancing together.
5587 - two people are dancing together.
5716 - the two persons are dancing together.
5782 - two people are dancing together.
5825 - two individuals are dancing together.
5917 - two people are breakdancing.
5926 - two people are dancing together.
5941 - two individuals are dancing together gracefully.
5997 - two individuals are dancing together.
6011 - two persons are dancing.
6035 - the two individuals are dancing together.
6043 - two people are dancing in pairs.
6077 - two individuals are dancing separately.
6096 - two people are dancing together.
6145 - two individuals are dancing together.
6159 - two individuals are dancing together.
6232 - two people are dancing together.
6237 - two individuals are dancing together.
6247 - two people are dancing together.
6286 - the two individuals are dancing together.
6299 - two persons are dancing together.
6311 - the two people are dancing together.
6401 - two people are dancing together.
6409 - they are dancing together.
6420 - the two persons are dancing together.
6436 - two people are dancing together.
6466 - two people are dancing together.
6478 - two individuals are dancing together.
6495 - two people are dancing together.
6506 - two persons are dancing together.
6533 - the two individuals are dancing together.
6544 - two people are dancing together.
6568 - the two individuals are dancing together.
6587 - the two individuals are dancing together.
6596 - two individuals are dancing together.
6619 - two individuals are dancing together.
6629 - the two persons are dancing.
6671 - two people are dancing gracefully.
6739 - two persons are dancing together.
6744 - two persons are dancing together.
6867 - the two people are dancing together.
6870 - two people are dancing together.
6877 - two people are dancing.
6939 - two people are dancing a ballroom dance together.
6944 - the two individuals are dancing together.

taichi : 17 sequences

2851 - two individuals are practicing tai chi together.
2855 - two individuals are practicing tai chi.
2863 - two people are practicing tai chi.
2867 - two individuals are practicing tai chi.
2913 - two individuals are practicing tai chi.
2918 - two people practicing tai chi.
2922 - two people are practicing tai chi.
2929 - two people are practicing tai chi.
2956 - two persons are practicing tai chi.
2963 - two people are practicing tai chi.
2967 - two individuals are practicing tai chi.
2986 - two individuals are practicing tai chi.
3683 - two people are practicing tai chi together.
3771 - two people are practicing tai chi.
4479 - two people are practicing tai chi.
4952 - two individuals are practicing tai chi.
7059 - two people practicing tai chi together.

sparring : 28 sequences

562 - two people are sparring in taekwondo, exchanging kicks with one another.
635 - the two are sparring in taekwondo.
1399 - the two are sparring in taekwondo, exchanging kicks and strikes.
1716 - two performers are sparring in the ring, throwing punches at one another.
3017 - two persons are sparring using fists.
3030 - two individuals are sparring with each other.
3055 - two persons are sparring with each other.
3057 - two individuals are sparring with each other.
3059 - two individuals are sparring with each other.
3137 - the two people are sparring with martial arts techniques.
3246 - two individuals are sparring with each other.
3249 - two individuals are sparring against each other.
3253 - two individuals are sparring with each other.
3256 - two individuals are sparring with each other.
3258 - two individuals are sparring with each other.
3260 - two individuals are sparring with each other.
3591 - two individuals are sparring with each other.
3593 - two people are sparring against each other.
3595 - two persons are sparring with each other.
3597 - the two people are sparring in martial arts.
3673 - two people are sparring with each other.
3675 - two individuals are sparring with each other.
3677 - two individuals are sparring each other.
3679 - two people are sparring against each other.
3681 - two people are sparring against each other.
3855 - two individuals are sparring with each other.
3857 - the two people are sparring in martial arts.
3859 - two individuals are sparring with each other.

rock-paper-scissors : 4 sequences

2753 - two individuals are playing a game of rock-paper-scissors.
2756 - two individuals are playing a game of rock-paper-scissors.
2759 - two people are playing a game of rock-paper-scissors.
3381 - the two people are playing rock-paper-scissors.

Given this level of similarity, could you clarify how the model is expected to generate distinct and meaningful actions based on such closely related prompts? Additionally, do these similar tokenized inputs limit the diversity of generated actions, and if so, how does the system address this?
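For reference, here is a minimal sketch of how the overlap could be measured. It uses a handful of the captions listed above; the normalization rule (collapsing subject phrases such as "two people" or "the two individuals" into one placeholder) is my own assumption for illustration, not something taken from the InterGen code base.

```python
import re
from collections import Counter

# A few captions copied from the dancing list above, keyed by sequence ID.
captions = {
    "5088": "two people are dancing together.",
    "5326": "two individuals are dancing together.",
    "5716": "the two persons are dancing together.",
    "5917": "two people are breakdancing.",
    "5941": "two individuals are dancing together gracefully.",
    "6409": "they are dancing together.",
}

# Assumed rule: subject phrases that differ only in wording count as the same.
SUBJECT = re.compile(
    r"^(the )?(two )?(people|persons|individuals|performers)\b|^they\b"
)

def normalize(caption):
    """Lowercase, drop the final period, and replace the subject phrase."""
    text = caption.lower().strip().rstrip(".")
    text = SUBJECT.sub("SUBJ", text, count=1)
    return re.sub(r"\s+", " ", text)

# Count how many captions collapse to the same normalized form.
groups = Counter(normalize(c) for c in captions.values())
duplicate_ratio = 1 - len(groups) / len(captions)

for text, count in groups.most_common():
    print(count, text)
print("share of captions that duplicate another:", duplicate_ratio)
```

On this small sample, four of the six captions reduce to the same normalized sentence, which is the pattern I am asking about.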

I appreciate your time in clarifying this matter.

Best regards,

Phongsiri

tr3e commented 13 hours ago

Dear Phongsiri,

Thank you for your thoughtful review and for bringing up the concerns regarding the similarity in the dataset prompts, specifically related to the action "dancing". Your observations are insightful.

To address your concerns:

The dataset intentionally includes prompts with minimal variations to test the model's sensitivity to subtle linguistic cues that might influence the generated motions. For instance, terms like "gracefully" or "breakdancing" indicate different styles and energies of dancing, which are meant to prompt slight variations in the generated motions. This is part of our effort to refine the model's ability to discern and react to nuanced differences in human interaction descriptions.

Although the textual annotations are similar (since the semantic category of these motions is "dance"), each captured motion is unique, so the similar text does not limit the diversity but rather enhances it. A diffusion model is inherently capable of modeling such diversity effectively. Hence, similar annotations are not a problem in this context but an opportunity to refine the model's ability to generate nuanced variations of similar actions.
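As a toy illustration of this point (not InterGen's actual model or sampler), the sketch below pairs one caption with three different "motions", each reduced to a single number. The denoiser is the closed-form ideal denoiser for that tiny dataset, and the sampler is a plain Euler integration over a decreasing noise schedule. All names, values, and the schedule are assumptions chosen for illustration.

```python
import math
import random

# Toy training set: one caption shared by three distinct "motions",
# each reduced to a single number (a hypothetical stand-in for a motion clip).
CAPTION = "two people are dancing together."
TRAINING = {CAPTION: [-2.0, 0.0, 2.0]}

def denoise(x, sigma, motions):
    """Ideal denoiser E[x0 | x] for x0 drawn uniformly from `motions`
    and observed as x = x0 + sigma * noise."""
    logits = [-((x - m) ** 2) / (2.0 * sigma ** 2) for m in motions]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    return sum(w * m for w, m in zip(weights, motions)) / sum(weights)

def sample(caption, seed, sigma_max=5.0, sigma_min=0.05, steps=40):
    """Deterministic Euler sampler: only the initial noise depends on the seed."""
    motions = TRAINING[caption]
    x = random.Random(seed).gauss(0.0, sigma_max)
    # Geometric noise schedule from sigma_max down to sigma_min, then 0.
    ratio = (sigma_min / sigma_max) ** (1.0 / (steps - 1))
    sigmas = [sigma_max * ratio ** k for k in range(steps)] + [0.0]
    for cur, nxt in zip(sigmas, sigmas[1:]):
        slope = (x - denoise(x, cur, motions)) / cur
        x += (nxt - cur) * slope
    return x

# Same caption every time; only the seed of the initial noise changes.
samples = [sample(CAPTION, seed) for seed in range(20)]
nearest = {min(TRAINING[CAPTION], key=lambda m: abs(m - s)) for s in samples}
print(sorted(nearest))
```

The caption is identical for every call, yet different initial noise leads the sampler to different training motions, which is the sense in which identical text does not force identical output.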

Thank you once again for your interest in our work and the detailed review.

Best regards, Han

bring-nirachornkul commented 12 hours ago

Dear Han,

Thank you for your earlier response. While I appreciate the augmentation methods mentioned in the paper, they appear to be tied primarily to the evaluation process rather than addressing redundancy in the raw dataset.

While training for over 20,000 epochs, I noticed some concerning patterns:

This raises two key questions:

  1. Do the reported 7,779 sequences include repeated motions with slightly different descriptions?
  2. Were blank sequences, like those above, considered in the dataset statistics, and if so, how were they addressed during training?

These issues impact the diversity and usability of the dataset during real-world training. I’d greatly appreciate clarification on how these potential redundancies and anomalies are handled in the dataset preparation and evaluation stages.

Best regards,

Phongsiri

tr3e commented 8 hours ago

Hi, to address your questions:

  1. No, each sequence represents a unique motion. The descriptions may appear similar because all sequences fall under a broad category and are sub-segments of recorded long-term motions within this category. Consequently, the annotators may use similar language to describe them due to their semantic similarities.

  2. The sequences you mentioned are transition motions, not blank ones. In our experiments, we retained these samples.