I would like to request the index number of the data featured on your poster.

Hi, sweety.

We are currently attempting to reproduce your research, but we are facing some difficulties. We would like to try the experiment using the data you presented in your poster. Could you kindly provide the index number of the pair?

The specific challenge we are facing is as follows:

We attempted several modifications for pairID=6. The modifier we used is from /posescript/data/PoseFix/posefix_release/posefix_auto_135305.json, corresponding to pairID=6, which reads as follows: "Flatten both legs and bring your left arm, both your legs, and both your shoulders down slightly. Then bring your left elbow a bit to the front and swing your shoulders and your body forward. Your thighs need to be flat, lower your right hand a bit, bring it backward slightly, and ensure both hands are approximately shoulder-width apart, with them resting on the ground."

We simplified this modifier into several steps and visualized the process as follows. First, we applied the modifier to the initial pose, pose_A. Then, we fed the resulting pose back into the process and applied the next modifier. The step-by-step process is outlined below.

The ground truth B pose is as follows, and we can clearly see that it differs significantly from the final pose generated through the process.

When we applied the entire modifier without simplifying it, the results were as follows. We can still observe a noticeable difference from the ground truth B pose.

Below is the code used for the experiment. It is a slightly modified version of your code located at /posescript/src/text2pose/generative_B/demo_generative_B.py. Is there something wrong with my code?

##############################################################
## text2pose                                                ##
## Copyright (c) 2023                                       ##
## Institut de Robotica i Informatica Industrial, CSIC-UPC  ##
## and Naver Corporation                                    ##
## Licensed under the CC BY-NC-SA 4.0 license.              ##
## See project root for license details.                    ##
##############################################################

import argparse
import torch
import numpy as np

import text2pose.config as config
import text2pose.demo as demo
import text2pose.utils as utils
import text2pose.data as data
import text2pose.utils_visu as utils_visu
from text2pose.generative_B.evaluate_generative_B import load_model
import json
import os
from PIL import Image

parser = argparse.ArgumentParser(description='Parameters for the demo.')
parser.add_argument('--model_paths', nargs='+', type=str, help='Paths to the models to be compared.')
parser.add_argument('--checkpoint', default='best', choices=('best', 'last'), help="Checkpoint to choose if model path is incomplete.")
parser.add_argument('--n_generate', type=int, default=1, help="Number of poses to generate (number of samples); if considering only one model.")
args = parser.parse_args()

file_path = "/home/chanhun/posescript/data/PoseFix/posefix_release/pair_id_2_pose_ids.json"

with open(file_path, 'r') as file:
    pose_pairs = json.load(file)

file_path2 = "/home/chanhun/posescript/data/PoseFix/posefix_release/ids_2_dataset_sequence_and_frame_index_100k.json"

with open(file_path2,'r') as file:
    dataID_2_pose_info = json.load(file)

pair_ID = 6
pid_A = str(pose_pairs[pair_ID][0])
pid_B = str(pose_pairs[pair_ID][1])
pose_A_info = dataID_2_pose_info[pid_A]

ret = utils.get_pose_data_from_file(pose_A_info,
                                            applied_rotation=None,
                                            output_rotation=False)
pose_A_data = ret[0].reshape(-1, 3)
pose_A_data = pose_A_data.reshape(1,-1)

modifier = "Flatten both legs and bring your left arm, both your legs and both your shoulders down slightly then bring your left elbow a bit to the front and swing your shoulders and your body forward and your thighs need to be flat and lower your right hand a bit, bring it backward a little and both your hands must be approximately shoulder width apart, they should be down on the ground."

n_generate = args.n_generate

model, _, body_model = demo.setup_models(args.model_paths, args.checkpoint, load_model)

# last_step = np.load("/home/chanhun/posescript/generated_B/step_2_of_3.npy")
# last_step = torch.from_numpy(last_step)

last_step = None

# --- seed
torch.manual_seed(42)
np.random.seed(42)

output_dir = "/home/chanhun/posescript/generated_B"
os.makedirs(output_dir, exist_ok=True)

with torch.no_grad():
    if last_step is None:
        gen_pose_data_samples = model[0].sample_str_meanposes(pose_A_data, modifier)['pose_body'][0,...].view(n_generate, -1)
    else:
        gen_pose_data_samples = model[0].sample_str_meanposes(last_step, modifier)['pose_body'][0,...].view(n_generate, -1)

    gen_pose_data_np = gen_pose_data_samples.cpu().numpy()
    output_file = os.path.join(output_dir, "chanhun.npy")
    np.save(output_file, gen_pose_data_np)
    img = utils_visu.image_from_pose_data(gen_pose_data_samples, body_model, color='blue', add_ground_plane=False, two_views=0)
    img = np.array(img[0])
    img_pil = Image.fromarray(img)
    img_path = os.path.join(output_dir, "chanhun.png")
    img_pil.save(img_path)

Thank you for taking the time to read this long message.

Hello.

The test split indices / IDs of the pairs in the provided images:

0 / 8
15 / 79
32 / 168
38 / 198

Please note that you may obtain different qualitative results than in the paper, as the released model is a model retrained after the code release, and is not exactly the model used at writing time to generate the qualitative samples.

Disclaimer: I did not read the code. But perhaps, these few comments could help to (attempt to) explain the results:

The model is not perfect, and will probably not manage to generate a pose that really looks like the ground truth pose (especially since this pose B is pretty peculiar, at the leg level, compared to the rest of the dataset).
A model finetuned on human-written texts may underperform a model trained solely on automatic texts, on a automatic sample (like the one of the example). Reason: domain gap (human-written texts are shorter, and more straightforward).
Splitting the full modifying text into small pieces to make iterative modifications may induce a loss of information in the sense that the different pieces would not interact together anymore, and so lead to a fuller understanding of the required modification. For instance (assuming the model is already pretty good), having "move your shoulder forward" in one step, and "move you arm back" in another may lead to opposite modifications; while having both together would help to get a better idea of the desired modification (of what should really change).
I think it would be interesting & valuable to train a model on iterative, atomic changes, a bit as suggested in the presented experiment. One may have to mine relevant pose pairs for this (maybe by tweaking a few parameters in pose A to obtain pose B; but then comes the question of plausibility). The automatic comparative pipeline could help to get atomic, local descriptions, but perhaps it would have to be modified a bit to gain in precision/resolution too.
In further experiments, I found I was obtaining lower performance when training on modifying texts that have pieces of instruction like "A should be...; B must be...", deriving from the direct use of posecodes along with paircodes, in the automatic comparative pipeline. Because these sentences describe pose configurations that should not change between pose A and pose B, they basically represent noise from a modifying perspective, and they confuse the model. Even though for us human, they sound right.

naver / posescript

I would like to request the index number of the data featured on your poster. #27