gnns4hri / 3D_multi_pose_estimator


Skeleton matching network (GAT) doesn't learn anything on custom dataset. #1

Open mateuszk098 opened 13 hours ago

mateuszk098 commented 13 hours ago

Hello @ljmanso, @vangiel, and @pilarbachiller,

First of all, thank you for your work! I'm trying to test this approach on the AIST dataset, but I've encountered some issues while using train_skeleton_matching.py.

Here’s the situation: I prepared a custom dataset by parsing a subset of AIST, following a similar structure to what you used with ARP LAB. Since AIST provides several calibrated cameras, I selected six of them, mirroring your setup with ARP LAB. I then created a pickle file containing a TransformManager instance to handle the necessary transformations and added a new configuration to parameters.py. After that, I ran train_skeleton_matching.py, and the training process started successfully (so far, so good).
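For reference, this is roughly how I generated the transformations pickle (a minimal sketch: I'm assuming pytransform3d's TransformManager here, and the frame names and example extrinsics below are placeholders; the real rotations and translations come from the AIST calibration files):

import pickle
import numpy as np
from pytransform3d.transform_manager import TransformManager
from pytransform3d.transformations import transform_from

tm = TransformManager()
# Placeholder extrinsics; in practice each (R, t) pair comes from the AIST
# calibration of the corresponding camera (R is a 3x3 rotation, t a translation).
extrinsics = {
    'c01': (np.eye(3), np.array([0.0, 0.0, 0.0])),
    'c02': (np.eye(3), np.array([1.5, 0.0, 0.0])),
    # ... c04, c05, c06, c08
}
for camera, (R, t) in extrinsics.items():
    # transform_from builds the 4x4 homogeneous matrix for the camera pose.
    tm.add_transform(camera, 'root', transform_from(R=R, p=t))
with open('tm_aist_c01c02c04c05c06c08.pickle', 'wb') as f:
    pickle.dump(tm, f)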

However, the network doesn’t seem to learn anything. The loss remains constant across epochs, and training halts due to early stopping. This seems odd, since the network’s output should change over time even if the input were nonsensical. Stepping through the training loop, I noticed that outputs = torch.squeeze(model(feats.float(), subgraph)) is always a vector of ones, which explains why the loss doesn’t change.
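For reference, this is the check I added inside the training loop to confirm it (just a fragment; model, feats, and subgraph are the variables from train_skeleton_matching.py mentioned above):

outputs = torch.squeeze(model(feats.float(), subgraph))
# Every value is pinned at ~1.0, i.e. the final sigmoid is saturated,
# so gradients vanish and the loss stays flat across epochs.
print(outputs.min().item(), outputs.max().item(), outputs.mean().item())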

I cross-checked this with the ARP LAB dataset, and everything worked as expected: the loss decreased with each epoch. Could you help me identify what might be going wrong? For more context, I’ve attached an example .json file along with the configuration I’m using. Each .json contains a list of dictionaries describing one person seen from the six cameras used.

Thank you for your help!

parameters = TrackerParameters(
    image_width=1920,
    image_height=1080,
    cameras=[0, 1, 2, 3, 4, 5],
    camera_names=['c01', 'c02', 'c04', 'c05', 'c06', 'c08'],
    fx=[1483.9361119758073, 1561.166571954539, 1604.3931815081523, 1386.5776557502095, 1667.3561978839223, 1565.408341712705],
    fy=[1483.9361119758073, 1561.166571954539, 1604.3931815081523, 1386.5776557502095, 1667.3561978839223, 1565.408341712705],
    cx=[960.0, 960.0, 960.0, 960.0, 960.0, 960.0],
    cy=[540.0, 540.0, 540.0, 540.0, 540.0, 540.0],
    kd0=[0.1373395977080808, 0.16156897795231054, -0.09945789992662554, -0.15615748075841868, -0.052786711044341406, 0.2414519945785073],
    kd1=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    kd2=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    p1=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    p2=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    joint_list=JOINT_LIST,
    numbers_per_joint=14,
    numbers_per_joint_for_loss=4,
    transformations_path='../tm_aist_c01c02c04c05c06c08.pickle',
    used_cameras=['c01', 'c02', 'c04', 'c05', 'c06', 'c08'],
    used_cameras_skeleton_matching=['c01', 'c02', 'c04', 'c05', 'c06', 'c08'],
    used_joints=[0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
    min_number_of_views=2,
    format=FORMAT,
    graph_alternative='3',
    # For drawing the skeletons: each tuple is (coordinate index, axis direction).
    axes_3D={'X': (0, 1.), 'Y': (1, 1.), 'Z': (2, -1.)},
)

gLH_sBM_cAll_d16_mLH0_ch01.json

pilarbachiller commented 10 hours ago

Thank you very much for your message. The JSON file you sent looks correct. Could you share the JSON files you are using for training (at least 3 or 4) and the pickle file with the transformations? That way, I can inspect the graphs that are being generated for training to try to find the issue.

Thank you!

mateuszk098 commented 8 hours ago

@pilarbachiller Thanks for your quick response! I managed to solve the issue: it was related to the translation vectors in the TransformManager instance. As far as I can tell, you used meters (?) as the unit for these vectors, but in the AIST dataset the translation vectors are given in centimeters. This mismatch produced very large values in the network’s input and output, which the sigmoid activation function mapped to 0s or 1s. After scaling the translations to meters, everything works perfectly.
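For completeness, the fix boils down to a single conversion when each transform is added (a sketch reusing the names from the snippet in my first message; t is the AIST translation vector in centimeters):

t_m = np.asarray(t) / 100.0  # AIST extrinsics are in centimeters; scale to meters
tm.add_transform(camera, 'root', transform_from(R=R, p=t_m))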

That said, I still have a few minor questions about your approach, as there are some aspects I don’t fully understand:

  1. Skeleton Keypoint Values: When constructing the dataset, each skeleton keypoint is represented by five values, such as [0, 867.0, 477.0, 1, 1]. I understand the first value is the keypoint index, and the second and third are the keypoint coordinates. However, what is the purpose of the two 1s at the end? I haven’t found any examples where these values differ from 1. Is this related to the network architecture or something else?
  2. Timestamps and Camera Names: During dataset construction, I noticed that apart from keypoints, there are also a timestamp and a string with the "live_" prefix followed by the camera name. What is the purpose of these two fields?
  3. freqs in MergedMultipleHumanDataset: In the process_training() method there is the expression freqs = [0 for _ in range(16)]. When I include more than 16 .json files for training, this causes an index error at freqs[len(views_to_add)] += 1. Could you clarify the role of the freqs variable, and is it possible to adjust its length? For instance, if I want to include more than 16 .json files (e.g., many lightweight files of 1–5 MB each), can this list simply be extended without side effects? (See the sketch after this list for the kind of change I have in mind.)
  4. JSON File Counts in Training: The README mentions: "The lists of JSON files specified for each option should contain more than 1 file. The number of files in the training set determines the number of people the model will learn to distinguish." Does this mean that if I provide five .json files representing sequences from five different individuals, the skeleton matching network will learn to distinguish five people? And what if I want to distinguish between, say, 20 people? This ties into my third question about the fixed-length freqs list.
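For concreteness regarding question 3, this is the kind of change I have in mind (a sketch only; num_training_files is a hypothetical name for the number of .json files passed to the dataset, and whether this is safe is exactly what I'm asking):

# Instead of the fixed-size list in process_training():
#     freqs = [0 for _ in range(16)]
# size it from the number of training files, so that
# freqs[len(views_to_add)] += 1 cannot go out of range:
freqs = [0 for _ in range(num_training_files + 1)]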

Questions 3 and 4 are the most important to me. Thank you for your time and support!