open-mmlab / mmskeleton

An OpenMMLab toolbox for human pose estimation, skeleton-based action recognition, and action synthesis.
Apache License 2.0

question about input data format / using different pose extractors #338

Open ronjamoller opened 4 years ago

ronjamoller commented 4 years ago

Hi, I have just started looking into geometric learning, and as a first step I want to get the network running in my environment. My issue is that I am not using the joints from OpenPose, so my "input" is formatted in a different way. I am specifically talking about N, C, T, V, M = x.size() in forward() and extract_feature(). Going from the paper "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition", I am guessing that N is the number of joints, C is the number of channels of the feature (2 for 2D joint positions), and T is time, as in the number of frames being processed. For V and M I am at a loss, and now I'm stuck because I can't convert my own pose coordinates into the proper format. I would appreciate any help. I tried installing OpenPose just to explore the data format more, but after endless conflicts caused by Anaconda and CUDA mismatches I gave up.

tl;dr: what are N, C, T, V, M = x.size() of the pose data?

frankier commented 4 years ago

I can share my own findings (based on guess/detective work -- not fully tested yet)

If you want to feed stuff into the data pipeline, which goes like this (see also the yaml config files):

normalize_by_resolution,
mask_by_visibility,
*augmentation steps*,
transpose, order=[0, 2, 1, 3],
to_tuple,

then the data layout is (channels, keypoints, frame_num_aka_time, person_id), but you need to pass a dictionary like so:

        return {
            "info": {
                "resolution": [640, 480], # or whatever it is
                "keypoint_channels": ["x", "y", "score"],
            },
            "data": data,
            "category_id": output_class,
        }
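Putting that together, here is a minimal sketch of building such a dictionary from raw detections. The raw shape, the transpose, and the class label are my assumptions; only the target layout (channels, keypoints, frames, people) and the dictionary keys come from the description above.

```python
import numpy as np

# Hypothetical raw detections: T frames, V keypoints, M people, each
# keypoint as (x, y, score) in pixel coordinates -- shapes are assumptions.
T, V, M = 300, 18, 2
raw = np.zeros((T, V, M, 3), dtype=np.float32)  # stand-in for real poses

# Rearrange to (channels, keypoints, frames, people) as described above.
data = raw.transpose(3, 1, 0, 2)

sample = {
    "info": {
        "resolution": [640, 480],  # or whatever it is
        "keypoint_channels": ["x", "y", "score"],
    },
    "data": data,
    "category_id": 0,  # hypothetical class label
}
assert sample["data"].shape == (3, 18, 300, 2)
```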

If you want to feed stuff directly into the model, then it's:

(id_within_minibatch, channels, frame_num_aka_time, keypoints, person_id)

But you probably need to do some kind of normalisation beforehand. Unscaled pixel values probably won't work very well.
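For illustration, a minimal sketch of that direct-input shape, assuming a single clip with (x, y, score) channels; random data stands in for real, already-normalised poses, and the model call is left as a comment since no model is constructed here:

```python
import torch

# One hypothetical clip: C channels, T frames, V joints, M people.
C, T, V, M = 3, 300, 18, 1
clip = torch.randn(C, T, V, M)         # stand-in for real, normalised poses
x = clip.unsqueeze(0)                  # add the minibatch dimension N
assert x.size() == (1, 3, 300, 18, 1)  # (N, C, T, V, M)
# out = model(x)  # model being an ST-GCN instance, not built in this sketch
```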

ronjamoller commented 4 years ago

Hi, thanks a ton for the reply. I think ideally I would want to feed the data directly into my model. I have integrated st_gcn_aaai18.py with the relevant utils into my pipeline (it's reinforcement learning, so I have to integrate it into my setup), and now I would like to convert my pose data (2D screen positions with "confidence", since it's simulation based) into a format the network can use. The scaling is not a problem; I am, however, unsure about the masking/confidence mechanism and what exactly each entry is. Going from the paper, in N, C, T, V, M = x.size(), N is the number of joints, C the feature dim (which, going from your explanation, contains an extra channel for detection confidence, if I understand correctly), and T is the time, so I guess V and M are ids for minibatches and persons like you mentioned above? Can you tell me which yaml file you meant in your second sentence? Thanks a lot again.

PS: If it's not too much trouble, could you just paste what a print(x) and a print(x.size()) in st_gcn_aaai18.ST_GCN_18.forward() would look like? I tried to install mmskeleton on my laptop, but doing so destroyed my CUDA setup for other experiments even though it was in a conda environment, and I could not get the nms component to run, again because of CUDA I'm guessing. If only there was a single comment in the code explaining what the letters mean :D

ronjamoller commented 4 years ago

Never mind about the output, the 5th time was the charm for the installation :)

frankier commented 4 years ago

I meant the dimensions are in the order given:

(id_within_minibatch, channels, frame_num_aka_time, keypoints, person_id)

So:

N = id_within_minibatch (hint: use a DataLoader to make minibatches in the 1st dimension)
C = channels, (x, y, score) OR (x, y) -- has to match num_channels
T = frame_num_aka_time
V = keypoint/joint (probably stands for vertex)
M = person ID (for when there are multiple people within a frame, I would suppose)
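The DataLoader hint can be sketched like this, assuming 100 hypothetical clips already laid out as (C, T, V, M) with made-up labels; the loader then stacks clips along the first dimension to produce N:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 100 hypothetical clips, each (C, T, V, M) = (3, 300, 18, 2), plus labels.
clips = torch.randn(100, 3, 300, 18, 2)
labels = torch.randint(0, 400, (100,))

loader = DataLoader(TensorDataset(clips, labels), batch_size=64)
x, y = next(iter(loader))
N, C, T, V, M = x.size()  # N is the minibatch dimension added by the loader
assert (N, C, T, V, M) == (64, 3, 300, 18, 2)
```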

By the way, I have been passing just (x, y) without score, since I'm working with images + OpenPose and I think the score might be rather dependent upon camera setup/resolution, so I would prefer to sacrifice in-domain accuracy for generalisation. It's up to you whether you include score or not.
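If you do drop the score, a one-line sketch (assuming the pipeline layout (channels, keypoints, frames, people) from above; the array here is random stand-in data):

```python
import numpy as np

# Keep only the first two channels (x, y) of a (C, V, T, M) array.
data = np.random.rand(3, 18, 300, 2).astype(np.float32)
xy_only = data[:2]  # -> (2, 18, 300, 2); num_channels must then be 2
assert xy_only.shape == (2, 18, 300, 2)
```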

I would help with the pasting, but my code isn't working at the moment.

Here's one of the yaml files used:

https://github.com/open-mmlab/mmskeleton/blob/master/configs/recognition/st_gcn/kinetics-skeleton-from-openpose.yaml

ronjamoller commented 4 years ago

Hi, I just noticed my error with N as well -- thanks for reiterating. I can't use the DataLoader, unfortunately, because of my RL setup, so right now I'm trying to figure out how the normalization was done (working theory: subtract half of width/height and then divide by width/height, since the values are distributed between -0.5 and 0.5). As for the score, I was just setting the visible joints to confidence 1, but maybe your idea is better.

In case anyone is working on a similar issue and is interested: x.size() = [64, 3, 300, 18, 2], minibatch size = 64, channels (x, y, score) = 3, T is 2 times the temporal window size, which is 150 as per the config, and V is 18 for Kinetics as per the paper. For the last dimension I'm not quite sure; the only thing I can say is that the second "person" seems to be missing quite often during training. This is the output of print(x[0, :, 150, :, 0]), i.e. one middle frame of the first sample in the minibatch for the first person:

tensor([[-0.0380,  0.0420, -0.1280, -0.2930,  0.0000,  0.2120,  0.3430,  0.0000,
         -0.0210,  0.0000,  0.0000,  0.1160,  0.0000,  0.0000, -0.0480,  0.0160,
          0.0000,  0.1260],
        [-0.1630,  0.0050,  0.0110,  0.4430,  0.0000, -0.0220,  0.4160,  0.0000,
          0.4920,  0.0000,  0.0000,  0.4920,  0.0000,  0.0000, -0.2530, -0.2230,
          0.0000, -0.2090],
        [ 0.7610,  0.5250,  0.4730,  0.2970,  0.0000,  0.4380,  0.3610,  0.0000,
          0.0650,  0.0000,  0.0000,  0.0530,  0.0000,  0.0000,  0.8300,  0.7860,
          0.0000,  0.8090]], device='cuda:0')
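The normalization working theory above can be sketched like this; the resolution values and the helper name are just examples, not anything from mmskeleton itself:

```python
import numpy as np

# Working theory: subtract half the resolution, then divide by it, so
# pixel coordinates land in [-0.5, 0.5] with the image centre at (0, 0).
def normalize_xy(xy, width=640, height=480):
    res = np.array([width, height], dtype=np.float32)
    return (xy - res / 2) / res

corners = np.array([[0.0, 0.0], [640.0, 480.0], [320.0, 240.0]])
out = normalize_xy(corners)
assert out.min() == -0.5 and out.max() == 0.5
assert np.allclose(out[2], [0.0, 0.0])  # centre pixel maps to the origin
```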

Good luck with your application and thanks for your help !