open-mmlab / mmskeleton

An OpenMMLab toolbox for human pose estimation, skeleton-based action recognition, and action synthesis.
Apache License 2.0

Instructions on applying st-gcn on real-time camera feed #177

Open MengXinChengXuYuan opened 5 years ago

MengXinChengXuYuan commented 5 years ago

In fact, you can find an issue very similar to this one; check #110.

lianuo suggested simply feeding ST-GCN a (3, T, 18, 1) tensor of frames, with T = 20 in his case. I think it's a reasonable approach, but if you directly use the pre-trained model, or even train on your own dataset without modifying the structure of the ST-GCN network, the results will be quite terrible!

Look deep into the structure! If you keep the temporal kernel size equal to 9 (which means the temporal padding will be 4) and pool temporally 4 times, the temporal receptive field will be around 150 frames, just about the window size set as the default by the author. SO, if you keep the default structure and feed the network every 20 frames, there will be too much zero information (20 frames of real skeleton data, padded with 8 zeros at each temporal convolution), which I think leads to a dumb result (yes, I tried it; the prediction just stays stuck on the same action, whatever frames you pack).

So if you want the results to be good, you should modify the structure so that it has a temporal receptive field of around 20 frames, with as little padding as possible (for me, I set the padding of all layers to 0), to avoid the zero information affecting the real skeleton information. (With the default parameters the window size is 150, which means 150 frames of real skeleton data with only 8 zeros of padding; I think that's not ideal, but it won't affect the result that much.)
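
To make the receptive-field arithmetic above concrete, here is a rough sketch that computes the temporal receptive field from a list of (kernel, stride) pairs. The layer lists are my own assumption about the default structure (ten blocks with temporal kernel 9, stride 2 twice), not code from this repo:

```python
# Sketch: temporal receptive field of a stack of 1 x T convolutions.
# The layer lists below are assumptions about the default / a trimmed ST-GCN.

def temporal_receptive_field(layers):
    """layers: list of (kernel_size, stride) along the time axis."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap widens the window
        jump *= stride             # striding dilates later layers' taps
    return rf

# Default-like config: 10 blocks, temporal kernel 9, stride 2 twice.
default = [(9, 1)] * 4 + [(9, 2)] + [(9, 1)] * 2 + [(9, 2)] + [(9, 1)] * 2
print(temporal_receptive_field(default))  # 153, i.e. roughly 150 frames

# A trimmed config whose receptive field is about 20 frames.
trimmed = [(5, 1)] * 5
print(temporal_receptive_field(trimmed))  # 21
```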

You should set random_choose to false, as your data should have an explicit start and end, and random_shift to false, or just make window_size equal to your frame count.
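
For reference, these are feeder arguments in the old st-gcn-style YAML configs. A sketch of the relevant part (the parameter names follow the st-gcn Feeder, but the surrounding config layout is an assumption and may differ in your version):

```yaml
# Sketch of the relevant feeder arguments -- not a complete config file.
train_feeder_args:
  random_choose: false   # keep the explicit start/end of each clip
  random_shift: false    # no random temporal shifting
  window_size: 20        # match your actual frame count
```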

Also, if you want to extend ST-GCN to real-time applications, you should add a "none action" class, or use GCN + RNN/LSTM + CTC to deal with this annoying "none action" problem.
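
For the "none action" case, the simplest stopgap I can think of (just a sketch under my own assumptions, not what the paper does) is to threshold the softmax confidence and fall back to "none action" when the model is unsure:

```python
import torch
import torch.nn.functional as F

def classify_or_none(logits, threshold=0.6):
    """Return the predicted class index, or -1 for "none action"
    when the softmax confidence is below the threshold."""
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    return pred.item() if conf.item() >= threshold else -1
```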

GOOD LUCK! Any discussion is welcome :p

And I would like to ask whether you have any interest in real-time action detection and recognition/classification? @yjxiong

zws510 commented 5 years ago

@MengXinChengXuYuan Hello, I have encountered this problem now: regardless of the type of input data, the predictions all fall into the same category. I tried 150 frames as input, but the problem still exists. I would be grateful if you could help me.

MengXinChengXuYuan commented 5 years ago

> @MengXinChengXuYuan Hello, I have encountered this problem now: regardless of the type of input data, the predictions all fall into the same category. I tried 150 frames as input, but the problem still exists. I would be grateful if you could help me.

You can just read what I wrote above, which should fully address your problem. But if you feed 150 frames as input and still get the same category every time, that is quite strange...

Are you using demo.py, or did you write your own demo? I wrote a demo in C++ embedding Python with a PyTorch backend, and I hit some problems caused by a train/forward mismatch; as soon as I corrected it, I got the right results using 150 frames (with the default structure). So are you sure there are no mismatches in your demo?

yjxiong commented 5 years ago

To just run this model in real-time mode, you can do the following:

  1. Keep a running buffer of the past 150~300 frames' pose estimation output.
  2. After every K frames, run the network on this buffer to generate one output. (K can be 10, 20, 30, 50, up to the buffer size)

However, the output category may not reflect the activity that is happening in the latest frames.
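
A minimal sketch of this buffered loop (the pose estimator and `model` are placeholders to plug your own into; the (1, 3, T, 18, 1) layout matches the ST-GCN input shape mentioned earlier in this thread):

```python
import collections
import numpy as np
import torch

BUFFER_SIZE = 150   # past frames to keep (150~300)
K = 20              # run the classifier every K frames

buffer = collections.deque(maxlen=BUFFER_SIZE)

def on_new_frame(frame_idx, pose):
    """pose: (3, 18) array of (x, y, score) for 18 joints, from your
    pose estimator (placeholder -- plug in your own)."""
    buffer.append(pose)
    if frame_idx % K != 0 or len(buffer) < BUFFER_SIZE:
        return None
    # Stack into the (N, C, T, V, M) layout ST-GCN expects: (1, 3, T, 18, 1)
    clip = np.stack(buffer, axis=1)[np.newaxis, ..., np.newaxis]
    with torch.no_grad():
        logits = model(torch.from_numpy(clip).float())  # `model`: your ST-GCN
    return logits.argmax(dim=-1).item()
```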

To build a genuine real-time classification model, you should consider using causal convolution in the temporal component of ST-GCN layers.

You are welcome to try building a temporal detection model on top of ST-GCN. Actually, this is a very straightforward approach to spatiotemporal action detection: detect action segments on each skeleton track.

MengXinChengXuYuan commented 5 years ago

> To just run this model in real-time mode, you can do the following:
>
> 1. Keep a running buffer of the past 150~300 frames' pose estimation output.
> 2. After every K frames, run the network on this buffer to generate one output. (K can be 10, 20, 30, 50, up to the buffer size)
>
> However, the output category may not reflect the activity that is happening in the latest frames.
>
> To build a genuine real-time classification model, you should consider using causal convolution in the temporal component of ST-GCN layers.
>
> You are welcome to try building a temporal detection model on top of ST-GCN. Actually, this is a very straightforward approach to spatiotemporal action detection: detect action segments on each skeleton track.

By "using causal convolution in the temporal component of ST-GCN layers", you mean causal convolution like an RNN/LSTM instead of the 1 * T convolution in the TCN block, right?

yjxiong commented 5 years ago

“Causal” means a filter at time step t can only see inputs that are before t. It is still a 1 * T temporal convolution, but its representation will be biased more towards the latest frames.

See sec 3.2 and 3.3 of https://arxiv.org/pdf/1803.01271.pdf.
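
For illustration, a hedged sketch of what such a causal 1 * T temporal convolution could look like in PyTorch; left-only padding keeps future frames out of view. The (N, C, T, V) layout follows ST-GCN, but this module is my own assumption, not code from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTemporalConv(nn.Module):
    """1 x T temporal convolution whose output at time t only sees
    frames <= t. Input layout follows ST-GCN: (N, C, T, V)."""
    def __init__(self, in_channels, out_channels, kernel_size=9, stride=1):
        super().__init__()
        self.left_pad = kernel_size - 1  # pad the past only, never the future
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(kernel_size, 1),
                              stride=(stride, 1))

    def forward(self, x):
        # F.pad order for (N, C, T, V) is (V_left, V_right, T_front, T_back)
        x = F.pad(x, (0, 0, self.left_pad, 0))
        return self.conv(x)

# e.g. CausalTemporalConv(64, 64)(torch.randn(1, 64, 150, 18)) -> (1, 64, 150, 18)
```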

MengXinChengXuYuan commented 5 years ago

> “Causal” means a filter at time step t can only see inputs that are before t. It is still a 1 * T temporal convolution, but its representation will be biased more towards the latest frames.
>
> See sec 3.2 and 3.3 of https://arxiv.org/pdf/1803.01271.pdf.

I see!

xiehaizheng commented 4 years ago

@MengXinChengXuYuan Hello, I am also interested in the real-time method, and I don't know whether you have completed this modification. Could I refer to your work?