If pose_matching is set to False, does that mean only the spatial graph convolutions are applied, frame by frame, with no temporal connections? In that case, are the features from each frame simply stacked together? How would the temporal information be used? Or am I missing something subtle here? It seems odd that I get better performance with it set to False, considering that temporal information is important in action classification.
Also, is the kernel size, which is set to 9, related to the number of frames (temporal dimension)?
No. If it is set to False, we do not perform tracking; we simply assume that the skeleton data of the persons in each frame are sorted by their IDs when forming the sequences.
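To make the difference concrete, here is a rough sketch in plain Python. It is not the repository's code: the function names (build_sequences, match_to_previous, pose_distance), the data layout, and the greedy nearest-pose matching are all assumptions made for illustration. It only shows that each person slot is still a sequence over time in both modes; the flag changes which detection goes into which slot.

```python
# Hypothetical sketch, not the repository's implementation.
# frames[t] is a list of skeletons; a skeleton is a list of (x, y) joints.
# Assumes every frame contains the same number of skeletons.

def pose_distance(a, b):
    """Sum of squared joint distances between two skeletons."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b))

def match_to_previous(previous, current):
    """Greedily reorder `current` so slot m holds the pose closest to previous[m]."""
    remaining = list(current)
    ordered = []
    for prev in previous:
        best = min(remaining, key=lambda s: pose_distance(prev, s))
        remaining.remove(best)
        ordered.append(best)
    return ordered

def build_sequences(frames, pose_matching=False):
    """Return sequences[m][t]: the skeleton in person slot m at frame t."""
    num_slots = len(frames[0])
    sequences = [[] for _ in range(num_slots)]
    for t, frame in enumerate(frames):
        if pose_matching and t > 0:
            # Tracking: reorder detections to follow the previous frame.
            frame = match_to_previous([seq[-1] for seq in sequences], frame)
        # Without matching, the per-frame order is trusted as-is.
        for m in range(num_slots):
            sequences[m].append(frame[m])
    return sequences

# Two people with one joint each; the per-frame order flips in frame 1.
frames = [
    [[(0, 0)], [(10, 10)]],
    [[(10, 11)], [(0, 1)]],
]
unmatched = build_sequences(frames, pose_matching=False)
matched = build_sequences(frames, pose_matching=True)
```

In this toy input, unmatched keeps the order given in each frame, so slot 0 jumps from one person to the other, while matched keeps slot 0 on the same person. Either way, every slot is a sequence across frames, so temporal processing still has something to operate on.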