shlizee / Predict-Cluster

Repository for PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition

About NTU datasets #9

Closed zywbupt closed 3 years ago

zywbupt commented 4 years ago

There are tr_path and te_path in the NTU preprocessing code; how can we get these files? I downloaded the NTU dataset, but I found that the files are named .skeleton. After processing the data, I get these files: train_data_joint.npy, train_label.pkl, val_data_joint.npy, val_label.pkl. Which one is needed?

DragonLiu1995 commented 4 years ago

Hi zywbupt, thank you for your question! We actually put the label and the action sequence together in a pickle file, and the train/test split follows the original evaluation protocol proposed in the NTU RGB+D paper; you can refer to their rules for splitting the training and test data, both cross-subject and cross-view. If I remember correctly, each .skeleton file has a sequence number containing the letter 'P', and the 3 digits following 'P' are the subject ID we refer to when splitting the data: subjects 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35, 38 are for training, and all others are for testing. Correct me if I'm wrong.
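
For illustration, a minimal sketch of that split, assuming the standard NTU file naming (e.g. S001C002P003R002A013.skeleton, where P003 means subject 3):

```python
import os

# Cross-subject training subject IDs from the NTU RGB+D evaluation protocol.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19,
                  25, 27, 28, 31, 34, 35, 38}

def is_training_sample(filename):
    """True if a .skeleton file belongs to the cross-subject training split."""
    name = os.path.basename(filename)
    p = name.index('P')                       # subject ID is the 3 digits after 'P'
    return int(name[p + 1:p + 4]) in TRAIN_SUBJECTS
```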

zywbupt commented 4 years ago

Thank you for your answer. But what is the format of the input sequence data; is it just the x, y, z values of each joint (in depth or RGB space)? For the generated train_data_joint.npy file, the format is (N, C, T, V, M).

fmthoker commented 4 years ago

@DragonLiu1995 Thanks for releasing the code. I was checking how you pre-process the NTU dataset, and it's not clear what the shape of each input sample should be in the following code:

```python
# Normalize Bones
for i in range(len(train_data)):
    train_data[i]['input'] = normalize_bone(np.array(train_data[i]['input']))
for i in range(len(test_data)):
    test_data[i]['input'] = normalize_bone(np.array(test_data[i]['input']))
```

As I understand it, each video in the NTU dataset is represented by a [P x F x J x C] array, where P = 2 persons (with one being all zeros in the case of a single-person action), F = number of frames, J = number of joints, and C = the 3D joint coordinates. Can you explain the shape of each train_data[i]['input'] that goes into the normalization function?

DragonLiu1995 commented 4 years ago

> Thank you for your answer. But what is the format of the input sequence data; is it just the x, y, z values of each joint (in depth or RGB space)? For the generated train_data_joint.npy file, the format is (N, C, T, V, M).

The input sequence is the x, y, z coordinates of each joint: 25 joints, each with (x, y, z).
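
For illustration, a minimal sketch of recovering that per-frame layout from a (N, C, T, V, M) array such as your train_data_joint.npy (C = 3 and V = 25 are assumed):

```python
import numpy as np

# Sketch: pull one person from a (N, C, T, V, M) array and flatten each
# frame's 25 joints x (x, y, z) into a 75-dim vector, giving (T, 75).
data = np.load('train_data_joint.npy')        # (N, 3, T, 25, M)
n, m = 0, 0                                   # sample and person indices
seq = data[n, :, :, :, m].transpose(1, 2, 0)  # (T, 25, 3)
seq = seq.reshape(seq.shape[0], -1)           # (T, 75)
```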

DragonLiu1995 commented 4 years ago

> @DragonLiu1995 Thanks for releasing the code. I was checking how you pre-process the NTU dataset, and it's not clear what the shape of each input sample should be in the following code:
>
> ```python
> # Normalize Bones
> for i in range(len(train_data)):
>     train_data[i]['input'] = normalize_bone(np.array(train_data[i]['input']))
> for i in range(len(test_data)):
>     test_data[i]['input'] = normalize_bone(np.array(test_data[i]['input']))
> ```
>
> As I understand it, each video in the NTU dataset is represented by a [P x F x J x C] array, where P = 2 persons (with one being all zeros in the case of a single-person action), F = number of frames, J = number of joints, and C = the 3D joint coordinates. Can you explain the shape of each train_data[i]['input'] that goes into the normalization function?

Each training sample is one person. If there are two people in a video, we separate them into two sample sequences.
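
A minimal sketch of that separation, assuming the [P x F x J x C] layout described above:

```python
import numpy as np

# Sketch: split a [P, F, J, C] clip into per-person (F, J*C) sequences,
# dropping the all-zero person slot left by single-person actions.
def split_persons(clip):
    samples = []
    for person in clip:                      # person: (F, J, C)
        if np.any(person):                   # skip the all-zero padding slot
            samples.append(person.reshape(len(person), -1))
    return samples
```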

fmthoker commented 4 years ago

@DragonLiu1995 Thanks for the clarification. However, I am getting NaN values in the normalization function when passing raw skeleton video sequences. Also, it is not clear where you compute the R matrix discussed in the paper. Can you make the pre-processing part clearer; is there something I am missing?

sukun1045 commented 4 years ago

@fmthoker Hi, thanks for your interest again. The current preprocessing script is the processing applied before feeding data into the network. The data has been cleaned by a set of operations, and the view-invariant transform has also been applied. I just uploaded the view-invariant transform example for the NTU dataset; you can check the details there. To save you time, I have also shared the processed data on Google Drive. The raw_train/test_data.pkl files are the clean data after removing noise, two-person situations, etc. The trans_train/test_data.pkl files are the data after applying the view-invariant transform.
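
For reference, a minimal sketch of loading the shared pickles (the expanded filenames trans_train_data.pkl / trans_test_data.pkl are assumptions):

```python
import pickle

# Sketch: load the shared pickles; each entry is assumed to hold an
# action sequence together with its label, as described earlier.
with open('trans_train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)
with open('trans_test_data.pkl', 'rb') as f:
    test_data = pickle.load(f)
print(len(train_data), len(test_data))
```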

fmthoker commented 4 years ago

@sukun1045 Thank you so much for this, it is a lifesaver.

fmthoker commented 4 years ago

@DragonLiu1995 @sukun1045 Can you please let me know whether the without-training results in the paper, P&C Rand (Ours): 56.4 / 39.6 on NTU, were obtained with AEC or without? Without training, I am able to achieve a kNN accuracy of 0.3901 without AEC and 0.4398 with AEC for cross-view evaluation, and 0.3346 without AEC and 0.3718 with AEC for cross-subject evaluation.

sukun1045 commented 4 years ago

@fmthoker Those results are obtained without training anything, and they depend on the initialization of the network and possibly the framework (we have only tested it on TensorFlow). The reason for listing P&C Rand is to show that even a random encoder already gives reasonable accuracy on such a large dataset, so a good training process should improve on this accuracy rather than yield lower or equal accuracy, as happens in the LongT GAN case.
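
For anyone comparing numbers, a minimal sketch of such a kNN evaluation with scikit-learn (the cosine metric and k = 1 are assumptions, not necessarily what the paper used):

```python
from sklearn.neighbors import KNeighborsClassifier

# Sketch of the kNN evaluation: encoder features (trained or random) are
# classified by their nearest neighbours among the training features.
def knn_acc(train_feats, train_labels, test_feats, test_labels, k=1):
    knn = KNeighborsClassifier(n_neighbors=k, metric='cosine')
    knn.fit(train_feats, train_labels)
    return knn.score(test_feats, test_labels)
```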

fmthoker commented 4 years ago

@sukun1045 Thanks for the quick response. Yes, I am aware of that; the numbers I mentioned above were also obtained without any training. However, I just wanted to know what initialization and framework (especially with or without AEC) you used to obtain the reported numbers for P&C Rand (Ours), 56.4 / 39.6.

sukun1045 commented 4 years ago

@fmthoker We were using TensorFlow 1.14, and the architecture code would be similar to what is shown in UCLA_demo.ipynb, without AEC. For initialization, we use the default for the GRU in tf 1.14 (random uniform initialization in [-0.05, 0.05]).
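
For illustration, a minimal sketch of spelling that initialization out explicitly in tf 1.14 (not the repo's exact code):

```python
import tensorflow as tf  # assuming TensorFlow 1.14

# Sketch: a GRU cell with the uniform initialization made explicit.
cell = tf.nn.rnn_cell.GRUCell(
    num_units=2048,
    kernel_initializer=tf.random_uniform_initializer(-0.05, 0.05))
```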

fmthoker commented 4 years ago

@sukun1045 Thanks for the reply again. I also tried to reproduce the numbers using your TensorFlow code and the provided pre-processed data. However, the numbers don't match the reported results. Here are my results and configuration using TensorFlow 1.14 without any modifications: I used a batch size of 64, rnn_size = 2048, sequence length = 50, and feature_size = 75 with the fixed-state strategy; all other hyperparameters are the same as in UCLA_demo.ipynb. Cross-subject = 0.48 and cross-view = 0.60. Incidentally, these numbers are close to what I get when I use the PyTorch implementation.
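
For concreteness, the configuration above as a sketch (the parameter names are illustrative, not the repo's):

```python
# Sketch of the configuration described above (names are illustrative).
config = dict(
    batch_size=64,
    rnn_size=2048,
    seq_len=50,
    feature_size=75,        # 25 joints x (x, y, z)
    state_strategy='fixed',
)
```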

sukun1045 commented 4 years ago

@fmthoker Would you mind telling me whether your P&C Rand result matches the reported value? I am cleaning up our previous code and rerunning the experiment. It may take about 8 hours to get the final result, but as an initial data point, the kNN score for the CS case should be around 56% (see the screenshots below, from the initial random model and the initial training). I will notify you when I am ready to publish the cleaned version of the notebook.

[screenshots: kNN scores from the initial random model and after initial training]

fmthoker commented 4 years ago

@sukun1045 Thanks for the quick response. Without any training, I am getting 0.4812 for cross-view and 0.3915 for cross-subject evaluation. If possible, can you share the training script once you finish the experiment?

sukun1045 commented 4 years ago

@fmthoker Yes, I will do that.

sukun1045 commented 4 years ago

@fmthoker You can check the NTU_demo notebook. It is a quick implementation for NTU cross-view with the fixed-state strategy, without AEC. It should reach around 75% after 15000 iterations.

fmthoker commented 4 years ago

@sukun1045 Thanks for sharing the script. I was able to reproduce the results within some margin. I would like to point out that the problem was due to the bone normalization part in ntu_preporcess.py. I also removed the bone normalization from the PyTorch code, and it seems to work better now too.

sukun1045 commented 4 years ago

@fmthoker Oh, I see. That was an old normalization we tried before, and it was accidentally added to the repo. Sorry about that.

fmthoker commented 4 years ago

@sukun1045 It is fine. However, I was wondering whether you also performed a linear evaluation experiment, training only a fully connected layer on the extracted features and labels.

sukun1045 commented 4 years ago

@fmthoker Yes, we tried, but it added extra parameters and didn't perform well.

fmthoker commented 4 years ago

@sukun1045 So fine-tuning only the last layer with labels was worse than just clustering the final features using kNN without any labels? Do you have the exact numbers?

sukun1045 commented 4 years ago

@fmthoker From our previous experiments, fixing the trained encoder and fine-tuning only the classifier gives about 61% accuracy for cross-view, which is very similar to what you get if you just fix the random initialization of the encoder and train the one classifier. If you train the encoder and classifier jointly in the supervised setting, the accuracy is about 80%. Hope this information is helpful.
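
For reference, a minimal PyTorch sketch of that linear-evaluation setting (the encoder handle, the 2048-dim features, and the 60 classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch: freeze the trained encoder and train only a single linear
# classifier on top of its features.
def linear_probe(encoder, feat_dim=2048, num_classes=60):
    for p in encoder.parameters():
        p.requires_grad = False              # encoder stays fixed
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    return classifier, optimizer
```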

fmthoker commented 4 years ago

@sukun1045 Thanks for this information, it is really helpful. But it seems a little strange that fine-tuning a classifier on the trained features performs worse than simply clustering the same features without any labels. What do you think is the reason for this?

sukun1045 commented 4 years ago

@fmthoker I don't have a clear answer for that, but from my understanding, the self-supervised training task (regeneration) may not learn exactly the same features as those usually seen in supervised classification. I think the representations learned in either case can somehow separate the actions, but they don't behave in the same way.

fmthoker commented 4 years ago

@sukun1045 I understand your point. But I am wondering how we would fairly compare new self-supervised methods with this method. Under linear evaluation your method does not seem to perform well, while under clustering-based evaluation it seems to learn good representations. The two protocols have different use cases.

sukun1045 commented 4 years ago

@fmthoker To me, it only makes sense to compare these two cases separately. Evaluation via kNN shows the effectiveness of the pure representation you retrieve from the self-supervised method. The result of fine-tuning one classifier demonstrates the flexibility of the learned representation for various downstream tasks.

fmthoker commented 4 years ago

@sukun1045 Thanks for the discussion and overall help. I will get back to you with more questions in the future.

fmthoker commented 4 years ago

@sukun1045 FYI, I reproduced the results using the PyTorch implementation on the NTU-60 dataset. Here are some important points/results.

It seems the number of neighbours impacts performance differently for cross-view and cross-subject training. Also, with PyTorch the network seems to converge very quickly: for cross-view after only 15 epochs, and for cross-subject after only 5 epochs. Training for more epochs resulted in a decrease in performance. A sketch of the neighbour-count sweep follows below.
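
For illustration, a minimal sketch of sweeping the neighbour count with scikit-learn (the cosine metric and the k values are assumptions):

```python
from sklearn.neighbors import KNeighborsClassifier

# Sketch: sweep the neighbour count, since it affects the two
# evaluation protocols differently.
def sweep_k(train_feats, train_labels, test_feats, test_labels):
    for k in (1, 5, 9, 15):
        knn = KNeighborsClassifier(n_neighbors=k, metric='cosine')
        knn.fit(train_feats, train_labels)
        print(k, knn.score(test_feats, test_labels))
```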

One more thing: would it be possible for you to share your implementation of the LongT GAN paper [36]? I need to do a linear evaluation of that method too.

khurramHashmi commented 4 years ago

Hi @fmthoker, you mentioned that you reproduced the results using PyTorch. Could you please share the repository?

Thanks!

fmthoker commented 4 years ago

@khurramHashmi The PyTorch code is already in the repository; I just added a new data loader for the NTU dataset, using the training data files provided via the above-mentioned Google Drive link (link). I must say the reproduced results are not exactly the same as the TensorFlow implementation's.

fmthoker commented 4 years ago

@DragonLiu1995 @sukun1045 Can you please mention how you created the raw data files (raw_train/test_data.pkl) from the original .skeleton files for NTU-60? I am trying to do the same for NTU-120, but I get a divide-by-zero error during the view-invariant transformation script. Can you share how to convert the .skeleton files provided by the dataset authors into the above-mentioned format?

DragonLiu1995 commented 4 years ago

> Can you please mention how you created the raw data files (raw_train/test_data.pkl) from the original .skeleton files for NTU-60? I am trying to do the same for NTU-120, but I get a divide-by-zero error during the view-invariant transformation script. Can you share how to convert the .skeleton files provided by the dataset authors into the above-mentioned format?

Hi @fmthoker, the steps to get raw_train/test_data.pkl are included in NTU60_preprocess.ipynb under the preprocess folder. There are 3 steps to get the final skeletons. The code for steps 1 and 2 is adapted from this repo; step 3 simply splits the data by the Cross-View/Cross-Subject scheme.

fmthoker commented 4 years ago

@DragonLiu1995 Thank you for your help; I was able to run the code for NTU-120 now.