shinkyo0513 / Surgical-Skill-Assessment-via-Video-Semantic-Aggregation

Code for Surgical Skill Assessment via Video Semantic Aggregation (MICCAI 2022)

How to select the proper model to test on the video dataset? #2

Closed JixiangChen-Jimmy closed 1 year ago

JixiangChen-Jimmy commented 1 year ago

Hi, thanks for your interesting work. However, I noticed that you did not hold out a validation set to decide which checkpoint from training should be kept for testing; instead, the test set is used directly during training. According to your code, the final results are reported by averaging, over the different split indexes, the best result each split achieves on the test set during training. This is not the right way to report results.
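To make the concern concrete, here is a small, self-contained sketch of the two reporting protocols; the per-epoch numbers are invented purely for illustration and are not taken from the repository or the paper.

```python
import numpy as np

# Toy per-epoch Spearman's Rho values on a validation split and a test split
# (illustrative numbers only).
val_rho  = np.array([0.10, 0.35, 0.50, 0.60, 0.55, 0.45])
test_rho = np.array([0.20, 0.30, 0.70, 0.50, 0.45, 0.40])

# Protocol questioned above: the checkpoint is picked by looking at the test
# split itself, so the reported number is an optimistic upper bound.
best_on_test = test_rho.max()                 # 0.70

# Standard protocol: the checkpoint is chosen on the validation split and the
# test split is read exactly once, at that chosen epoch.
chosen_epoch = int(val_rho.argmax())          # epoch 3
test_at_chosen = test_rho[chosen_epoch]       # 0.50

print(best_on_test, test_at_chosen)
```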

shinkyo0513 commented 1 year ago

Hi Jimmy, thanks for your interest in our work. We follow the standard data split forms of the JIGSAWS dataset: the models are trained on the training set and tested on the validation set. The cross-validation schemes alleviate the random fluctuations in test results caused by the small dataset. Our experimental setting also follows many previous works, such as MTL-VF.

JixiangChen-Jimmy commented 1 year ago

Hi Zhenqiang, thanks for your reply! I can see from your dataset code that you follow the standard data split forms of the JIGSAWS dataset. I also understand that random fluctuations are common on small datasets, which is why a visible performance gap appears between the results of the best epoch and those of the final epoch on the test set.

To make my question clear: did you treat the standard JIGSAWS splits as a 'train-validation' split or as a 'train-test' split? In other words, for the video split files under the 'Experimental_setup/task_name/unBalanced/GestureRecognition/split_type/x_out/itr_1' folder, are the videos in 'Test.txt' used for testing, or only as validation, so that no actual testing is performed?

What confuses me most are the LOUO experiments reported with your protocol. When reproducing your results with the provided settings, the gap between the 'best epoch', 'last epoch', and 'average over the last several epochs' results is around 0.2 in Rho under the LOSO/4-Fold settings; there the predictions generally converge in the last epochs, and the converged result is not far from the 'best epoch' result (about a 0.2 gap in Rho on average). Under LOUO, however, I notice that in some epochs, especially the early ones, the model has not converged properly and produces predictions with very large absolute errors (generally larger than 5, sometimes even higher), yet these predictions accidentally yield a fairly good rank correlation on the test split. Once the absolute error converges in the final epochs, the converged Rho can be very different from that of the 'best epoch' (e.g., -0.3 vs. 0.7). As a result, the reproduced 'last epoch' and 'best epoch' results differ greatly under LOUO: averaging the results of the three tasks under five random seeds, I get a Rho of only 0.04.
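To make the three summaries concrete ('best epoch', 'last epoch', 'average over the last several epochs'), here is a minimal sketch; the ground-truth scores and per-epoch predictions are invented just to show how predictions with a huge absolute error can still be ranked almost perfectly.

```python
import numpy as np
from scipy.stats import spearmanr

gt = np.array([12.0, 18.0, 25.0, 30.0])          # toy ground-truth skill scores
preds_per_epoch = [
    np.array([40.0, 45.0, 55.0, 60.0]),          # early epoch: MAE > 25, yet Rho = 1.0
    np.array([14.0, 20.0, 22.0, 28.0]),
    np.array([13.0, 24.0, 21.0, 29.0]),          # last epoch
]

rhos = [spearmanr(gt, p).correlation for p in preds_per_epoch]
maes = [float(np.abs(gt - p).mean()) for p in preds_per_epoch]

best_epoch = max(rhos)                           # 'best epoch' summary
last_epoch = rhos[-1]                            # 'last epoch' summary
last_k_avg = float(np.mean(rhos[-2:]))           # average over the last K epochs
print(best_epoch, last_epoch, last_k_avg, maes)
```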

JixiangChen-Jimmy commented 1 year ago

[image] Here I provide an example of the plotted result (split index 4 in Needle Passing); the orange line is the test result and the blue line is the training result.

shinkyo0513 commented 1 year ago

Thanks for your detailed explanations. I really appreciate that you ran our code so carefully. Actually, I observed a similar phenomenon when running another existing method as a baseline. Based on my analysis, the ranking correlation tends to be unstable, especially when the test/validation set is small (e.g., fewer than 5 samples). Hence, compared with the ranking correlation, I personally prefer the L1 error (or MAE) as the evaluation metric (as done by this paper recently). I also think that training and validating on all videos across the three tasks of JIGSAWS gives more stable results (as we did in the ablation study). You can also find the answer to your question in the experiment details of our paper. I hope my answer helps with your experiments.
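A small illustration of that instability (all numbers invented): on a 4-sample split, swapping the predicted order of two near-tied videos already moves Rho from 1.00 to 0.80, and with even fewer samples a single swap pushes it far lower, while the MAE barely changes.

```python
import numpy as np
from scipy.stats import spearmanr

gt     = np.array([15.0, 20.0, 26.0, 27.0])      # two near-tied ground-truth scores
pred_a = np.array([16.0, 21.0, 26.5, 27.5])      # correct ranking
pred_b = np.array([16.0, 21.0, 27.5, 26.5])      # last two videos swapped

for name, pred in (("pred_a", pred_a), ("pred_b", pred_b)):
    rho = spearmanr(gt, pred).correlation
    mae = float(np.abs(gt - pred).mean())
    print(f"{name}: rho = {rho:+.2f}, MAE = {mae:.2f}")
# pred_a: rho = +1.00, MAE = 0.75
# pred_b: rho = +0.80, MAE = 1.00
```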

JixiangChen-Jimmy commented 1 year ago

Thank you for the reply! I did observe this phenomenon in experiments with other methods, such as MTL-VF (the one mentioned above). That is why I am so confused by the LOUO results reported on JIGSAWS, as I cannot reproduce the performance of many state-of-the-art methods. On such a small dataset, where some split indexes of a single task contain only a few test videos, the ranking correlation tends to be less meaningful, since even predictions with a large L1 error can still yield a high ranking correlation. I think a larger surgical skill assessment dataset is needed to avoid this issue. Anyway, thanks again for the discussion!