Tried to reproduce results

IsabelFunke commented 5 days ago

Hello,

congrats on your interesting paper and kudos for publishing the code! Looks really cool!

I followed steps 2.1 and 2.2 to train feature cache (which seems to equal the BNpitfalls method) and DACAT on Cholec80. Unfortunately, I achieved slightly lower numbers than those in the paper:

Method	Accuracy	Macro Jaccard
BNpitfalls	92.6 +- 5.1	78.8 +- 9.5
DACAT	93.4 +- 3.9	80.3 +- 8.1

Or, if I compute the metrics with relaxed boundaries like in your paper:	Method	R-Accuracy	R-Precision	R-Recall	R-Jaccard
BNpitfalls	93.8 +- 5.0	91.2 +- 5.8	91.4 +- 8.0	83.1 +- 9.3
DACAT	94.6 +- 3.8	92.5 +- 5.9	91.4 +- 8.5	84.6 +- 10.0

I think that one reason for the discrepancy could be which model is selected for testing. (I couldn't find the details regarding model selection in your paper)

From the code, it seems that the model that achieves best validation accuracy (checkpoint_best_acc.pth.tar) is to be selected from step 2.1, and I followed that. However, in step 2.2, where we run train_longshort.py and use PhaseModel from newly_opt_ykx.LongShortNet.model_phase_maxr_v1_maxca, the model is saved after every epoch instead of keeping the model with best validation accuracy or f1, see code. Which model am I then supposed to pick for testing? For consistency, I chose the model with best validation accuracy again. What did you do for the paper?

Regarding the best validation accuracy: I noticed that you use a datasplit called cuhk4040, which uses the first 40 videos for training and 8 videos for validation, which overlap with the training data, see code. Wouldn't it make more sense to use a validation set that is independent of the training data?

Again, congrats on your work, which shows a clear improvement over BNpitfalls. And thank you for your time to read my feedback!

Kind regards, Isabel

kk42yy commented 4 days ago

Hello Isabel,

I am glad that you are interested in our work. Now, I will attempt to address your questions.

[Q1] Reproduce the results in paper

If you wish to directly replicate the results reported in our article, you can use our pre-trained model for inference, which is available for download from here, (see 3.Infer in ReadMe).
If you are interested in retraining a DACAT model, please refer to my subsequent response.

[Q2] Dataset Split 32/8/40 and 40/0/40

In the Cholec80 dataset, there are two splits: 32/8/40 and 40/0/40. We followed the training schemes of most comparative methods by initially using the 32/8/40 split for hyperparameters tuning and observing the model's convergence trend. Subsequently, we employed the 40/0/40 split for the final model training.
During the DACAT training (Step 2.2) using the 32/8/40 split, we observed that the model typically achieved its best accuracy on the validation set around the 25 to 30 epochs. Therefore, we recommend conducting the second phase of training for a total of 30 epochs.

[Q3] Code for saving model

I'm very sorry, the part of the code we've uploaded for saving the model is used for hyperparameters tuning. Since there is no validation set, there's no need to save the model based on the best_acc nor f1.

Hope that the above answer will address your question.

Best wishes, Kaixiang

IsabelFunke commented 4 days ago

Hello Kaixiang,

Thank you very much for your quick response!

It seems to be a valid approach to simply use the model after 30 training epochs for testing. In that case, I get the following results:

Method	Accuracy	Macro Jaccard
BNpitfalls	92.6 +- 5.1	78.8 +- 9.5
DACAT	93.4 +- 3.9	80.3 +- 8.1
after 30 epochs	92.4 +- 5.1	79.2 +- 8.8

Or, if I compute the metrics with relaxed boundaries like in your paper:	Method	R-Accuracy	R-Precision	R-Recall
BNpitfalls	93.8 +- 5.0	91.2 +- 5.8	91.4 +- 8.0	83.1 +- 9.3
DACAT	94.6 +- 3.8	92.5 +- 5.9	91.4 +- 8.5	84.6 +- 10.0
after 30 epochs	93.8 +- 4.9	91.9 +- 4.0	91.9 +- 7.4	84.1 +- 8.6

So, in my training run, I was a bit unlucky and the model after 30 epochs is not the one that works best on the test set. This can also be seen in the log file that was created for my training run (I created a simple plot for quick visualization). log log.txt

I think that this can be expected, because it depends also on the state of the random number generator etc how your training will go. Therefore, in my opinion, the results of only one training run are not so meaningful and may be overly optimistic.

I think if you repeat your experiments (all in the same manner but using different random seeds) a few times, you will also observe some variability in the results that the model achieves after 30 epochs. Then, reporting the mean and variability over several experimental runs would provide a more reliable estimate of the model performance that can be expected.

Best wishes, Isabel

kk42yy commented 4 days ago

Hello Isabel,

Thank you very much for your suggestion. As you mentioned, we have conducted experiments using different random seeds and reported the better results here. From the result graph you provided, it can also be seen that your $25^{th}$ (or $26^{th}$) epoch of training had better results, which is consistent with our observations in the experiment, i.e., performs well around $25ep$ to $30ep$.

We will report more detailed experimental results, such as mean and variance, in subsequent journal articles.

Best wishes, Kaixiang

IsabelFunke commented 3 days ago

Hello Kaixiang,

It is correct that the DACAT model achieves better results on the test set after other training epochs. However, this information is irrelevant because, for evaluation purposes, we cannot know anything about the test set. Especially, we cannot know how well the model performs on the test set itself at a specific point during training. This is why we need a clear rule that is defined a priori and independently of the test set, which tells us which model we need to pick for evaluation after model training.

If we just look for the model that achieves the best test accuracy, then we can find a very good BNpitfalls model as well. To show this, I added the results from the first 100 epochs of Stage 1 (BNpitfalls) training to the previous plot (gray dots). But, clearly, this approach to model selection (using information about test performance) does not make much sense.

log

Best wishes, Isabel

kk42yy commented 3 days ago

Hello Isabel,

My answer yesterday seemed a bit confusing, now let me add the following clarification:

Firstly, I did not imply selecting the model with the highest accuracy. As previously stated, in the process of tuning using 32/8/40, after multiple experiments with multiple random seeds, I can confirm that the model with the best performance in the validation set is located in epochs 25 to 30. Your plot yesterday (results on 40 test data) also confirms my conclusion.

Finally, the difficulty in selecting models is due to the lack of validation sets. One of the ways we provide in the code is to use 33-40 videos with 30ep for model selection. You can certainly set it up yourself, or refer to other methods. But I think the way to select the model for Stage 2 and Stage 1 should be consistent.

Best wishes, Kaixiang

IsabelFunke commented 10 hours ago

Hello Kaixiang,

Thank you for your patience in answering my questions. I'm indeed a bit confused. I'm aware that there are different approaches for model selection and that it is recommended to use a separate validation set for this purpose. However, my main request is to hear how I should train and select a model that can reproduce the results from the paper. To this end, I would like to simply use the same strategy as you did.

From the paper and your previous comments, I understood that you used the 40-40 split on Cholec80, so no separate validation set. But it is still unclear to me which model I should pick for testing. If i should pick the model after a fixed number of training epochs, I would need to know the exact number, not a range of 25-30 epochs.

You mentioned that the model selection should be performed in a comparable way for Stage 1, and I agree but I'm also confused. Because for Stage 1, according to the instructions in the README, I should use checkpoint_best_acc.pth.tar, i.e., the model with best validation accuracy. If I run the code as provided, then this is the model that achieves best accuracy on a subset of the training set (videos 32-40). Is this kind of model selection intended? If not, could you please provide the details also for how to select the Stage 1 model?

Best wishes, Isabel

kk42yy commented 10 hours ago

Hello Isabel,

To reduce the difficulty of reproduction, I strongly recommend that you use our publicly available Stage 1 model and start the training of Stage 2 directly. If you use our Stage 1 model, then you just need to save the $30^{th}$ epoch model, i,e, checkpoint_current.pth.tar.

By the way, the model we have published can reproduce the results presented in the paper.

Best wishes, Kaixiang

kk42yy / DACAT