Closed · FrancescoCappio closed this issue 2 years ago
Hello, these are the test results on the whole target domain (both known and unknown classes), and the best overall H-score is the H-score reported in the paper.
However, the Acc in the paper is computed only on the known classes of the target domain; it is printed during training (when you run daml.py).
Thank you very much for your answer!
I do have some other questions.
So, if I understand correctly, to replicate the paper's accuracy (Acc), which is computed on known classes only, I should look at the line best_test_acc1 =... printed at the end of training, right?
One other question: this known-class accuracy (Acc) is computed without looking at the confidence score, so it is not the same known-class accuracy that is used to compute the H-score (called insider in validate.py)? I mean, the formula for the H-score is: H-score = 2 * (known_acc * unk_acc) / (known_acc + unk_acc).
But the known_acc value in the formula (insider) is obtained by selecting predictions with confidence higher than the threshold, while unk_acc (called outsider in validate.py) is obtained from predictions with confidence lower than the threshold. For the known-class accuracy you report in the paper (Acc), you do not use any confidence threshold, right?
Q1: We report 'best_val_test_acc1' in the code, which takes the checkpoint that achieves the highest accuracy on the held-out validation set and tests it on the target domain. It is printed at line 209 of the code and appears as 'Mean validation acc...' on the line before 'best_val_acc1' and 'best_test_acc1' in the training log. 'best_test_acc1' instead takes the checkpoint that achieves the highest accuracy on the target domain itself, so it may be similar to or slightly higher than 'best_val_test_acc1'.
Q2: The H-score is used when unknown classes exist. In this realistic situation, we first need to decide whether a sample comes from a known or an unknown class, and only then classify the known samples into specific labels. The known-class accuracy here therefore differs from that in Q1 (for example, some known-class samples may be incorrectly rejected as unknown because of low confidence).
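To make the distinction concrete, here is a minimal sketch of such a thresholded evaluation. The names insider/outsider mirror the terminology in validate.py, but the function itself is illustrative, not the actual repository code:

```python
import numpy as np

def h_score(confidence, pred, label, threshold, num_known):
    """Sketch of an open-set evaluation with a confidence threshold.

    Samples with confidence >= threshold are accepted as known and judged
    by their predicted label; samples below the threshold are rejected as
    unknown. Ground-truth labels >= num_known denote the unknown class.
    """
    known_mask = label < num_known
    # insider: accuracy on ground-truth known samples; a known sample is
    # correct only if it is accepted AND its label is predicted correctly
    insider = np.mean((confidence[known_mask] >= threshold)
                      & (pred[known_mask] == label[known_mask]))
    # outsider: accuracy on ground-truth unknown samples, i.e. the
    # fraction of them that are rejected by the threshold
    outsider = np.mean(confidence[~known_mask] < threshold)
    # harmonic mean of the two accuracies (small epsilon avoids 0/0)
    return 2 * insider * outsider / (insider + outsider + 1e-12)
```

This also shows why the H-score's known-class accuracy can be lower than the plain Acc: a correctly classified known sample still counts as an error here if its confidence falls below the threshold.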
OK! Thank you very much, now everything is clear! Just one last question: I noticed in validate.py that you divide the range of confidence values into 10 intervals and then evaluate the H-score with 10 threshold values, printing the highest H-score at the end. I was wondering whether this strategy is really sound, given that it obviously cannot be applied to unlabeled target data. Do you have any suggestion on how to choose an appropriate threshold value for unlabeled data?
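For reference, the sweep I am describing could be sketched as follows (my own illustrative reconstruction, not the validate.py source; argument names are invented):

```python
import numpy as np

def sweep_h_score(confidence, correct_known, is_known, num_thresh=10):
    """Evaluate the H-score at `num_thresh` thresholds spanning the
    observed confidence range and return the best value.

    confidence:    per-sample confidence scores on the target domain
    correct_known: True where the closed-set prediction matches the label
    is_known:      True where the ground-truth label is a known class
    """
    thresholds = np.linspace(confidence.min(), confidence.max(), num_thresh)
    best = 0.0
    for t in thresholds:
        # known-class accuracy under threshold t ("insider")
        insider = np.mean((confidence[is_known] >= t) & correct_known[is_known])
        # unknown-class accuracy under threshold t ("outsider")
        outsider = np.mean(confidence[~is_known] < t)
        h = 2 * insider * outsider / (insider + outsider + 1e-12)
        best = max(best, h)
    return best
```

My concern is precisely that taking the max over thresholds uses target labels, which would not be available in practice.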
Thank you for the valuable advice. My idea is that we could choose a certain percentile of the confidence on the held-out data, or perhaps use some additional data as a proxy for outliers; I think it remains an open problem.
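The percentile idea could look like this. This is purely a sketch of the suggestion above; the 5th percentile is an arbitrary assumed choice, not a value from the paper:

```python
import numpy as np

def percentile_threshold(val_confidence, pct=5.0):
    """Pick the rejection threshold as a percentile of the confidence
    scores observed on held-out (labeled source) validation data.

    Target samples whose confidence falls below this threshold would
    then be predicted as unknown. `pct` is a hypothetical tuning knob.
    """
    return np.percentile(val_confidence, pct)

# Toy held-out confidences (made-up numbers for illustration)
val_conf = np.array([0.95, 0.9, 0.88, 0.7, 0.6])
t = percentile_threshold(val_conf)
```

The appeal of this scheme is that it needs no target labels at all, which addresses the concern about the oracle-style threshold sweep.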
Ok thank you very much!
Hello! I am really interested in the Open Domain Generalization setting that you propose in your paper, as I think it is very useful for real-world applications. I am therefore trying to replicate your method's results following the instructions here in the repository. I trained all models on the OfficeHome dataset and then tested them.
The output I obtain (e.g. for the shift ACP->R) is:
If I understand correctly, the "best overall accuracy" is what you report in the "Acc" column of Table 3 of the paper, while the "best overall H-score" is what you report in the "H-score" column. Is this correct?