I was looking at your code and attempting to recreate your results.
If this is how the results quoted in the paper were obtained, it seems a bit strange that you are fine-tuning your threshold on the test set, notwithstanding the fact that the threshold is tuned per dataset in the benchmark (which is mentioned in the paper).
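For comparison, here is a minimal sketch of the protocol I would have expected: the threshold is picked on a held-out validation split and then applied unchanged to the test split. The function names, the choice of F1 as the selection criterion, and the toy arrays are all my own assumptions for illustration; none of them are taken from your repository or the paper.

```python
import numpy as np


def f1_at_threshold(scores, labels, threshold):
    """F1 of the binary decision `scores >= threshold` against 0/1 labels."""
    preds = scores >= threshold
    tp = np.sum(preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    fn = np.sum(~preds & (labels == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def select_threshold(val_scores, val_labels):
    """Pick the threshold that maximises F1 on the validation split only."""
    candidates = np.unique(val_scores)  # sorted unique score values
    f1s = [f1_at_threshold(val_scores, val_labels, t) for t in candidates]
    return float(candidates[int(np.argmax(f1s))])


# Toy data (hypothetical): scores and 0/1 labels for each split.
val_scores = np.array([0.10, 0.40, 0.35, 0.80])
val_labels = np.array([0, 0, 1, 1])
test_scores = np.array([0.20, 0.50, 0.30, 0.90])
test_labels = np.array([0, 1, 0, 1])

# The threshold is fixed using validation data, then reused as-is on test.
threshold = select_threshold(val_scores, val_labels)
test_f1 = f1_at_threshold(test_scores, test_labels, threshold)
print(f"threshold={threshold}, test F1={test_f1}")
```

With this split, the test labels never influence the threshold, so the reported test score is not optimistically biased by the tuning step.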