Closed FabianIsensee closed 5 years ago
I understand that teams may want to be sure that they don't have an unexpected drop in performance due to a bug. But I think reporting a noisy value of the Dice score would still accomplish that while at the same time preventing too much fine-tuning.
No apology necessary, I agree completely that we need to ensure the fidelity of the test set scores and avoid tuning.
First of all, we are obfuscating the true value. In the announcement of this rule change, we state
we will be allowing teams to see the approximate scores of at most two submissions prior to the deadline. By approximate, I mean the aggregate score for a randomly selected 45 of the test cases.
Second, our requirement that all submissions be associated with a manuscript will serve to deter teams from submitting under more than one name. We are reviewing these manuscripts as submissions come in, and we are watching for substantially similar submissions and overlapping author lists.
BraTS has a validation set for tuning (teams get feedback), but uses the test set for ranking (only one submission is allowed). That's good.
Ah thank you for clarifying. I did not read the announcement, just the statement on the submission page. Also your point about the manuscripts is completely valid. So yeah, everything is great. Apologies :-)
@shawnyuen I am not convinced that this is the best way to go. The validation set in BraTS is not a very good use of annotated data (in my opinion, of course). If you need to validate your approach internally, run a cross-validation, or do a single train/val split yourself. Yes, the validation set in BraTS is the same for all candidates, so scores are comparable. But nobody really cares about the validation scores; everything that counts is done on the test set. I am not even sure how stable the validation score is: is the team with the highest validation score also the best team on the test set? Last year that was pretty much the case, but the rankings further down may be less stable. So, TL;DR (and imo), the validation set is not super useful. If you have more annotated data, it is better used as additional training or test cases.
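To make the cross-validation suggestion concrete, here is a minimal sketch (scikit-learn's KFold is used purely for illustration; the case count and ID format are assumptions, not values taken from the challenge files):

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical training case IDs (the count of 210 and the ID format are
# assumptions for illustration, not taken from the challenge data).
case_ids = np.array([f"case_{i:05d}" for i in range(210)])

# 5-fold cross-validation: every annotated case is used for validation
# exactly once, so no labeled data is permanently withheld from training.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(case_ids)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val cases")
```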
Best, Fabian
@neheller could you add the information about the approximate nature of the feedback to the Submission page as well? As far as I can see it is only in the Forum and I am not sure if everybody reads all posts there (I at least didn't :-) )
45 cases is still a good proportion of the test set and I would expect scores to be quite representative. Personally, I would have preferred fewer cases (like 25 or so) but this is a good compromise.
Yes, I've changed the wording to what follows. Thanks for pointing this out :)
Submissions have been enabled for the 2019 challenge. Within 24 hours of submitting, you will receive an email asking whether or not you would like to hear your approximate score. Scores for each team will be provided only twice, but you may keep submitting after receiving two scores. The most recent submission prior to the deadline will be the one used for the competition.
Scores reported by this mechanism are approximate in that they are calculated using only a randomly selected 45 of the test cases. This random sampling is repeated with each submission, so you are unlikely to be evaluated on the same 45 cases as in a prior score report or in that of another team.
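For intuition only, here is a minimal sketch of how such an approximate score could be computed, assuming per-case Dice values are already available (this is not the organizers' actual evaluation code; the function name and data structure are made up):

```python
import random

def approximate_score(per_case_dice, n_cases=45, seed=None):
    """Aggregate Dice over a random subset of the test cases.

    per_case_dice: dict mapping case ID -> Dice score for one submission
    (a made-up structure for this sketch, not the organizers' format).
    A fresh subset is drawn on every call, mirroring the point that the
    random sampling is repeated with each submission.
    """
    rng = random.Random(seed)
    sampled = rng.sample(sorted(per_case_dice), n_cases)
    return sum(per_case_dice[c] for c in sampled) / n_cases

# Example with fake per-case scores for a 90-case test set:
fake_scores = {f"case_{i:05d}": random.uniform(0.5, 0.95) for i in range(90)}
print(approximate_score(fake_scores))
```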
@shawnyuen, when first designing the challenge, we discussed the idea of a validation set internally, but ultimately came to the same conclusions as @FabianIsensee just mentioned.
You're probably right that the same result could have been achieved with a lower proportion, and 45 was chosen somewhat arbitrarily. I guess I went with a higher number since the difficulty of each case varies considerably, and it's conceivable that 25 really difficult cases might be randomly selected and give an alarmingly low tumor Dice.
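A quick simulation illustrates that trade-off. The sketch below uses made-up per-case Dice values and an assumed 90-case test set, since the real score distribution is not public:

```python
import random
import statistics

random.seed(0)

# Hypothetical per-case tumor Dice values, spread widely to mimic cases of
# very different difficulty (case count and distribution are assumptions).
per_case_dice = [random.uniform(0.2, 0.95) for _ in range(90)]

# How much does the aggregate score wobble for subsets of 25 vs. 45 cases?
for subset_size in (25, 45):
    subset_means = [
        statistics.mean(random.sample(per_case_dice, subset_size))
        for _ in range(10_000)
    ]
    print(f"{subset_size} cases: mean={statistics.mean(subset_means):.3f}, "
          f"stdev={statistics.stdev(subset_means):.3f}")
```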
Thank you so much. And... once again, props for putting this challenge together. It seems that you have thought of everything. This is really one of the best-organized challenges so far, and I very much appreciate how quickly you are responding :-) I hope to meet you at MICCAI.
Erm... Now that is something I don't like to see. Can you please at least add Gaussian noise (or any other type of noise that prevents us from knowing the exact Dice scores) to the reported values? Anything that obfuscates the value by about ±1 Dice point would be greatly appreciated.
I apologize if I appear a little rude, but optimization on the test set really should be avoided. Someone could create more than one team and get plenty of feedback.
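For illustration only, the kind of obfuscation being suggested could look roughly like this (the noise scale of one Dice point is a placeholder, not something the organizers have adopted):

```python
import random

def obfuscated_dice(true_dice, noise_std=0.01):
    """Return a noisy version of a true Dice score.

    Adds zero-mean Gaussian noise with a standard deviation of roughly
    one Dice point (0.01 on the 0-1 scale), then clips to [0, 1], so the
    reported value is informative but not exact.
    """
    noisy = true_dice + random.gauss(0.0, noise_std)
    return min(max(noisy, 0.0), 1.0)

# A true tumor Dice of 0.853 would typically be reported around 0.84-0.86.
print(obfuscated_dice(0.853))
```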
Best, Fabian