Goal

The NLP Sandbox decomposes the PHI annotation task into smaller, modular tasks like the date annotation task, the person name annotation task, etc. One of the motivations is to enable tool developers to identify where their time would be best invested by looking at the leaderboard of each task. For example, if there are multiple solutions with a near-perfect score for the date annotation task but no satisfying solution yet for the person name annotation task, this indicates to developers that their time would be best spent working on a new solution for the person name annotation task.
Instead of visiting all the leaderboards to obtain this information, we could compile it into a small dashboard made of tiles (square or rectangular), one for each task. Each tile should include the following information:
- The name of the task
- The score of the best submission to this task
- A background color based on performance, e.g., from red (min score) to green (max score); one possible mapping is sketched below
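As a minimal sketch of the color mapping (assuming scores are normalized to [0, 1]; the function name and the linear RGB interpolation are illustrative choices, not part of the proposal):

```python
def score_to_color(score: float) -> str:
    """Map a score in [0, 1] to a hex color from red (0.0) to green (1.0)."""
    score = max(0.0, min(1.0, score))  # clamp to the expected range
    red = round(255 * (1 - score))
    green = round(255 * score)
    return f"#{red:02x}{green:02x}00"

print(score_to_color(0.10))  # reddish: a challenging task
print(score_to_color(0.95))  # greenish: a nearly solved task
```

Mid-range scores naturally fall in the orange/yellow band, which matches the reading suggested below.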
Therefore, just by looking at the color of the tiles, one would be able to identify the challenging tasks for which no satisfying solution has been submitted yet (red to orange tiles).
@andrewelamb We can discuss offline how to get the above information if needed.
Note that for each of the current tasks we report a score for two datasets, so we can either:

- Select one dataset that we consider representative of the other
- Average the scores obtained on the two datasets (ideally a weighted average; see the sketch after this list)
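For the weighted-average option, the computation itself is straightforward; the weights here are an assumption (e.g., proportional to each dataset's number of annotations):

```python
def combined_score(scores, weights):
    """Weighted average of per-dataset scores. The weights are an
    assumption, e.g., proportional to each dataset's number of annotations."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# e.g., scores on (i2b2, second dataset) with i2b2 weighted twice as much
print(combined_score([0.92, 0.80], [2, 1]))  # ~0.88
```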
Also, since we report more than one performance metric, we would need to select one.
And to complicate things further, we have tasks like the Location annotation task that report scores for two variants of the task. :)
Prototype
- For now, consider only the performance on the i2b2 dataset
- For now, consider only the tasks that have a single variant: date annotation and person name annotation
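Under that scope, a minimal sketch of how the prototype could assemble its tile data; the input structure, task identifiers, and the choice of F1 as the single metric are illustrative assumptions, not the actual leaderboard API:

```python
# Hypothetical leaderboard export: task -> dataset -> metric -> submission scores.
leaderboards = {
    "date-annotation": {"i2b2": {"f1": [0.97, 0.93, 0.88]}},
    "person-name-annotation": {"i2b2": {"f1": [0.71, 0.65]}},
}

def tile_data(leaderboards, dataset="i2b2", metric="f1"):
    """One (task, best score) pair per tile, restricted to a single
    dataset and metric as in the prototype scope."""
    return {
        task: max(by_dataset[dataset][metric])
        for task, by_dataset in leaderboards.items()
    }

print(tile_data(leaderboards))
# {'date-annotation': 0.97, 'person-name-annotation': 0.71}
```

Each resulting score would then feed the color mapping above to render its tile.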