Open kensei-te opened 2 years ago
Good point! This is one of the goals of the guidelines.
For case 1) we can define these as "invalid boxes": the box either misses some information or contains too much.
Here are some examples:
Input: "In the doped La Fe we noticed that..."
Example 1: the extracted material is "La Fe", which is missing the doping information ("doped").
Example 2: the extracted material is "doped La Fe we noticed", which contains too much text.
Case 2) is a special case of 1). For example:
Input: "In the doped La Fe we noticed that..." We assume that the material is correctly extracted as "doped La Fe".
Example 1: the post-processed formula is "La Fe", and this is correct.
Example 2: the post-processed formula is "La", or anything else that is not correct.
For case 3) we will have to sort out the post-processing by picking up information scattered across the paper. An example has already been discussed.
Cases 1 and 3 are clear, I think. Case 2 could be tricky because it requires the curator to know which types of post-processing are performed.
To obtain a neat, ready-to-use dataset for machine learning from text data mining, two steps are needed.
First, the item of interest has to be properly extracted. Second, it has to be properly post-processed.
During the curation process, I want to clearly distinguish extraction problems from post-processing problems. Even now, every "status" or "error type" falls into one of the two, but I want to make this explicit.
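To make the distinction concrete, here is a minimal sketch in Python of how a curated item could be assigned to one of the two problem types. All names (`Status`, `CuratedItem`, `classify`) are hypothetical and only illustrate the idea; they are not part of the existing tool. It assumes the curator supplies the expected span and the expected formula.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Hypothetical status labels, not the tool's actual ones."""
    VALID = "valid"
    EXTRACTION_ERROR = "extraction error"            # case 1: invalid box
    POSTPROCESSING_ERROR = "post-processing error"   # case 2


@dataclass
class CuratedItem:
    extracted: str         # text span picked up by the extractor
    expected_span: str     # span the curator considers correct
    formula: str           # formula produced by post-processing
    expected_formula: str  # formula the curator considers correct


def classify(item: CuratedItem) -> Status:
    # Check the extraction first: post-processing cannot be judged
    # fairly when the extracted span itself is wrong.
    if item.extracted != item.expected_span:
        return Status.EXTRACTION_ERROR
    if item.formula != item.expected_formula:
        return Status.POSTPROCESSING_ERROR
    return Status.VALID


# The "doped La Fe" examples from this thread:
too_short = CuratedItem("La Fe", "doped La Fe", "La Fe", "La Fe")
bad_formula = CuratedItem("doped La Fe", "doped La Fe", "La", "La Fe")
all_good = CuratedItem("doped La Fe", "doped La Fe", "La Fe", "La Fe")
```

Checking the span before the formula means an item is never counted as a post-processing problem when the real cause is a wrong box.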
Luca is already kindly performing several post-processing steps on the extracted items, but the data are still not fully ready to use. I also want to discuss which part will be taken care of by Luca and which part might be our task.
I mean, every curated item will be divided into 3