lfoppiano / supercon2

Staging-area for automatically collected experimental data for the SuperCon database with a curation interface with enhanced-document viewer and curation-ready interface
https://supercon2.readthedocs.io

Distinguish extraction problem from post-processing problem #80

Open kensei-te opened 2 years ago

kensei-te commented 2 years ago

Obtaining a neat, ready-to-use dataset for machine learning from text data mining takes two steps.

First, the item of interest has to be properly extracted. Second, it has to be properly post-processed.

During the curation process, I want to clearly distinguish extraction problems from post-processing problems. Even now, every "status" or "error type" falls into one of the two, but I want to make the distinction explicit.

Luca is already kindly performing several post-processing steps on the extracted items, but the data are still not fully ready to use. I also want to discuss which parts will be taken care of by Luca and which parts might be our task.

I mean that every curated item would fall into one of 3 categories:

  1. will be solved by improving extraction
  2. will be solved by a post-processing method by Luca (therefore this may be provided in the open version of supercon2)
  3. will be solved by a post-processing method by the user (this might be Takano-Gr original)

It would be great if we could distinguish them during the curation. I hope we can discuss this in the coming meeting.

lfoppiano commented 2 years ago

Good point! This is one of the goals of the guidelines.

For case 1) we can define these as "invalid boxes": the box is missing some information or contains too much information.

Here are some examples:

Input: "In the doped La Fe we noticed that..."

Example 1: the extracted material is "La Fe", which is missing "doped"

Example 2: the extracted material is "doped La Fe we noticed"
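The two kinds of invalid box above could be told apart automatically when a curator supplies the span they expected. This is only a minimal sketch: `BoxStatus` and `check_box` are hypothetical names for illustration, not part of supercon2, and the plain substring comparison is an assumption about how "missing" and "too much" might be detected.

```python
from enum import Enum


class BoxStatus(Enum):
    VALID = "valid"
    MISSING_INFO = "missing information"
    TOO_MUCH_INFO = "too much information"
    INVALID = "invalid"


def check_box(extracted: str, expected: str) -> BoxStatus:
    """Compare an extracted span with the span the curator expects.

    Hypothetical helper: uses plain substring containment only.
    """
    extracted, expected = extracted.strip(), expected.strip()
    if extracted == expected:
        return BoxStatus.VALID
    if extracted in expected:
        # The box is a strict part of the expected span.
        return BoxStatus.MISSING_INFO
    if expected in extracted:
        # The box contains the expected span plus extra text.
        return BoxStatus.TOO_MUCH_INFO
    # The box does not line up with the expected span at all.
    return BoxStatus.INVALID


# The two examples from this comment:
print(check_box("La Fe", "doped La Fe"))
print(check_box("doped La Fe we noticed", "doped La Fe"))
```

A real check would probably compare character offsets in the document rather than strings, since the same text can occur more than once in a sentence.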

Case 2) is a special case of 1). For example:

Input: "In the doped La Fe we noticed that..." Here we assume that the material is correctly extracted as "doped La Fe".

Example 1: the post-processed formula is La Fe, and this is correct

Example 2: the post-processed formula is La, or anything else that is not correct.

For case 3) we will have to sort out the post-processing by picking up information scattered across the paper. An example has already been discussed.

I think 1 and 3 are clear. 2 could be tricky because it requires the curator to know which types of post-processing are performed.
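To make the three cases concrete, here is a rough sketch of how a curated item might be triaged. Everything in it is an assumption for illustration: `CuratedItem`, `ProblemSource`, `triage` and the `needs_document_context` flag are hypothetical and do not exist in supercon2. In particular, the flag stands in for the curator's judgement that the correct value depends on information scattered in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class ProblemSource(Enum):
    NONE = "no problem"
    EXTRACTION = "1: solve by improving extraction"
    BUILTIN_POSTPROCESSING = "2: solve in the built-in post-processing"
    USER_POSTPROCESSING = "3: solve in user-side post-processing"


@dataclass
class CuratedItem:
    expected_span: str        # what the curator says should be extracted
    extracted_span: str       # what was actually extracted
    expected_formula: str     # what the curator says the formula should be
    processed_formula: str    # what post-processing produced
    needs_document_context: bool = False  # curator judgement (assumed flag)


def triage(item: CuratedItem) -> ProblemSource:
    """Assign a curated item to one of the three cases discussed here."""
    if item.extracted_span.strip() != item.expected_span.strip():
        return ProblemSource.EXTRACTION
    if item.processed_formula.strip() == item.expected_formula.strip():
        return ProblemSource.NONE
    if item.needs_document_context:
        return ProblemSource.USER_POSTPROCESSING
    return ProblemSource.BUILTIN_POSTPROCESSING


# Case 2, example 2: correct box, wrong post-processed formula.
item = CuratedItem("doped La Fe", "doped La Fe", "La Fe", "La")
print(triage(item))
```

Extraction is checked first because a wrong box makes any judgement about post-processing unreliable. The split between cases 2 and 3 still depends on the curator knowing which post-processing steps are performed, which is the tricky part mentioned above.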