Open kermitt2 opened 5 years ago
Regarding the model example, I personally believe GTAP6inGAMS is actually a software package. There does exist a model called GTAP6; GTAP6inGAMS is the implementation of the GTAP6 model in the GAMS environment... Here's their package web page
But truly, a model is often a boundary case of software instantiation. You need to do a lot of background research to get it right, and perhaps carefully read the context to figure out whether, in EACH sentence, it denotes the abstract numeric/statistical model or its implementation in code...
Seems to me that we can keep annotating operating systems the way you handled them in the corpus file, i.e., make it an explicit rule that we only annotate an OS when it is mentioned as an individual case.
Another question: do we want to keep annotating non-named software/scripts? Currently they are not consistently annotated in the dataset. When annotating those in the future, we can leave the software_name of these instances empty but still annotate them as software mentions.
I left non-named software/scripts unannotated because we can't match them with an existing software database, we can't deduplicate them across mentions in several documents, and people can't cite them... so I think it's nearly impossible to exploit them, and there is no possible credit to be given for such non-named software.
We describe here the additional principles used to make the annotations more uniform in the release candidate.
We merged version-number and version-date into a single version type. version-date was actually very rare compared to all the other types.
The annotations of version only contain the version number, name or date, without any other token like "version", "v.", etc. or punctuation. The reasons are 1) existing annotations did not consistently include or exclude the additional "version", "v.", ..., 2) it's easier to remove extra tokens automatically than to add them, 3) for using the annotations, it's easier to have just the number (in particular for matching a software version in the KB), which avoids writing extra string processing to remove them.
creator annotations do not include the address part; they contain only the name of the organization and, if present, the type of business entity (Inc., GmbH, etc.). The reasons for this guideline are similar to those given above for version. An additional annotation like creator-address could be added, probably largely automatically, though a final manual check would be necessary.
Programming languages as such are not annotated. They were usually already not annotated, but not in every case, so we generalized the exclusion to have something uniform. However, we annotate the mention when it is used as an implemented "framework" (R environment, C++ compiler, ...).
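As an illustration of the version guidelines above, here is a minimal Python sketch of how the label merge and the removal of extra tokens could be automated. The function names, the list of prefixes, and the set of punctuation characters are assumptions made for this example; they are not taken from the dataset tooling.

```python
import re

# Label names as used in the discussion; the mapping itself is an assumption.
MERGED_LABELS = {"version-number": "version", "version-date": "version"}

# Leading tokens such as "version", "vers.", "ver." or "v." placed directly
# before a digit (assumed list, to be extended as needed).
_PREFIX = re.compile(r"^\s*(?:version|vers\.?|ver\.?|v\.?)\s*(?=\d)", re.IGNORECASE)

# Punctuation and whitespace trimmed from both ends of the span (assumed set).
_EDGE = " \t\n.,;:()[]"


def merge_label(label: str) -> str:
    """Map the two former version labels onto the single 'version' type."""
    return MERGED_LABELS.get(label, label)


def normalize_version(text: str) -> str:
    """Keep only the version number, name or date of an annotated span."""
    stripped = text.strip(_EDGE)
    stripped = _PREFIX.sub("", stripped)
    return stripped.strip(_EDGE)


if __name__ == "__main__":
    for raw in ["version 2.1", "v.3.0.1,", "(v10)", "Vista", "2019"]:
        print(repr(raw), "->", repr(normalize_version(raw)))
```

Because the prefix is only removed when a digit follows, version names that merely start with "v" (such as "Vista") are left untouched in this sketch.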
We always keep the acronym with the software name annotation:
This annotation pattern is now systematically used (it was the most frequent pattern already in the existing annotations, but not followed in around one third of the cases).
for instance in 10.1111/j.1467-8381.2009.02008.x
was corrected:
The software that uses the model is called GAMS.
The reasons are non-uniform annotations and facilitating further disambiguation/matching of the software package.
We considered that it's interesting to detect them, even if they won't be matched against an existing software in the populated knowledge base.
For instance, in PMC4927739, cathodoluminescence (CL) is a physical reaction used as a measurement method. CL is wrongly annotated as software in the following:
The software is not named, it controls the cathodoluminescence in order to make a measurement.
Same for cross-correlation electron backscatter diffraction (ccEBSD), wrongly annotated here:
The software is not named EBSD. There is a software mention ("control software"), but it remains unnamed.
Some wrong annotations of methods as software are obvious (PMC1933234):
However, exhaustiveness was hard to achieve in a reasonable time and there are still some databases and ambiguous cases annotated as software in the current version.
This annotation pattern is now normally used systematically (previously, for these cases, the choice of annotating one "Microsoft" or the other was essentially random).