howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Consistency: Should we exclude `creator` from `software_name`? #641

Open caifand opened 5 years ago

caifand commented 5 years ago

I am moving some existing issues into standalone posts to increase their visibility. I am also considering whether the additional corrections should be turned into new rules for future annotation work.

The first one is something we've debated for some time. For software_name annotations like Microsoft Excel, GraphPad Prism, Lotus Notes, we've annotated the creator name inside the software_name as a separate entity in post-processing. Apart from the semantic difference and the ambiguities this introduces, one big concern raised earlier by @kermitt2 is that we should avoid overlapping annotations, since they become knotty in TEI XML.

Currently in our dataset, GraphPad Prism is usually kept together in software_name. In some cases Microsoft is annotated separately as creator while the corresponding software_name is annotated as Excel, but we also have tricky examples like MS+Excel/MS Excel/Microsoft+Office Excel, etc. For example:

<p>All statistical analyses were performed using paired Student's t tests and <rs corresp="#PMC3025493-software-3" type="creator">Microsoft</rs> <rs type="software" xml:id="PMC3025493-software-3">Excel</rs> or <rs type="software">Prism</rs> software packages.

Calculations were made using <rs type="software">MS Excel</rs> and are presented in Appendix 1.

used to summarise the analytic outputs using <rs corresp="#PMC5435264-software-0" type="creator">MS</rs>
          <rs type="software" xml:id="PMC5435264-software-0">Excel</rs>.

Sometimes additional creator information follows the software name, and @kermitt2 treats it as the primary creator information to be annotated in the current candidate release. The first occurrence of Microsoft, before Excel, is thus skipped in such cases. For instance:

Observed heterozygosity was estimated in Microsoft <rs type="software" xml:id="PMC4103605-software-13">Excel</rs> (<rs corresp="#PMC4103605-software-13" type="creator">Microsoft Corporation</rs>, Redmond, Washington, USA).

If we strictly allow only one annotated string per annotation field, then we would want a rule that establishes this priority for future annotation (e.g., annotate the full organizational name and ignore the Microsoft before the software name). The same applies to GraphPad Prism.

Generally speaking, I think it's reasonable to separate GraphPad from Prism and do the same for Microsoft/MS Excel. IBM Notes is perhaps a hypothetical example, as we only have one instance of Lotus Notes in the candidate TEI XML. Even if it does occur, it seems to me that annotating separate entities is better here?
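
For anyone checking the existing files against the examples above, here is a minimal Python sketch (not part of the dataset tooling) that reads one TEI paragraph, pairs each creator with the software it points to via corresp and xml:id, and flags nested rs elements, i.e. the overlapping annotations we want to avoid. The function names are hypothetical, and the sample paragraph is the first example above with a closing `</p>` added so that it parses.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<p>All statistical analyses were performed using paired Student's t tests and <rs corresp="#PMC3025493-software-3" type="creator">Microsoft</rs> <rs type="software" xml:id="PMC3025493-software-3">Excel</rs> or <rs type="software">Prism</rs> software packages.</p>"""


def linked_creators(paragraph_xml):
    """Return (creator_text, software_text) pairs linked via corresp -> xml:id."""
    root = ET.fromstring(paragraph_xml)
    # Index every software annotation that carries an xml:id.
    software_by_id = {
        rs.get(XML_ID): "".join(rs.itertext())
        for rs in root.iter("rs")
        if rs.get("type") == "software" and rs.get(XML_ID)
    }
    pairs = []
    for rs in root.iter("rs"):
        if rs.get("type") == "creator":
            # corresp holds a "#id" pointer to the software annotation.
            target = (rs.get("corresp") or "").lstrip("#")
            if target in software_by_id:
                pairs.append(("".join(rs.itertext()), software_by_id[target]))
    return pairs


def has_nested_rs(paragraph_xml):
    """True if any rs element contains another rs element."""
    root = ET.fromstring(paragraph_xml)
    return any(
        inner is not outer
        for outer in root.iter("rs")
        for inner in outer.iter("rs")
    )


print(linked_creators(SAMPLE))
print(has_nested_rs(SAMPLE))
```

Note that Prism in the sample has no linked creator, so it does not appear in the pairs.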

jameshowison commented 5 years ago

So the rule should be:

software_name never includes a preceding publisher.

If there is an "instrument-like" citation following the software_name, then we code the creator there and do not code preceding publishers.

MS <rs type="software_name">Excel</rs> (<rs type="creator">Microsoft Corporation</rs>, Redmond, WA)
<rs type="software_name">Excel</rs> by <rs type="creator">Microsoft</rs>

Otherwise we code the preceding publisher as creator.

Calculations were made using <rs type="creator">MS</rs> <rs type="software">Excel</rs>.
We used <rs type="creator">GraphPad</rs> <rs type="software">Prism</rs>.
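
The rule above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the PUBLISHERS set is a hypothetical lookup built from the names mentioned in this thread, and apply_rule is a made-up helper, not something in the annotation pipeline. A creator found in a following "instrument-like" citation takes priority; otherwise the preceding publisher is coded as creator.

```python
# Hypothetical lookup of leading publisher tokens, taken from this thread only.
PUBLISHERS = {"Microsoft", "MS", "GraphPad", "Lotus"}


def apply_rule(mention, trailing_creator=None, publishers=PUBLISHERS):
    """Split a mention into software_name and creator following the proposed rule.

    mention: the software mention as written, e.g. "MS Excel".
    trailing_creator: creator taken from a following citation, if any,
        e.g. "Microsoft Corporation" in "(Microsoft Corporation, Redmond, WA)".
    """
    words = mention.split()
    leading = None
    # software_name never includes a preceding publisher.
    if len(words) > 1 and words[0] in publishers:
        leading, words = words[0], words[1:]
    software_name = " ".join(words)
    # A following citation wins; otherwise fall back to the preceding publisher.
    creator = trailing_creator if trailing_creator else leading
    return {"software_name": software_name, "creator": creator}


print(apply_rule("MS Excel"))
print(apply_rule("MS Excel", trailing_creator="Microsoft Corporation"))
print(apply_rule("GraphPad Prism"))
```

In the second call the leading MS is left unannotated, matching the first example above.
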
jameshowison commented 5 years ago

What do you think @kermitt2?

kermitt2 commented 5 years ago

These are indeed exactly the rules I tried to follow for consistency, except for the cases raised here, like GraphPad Prism and Lotus Notes, where the "publisher" name is so commonly attached to the actual software name that only after reviewing many paragraphs did I realize the rule had not been applied.

I think it makes sense however to apply the rules systematically, so having

We used <rs type="creator">GraphPad</rs> <rs type="software">Prism</rs>.

and

<rs type="creator">Lotus</rs> <rs type="software">Notes</rs>
kermitt2 commented 5 years ago

I have to confess also a bias :)

I think I kept those few exceptions like Lotus Notes because I had in mind the problem of disambiguating/matching the software mentions against existing software knowledge bases. I know that after extracting all software name mentions, we want to deduplicate them and match them to a software "entity".

If you look at the "labels" for the Wikidata entity for Lotus Notes, at https://www.wikidata.org/wiki/Q60198, you see that they all contain the publisher name. So this bias helps the matching: with Lotus and Notes annotated separately, the matching becomes a bit more complicated, as we need to combine different extracted fields.
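
To make the "combine different extracted fields" step concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the label table is typed in by hand rather than fetched from Wikidata, and candidate_names and match_entity are hypothetical helpers. The idea is simply to try "creator + software_name" before falling back to the bare software_name.

```python
# Illustrative labels only, typed by hand; not retrieved from Wikidata.
ILLUSTRATIVE_LABELS = {
    "Q60198": ["Lotus Notes", "IBM Notes"],
}


def candidate_names(software_name, creator=None):
    """Strings to try against the knowledge base, most specific first."""
    names = [software_name]
    if creator:
        names.insert(0, f"{creator} {software_name}")
    return names


def match_entity(software_name, creator, labels_by_entity):
    """Return the first entity id whose labels contain a candidate name, else None."""
    for name in candidate_names(software_name, creator):
        key = name.casefold()
        for entity_id, labels in labels_by_entity.items():
            if key in {label.casefold() for label in labels}:
                return entity_id
    return None


print(match_entity("Notes", "Lotus", ILLUSTRATIVE_LABELS))
print(match_entity("Notes", None, ILLUSTRATIVE_LABELS))
```

With the creator field available the separated annotation still matches; with only Notes it does not, which is the complication described above.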

jameshowison commented 5 years ago

Yeah, I mean the reality is that some software has the publisher in the name, and even the publisher uses that. I definitely see that. But we need some sort of consistency here, no?

(Sent by email in reply to Patrice Lopez's comment above, dated Tue, Oct 29, 2019 at 6:50 PM.)