Open caifand opened 5 years ago
So the rule should be:
software_name never includes a preceding publisher.
If there is an "instrument-like" citation following the software_name, then we code creator in there but not preceding publishers.
MS <rs type="software_name">Excel</rs> (<rs type="creator">Microsoft Corporation</rs>, Redmond, WA)
<rs type="software_name">Excel</rs> by <rs type="creator">Microsoft</rs>
Otherwise we code creator in the preceding publisher.
Calculations were made using <rs type="creator">MS</rs> <rs type="software">Excel</rs>.
We used <rs type="creator">GraphPad</rs> <rs type="software">Prism</rs>.
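To make the priority explicit, here is a minimal sketch of the rule as I read it. The function name `annotate` and the tuple layout are made up for illustration and are not part of any softcite tooling; a label of None means the span is left unannotated.

```python
def annotate(software_name, preceding_publisher=None, following_creator=None):
    """Return (text, label) pairs in reading order under the proposed rule.

    Hypothetical helper: software_name never absorbs the preceding publisher,
    and an instrument-like creator after the name takes priority over it.
    """
    spans = []
    if preceding_publisher:
        # The preceding publisher is coded as creator only when no
        # instrument-like creator follows the software name.
        label = None if following_creator else "creator"
        spans.append((preceding_publisher, label))
    spans.append((software_name, "software_name"))
    if following_creator:
        spans.append((following_creator, "creator"))
    return spans


# "MS Excel (Microsoft Corporation, Redmond, WA)": MS stays unannotated.
with_citation = annotate("Excel", "MS", "Microsoft Corporation")
# "We used GraphPad Prism.": GraphPad becomes the creator.
without_citation = annotate("Prism", "GraphPad")
```

Either way the software_name span stays a single string and nothing overlaps.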
What do you think @kermitt2?
This is indeed exactly the rules I tried to follow for having some consistency - except for the raised cases like GraphPad Prism and Lotus Notes for which the "publisher" name is so commonly attached to the actual software name that it's only after reviewing many paragraphs that I realized that the rule was not applied.
I think it makes sense however to apply the rules systematically, so having
We used <rs type="creator">GraphPad</rs> <rs type="software">Prism</rs>.
and
<rs type="creator">Lotus</rs> <rs type="software">Notes</rs>
I have to confess also a bias :)
I think I kept those few exceptions like Lotus Notes because I had in mind the problem of disambiguation/matching of the software mention in existing software knowledge bases. I know that after extracting all software name mentions, we want to deduplicate them and match them to a software "entity".
If you look at the "labels" for the Wikidata entity for Lotus Notes, at https://www.wikidata.org/wiki/Q60198 you see that they all contain the publisher name. So having this bias helps the matching; not having Lotus and just Notes makes the matching a bit more complicated, as we need to combine different extracted fields.
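A rough sketch of what "combining different extracted fields" could look like at matching time. The `kb_labels` dict is a toy stand-in for a label index built from a knowledge base, not a real Wikidata client, and both function names are hypothetical.

```python
def candidate_labels(software_name, creator=None):
    """Build lookup strings, most specific first (hypothetical helper)."""
    labels = []
    if creator:
        # Recombine the separately annotated fields into one label.
        labels.append(f"{creator} {software_name}")
    labels.append(software_name)
    return labels


def match_entity(software_name, creator, kb_labels):
    """Return the first entity id whose label matches, or None.

    kb_labels maps a lowercased label to an entity id; this is an
    assumed local index used only for illustration.
    """
    for label in candidate_labels(software_name, creator):
        entity = kb_labels.get(label.lower())
        if entity is not None:
            return entity
    return None


# Toy index: every label of the entity carries the publisher name.
kb_labels = {"lotus notes": "Q60198"}

matched = match_entity("Notes", "Lotus", kb_labels)
unmatched = match_entity("Notes", None, kb_labels)
```

So with separate annotations the match still works, but only when the creator field is available to rebuild the full label.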
Yeah, I mean the reality is that some software has the publisher in the name, and even the publisher uses that. I definitely see that. But we need some sort of consistency here, no?
I am moving some existing issues into standalone posts to increase their visibility. I am also wondering whether the additional corrections should be made into new rules for future annotation work.
The first one is what we've debated for some time. For software_name annotations like Microsoft Excel, GraphPad Prism, Lotus Notes, we've annotated the creator name inside the software_name as a separate entity in post-processing. Apart from their semantic difference and the introduced ambiguities, one big concern brought up by @kermitt2 earlier is to avoid overlapping annotations, since they become knotty in TEI XML.
Currently in our dataset, GraphPad Prism is usually put together in software_name. In some cases Microsoft is separately annotated as creator while the corresponding software_name is annotated as Excel; but we also have tricky examples like MS+Excel/MS Excel/Microsoft+Office Excel, etc. e.g.:
Sometimes there's additional creator info accompanying the mention, and @kermitt2 treats it as the primary creator information to be annotated in the current candidate release. The first occurrence of Microsoft before Excel is thus skipped in such cases. For instance:
If we strictly limit each annotation field to one annotated string, then we would want to set a rule here establishing this priority (e.g., annotate the full organizational name and ignore the Microsoft before the software name) for future annotation. The same applies to GraphPad Prism.
Generally speaking, I think it's reasonable to separate GraphPad from Prism and do the same thing to Microsoft/MS Excel. Perhaps IBM Notes is a hypothesized example as we only have one instance of Lotus Notes in the candidate tei xml. Even if it occurs, seems to me annotating separate entities here is better?
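For the existing annotations, the separation could be done in post-processing along these lines. This is only a sketch: `KNOWN_PUBLISHERS` is an illustrative list I made up, not something taken from the dataset, and it only handles a publisher followed by a single space.

```python
# Illustrative list only; a real one would have to come from the data.
KNOWN_PUBLISHERS = ("Microsoft", "MS", "GraphPad", "Lotus")


def split_publisher(mention):
    """Split a leading known publisher off a software_name string.

    Returns (creator, software_name); creator is None when no known
    publisher prefix is found. Hypothetical helper for illustration.
    """
    for publisher in KNOWN_PUBLISHERS:
        prefix = publisher + " "
        if mention.startswith(prefix) and len(mention) > len(prefix):
            return publisher, mention[len(prefix):]
    return None, mention


prism = split_publisher("GraphPad Prism")
excel = split_publisher("MS Excel")
bare = split_publisher("Excel")
```

Variants like MS+Excel or Microsoft+Office Excel would not be caught by this and would still need manual review.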