Open kermitt2 opened 5 years ago
Hmmm, yes I see the issue here. Certainly the dates in bibliographical reference callouts should not be coded as version-date. So we could programmatically remove any version-date codes that overlap with a bibliographical reference callout (assuming that the TEI XML marks the callouts in some way?)
OTOH, it is appropriate to code the names in a bibliographical reference callout as creator (on the logic that having their names there gives them credit for the code). So that seems harder to fix. Perhaps programmatically, though, we could identify any bibliographical reference callout in a sentence with a mention (i.e. a software name). Then we could either a) auto-code any names in those callouts as creator, or b) manually review them to know which are actually the creators of the software.
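A rough sketch of how that review list could be built, in case it helps. It rests on assumptions I have not checked against the corpus: that callouts are encoded as `<ref type="bibr">`, that annotations are `<rs type="software">` and `<rs type="creator">`, and that the paragraph (`<p>`) is an acceptable stand-in for the sentence. The function name `callouts_near_mentions` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without any XML namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def callouts_near_mentions(root, unit="p"):
    """List callouts that share a text unit with a software mention.

    Assumes (not verified against the corpus) <ref type="bibr"> for
    callouts and <rs type="software"> / <rs type="creator"> annotations.
    """
    rows = []
    for block in root.iter():
        if local(block.tag) != unit:
            continue
        software = [
            "".join(e.itertext())
            for e in block.iter()
            if local(e.tag) == "rs" and e.get("type") == "software"
        ]
        if not software:
            continue  # no software mention in this unit, nothing to review
        for ref in block.iter():
            if local(ref.tag) == "ref" and ref.get("type") == "bibr":
                creators = [
                    "".join(e.itertext())
                    for e in ref.iter()
                    if local(e.tag) == "rs" and e.get("type") == "creator"
                ]
                rows.append({
                    "software": software,
                    "callout": "".join(ref.itertext()),
                    "creators": creators,
                })
    return rows
```

Each returned row could then feed either option: auto-code the names (a) or go into a manual review sheet (b).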
For creator, though, I deeply suspect that, without looking at the cited article specifically, a great many of the names in bibliographical reference callouts would not have been marked as creators. I.e. the information used to decide whether a name in any particular bibliographical reference callout is a creator lies outside the information given to the machine learning system (because we're not actually giving it the text of the cited paper, which the coders read/skimmed to make their decision). So that argues for dropping any creator tags in a bibliographical reference callout.
In terms of what we need for CiteAs etc, I think the creator is low priority (since we can work via the Grobid-recognized bibliographical reference callout), so I think the best route to consistency is to drop the creator tags in a bibliographical reference callout.
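The clean-up itself could look something like the sketch below. It is only a sketch under assumptions: that the callouts are `<ref type="bibr">` elements and that the offending annotations are `<rs type="creator">` / `<rs type="version-date">` elements nested inside them (annotations that wrap or only partially overlap a callout are not handled). `strip_tags_in_callouts` and its helpers are hypothetical names, not part of any existing tool.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without any XML namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def _append(parent, idx, text):
    """Append text to the content that ends just before child index idx."""
    if not text:
        return
    if idx == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + text


def _unwrap(parent, elem):
    """Replace elem by its own text and children, keeping the text order."""
    idx = list(parent).index(elem)
    kids = list(elem)
    parent.remove(elem)
    for offset, kid in enumerate(kids):
        parent.insert(idx + offset, kid)
    _append(parent, idx, elem.text)
    _append(parent, idx + len(kids), elem.tail)


def strip_tags_in_callouts(root, drop_types=("creator", "version-date")):
    """Unwrap <rs> annotations of the given types found inside callouts.

    Assumes callouts are <ref type="bibr"> (not verified for the corpus).
    Returns the number of annotations removed; the text itself is kept.
    """
    refs = [e for e in root.iter()
            if local(e.tag) == "ref" and e.get("type") == "bibr"]
    removed = 0
    for ref in refs:
        changed = True
        while changed:  # restart the scan after each structural change
            changed = False
            for parent in ref.iter():
                for child in list(parent):
                    if local(child.tag) == "rs" and child.get("type") in drop_types:
                        _unwrap(parent, child)
                        removed += 1
                        changed = True
                        break
                if changed:
                    break
    return removed
```

Annotations outside callouts are left alone, so a version-date given in running text survives. Passing `drop_types=("version-date",)` would keep the creator tags if we decide to review those manually instead.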
Annotations of creator within a bibliographical reference callout seem random:
10.1002%2Fpam.22030.software-mention.xml
10.1007%2Fs00191-010-0188-y.software-mention.xml
10.1007%2Fs10290-016-0264-y.software-mention.xml
10.1007%2Fs10663-015-9287-1.software-mention.xml
versus
10.1007%2Fs10683-017-9548-x.software-mention.xml
-> tagging version date using the publication year of the introduced bibliographical reference is not very well-founded imho.
10.1007%2Fs10258-013-0091-1.software-mention.xml
10.1007%2Fs11166-011-9127-z.software-mention.xml