proycon / codemetapy

A Python package for generating and working with codemeta
https://codemeta.github.io/
GNU General Public License v3.0

codemetapy fails to merge triples for the same person #43

Closed apirogov closed 11 months ago

apirogov commented 1 year ago

File in1.json:

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "author": [
    {
      "@id": "https://orcid.org/0000-1234-5678-9101",
      "@type": "Person",
      "familyName": "Doe",
      "givenName": "John"
    }
  ],
  "codeRepository": "https://github.com/example/repository",
  "description": "an example",
  "name": "example",
  "version": "0.1.0"
}

File in2.json:

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "author": [
    {
      "email": "john.doe@example.com",
      "@type": "Person",
      "familyName": "Doe",
      "givenName": "John"
    }
  ],
  "codeRepository": "https://github.com/example/repository",
  "description": "an example",
  "name": "example",
  "version": "0.1.0"
}

Run codemetapy in1.json in2.json

Expected result:

Person will have both email and orcid

Actual result:

Person has only email (when passed in this order) or only orcid (when passing in2.json before in1.json)

broeder-j commented 1 year ago

@apirogov: The current compose in codemetapy is a simple overwrite on the triple level: triples that are not in the new graph are removed, and then there is an RDF merge. There is no entity resolution implemented in codemetapy, but this is also stated in the README.

I can imagine that one can do better.

A simple rdf merge could already be better (in some cases), but would not be enough, since it only works for objects with identifiers in both graphs. It would at least merge the email if the second person also had an ORCID as identifier, as a usual RDF merge would do; please check whether this is the case. I am not sure how blank nodes are handled in detail in codemetapy.
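
To illustrate the point about identifiers, here is a toy sketch of an RDF merge that models each graph as a plain Python set of (subject, predicate, object) tuples. It uses neither codemetapy nor an RDF library; the helpers `scope_blank_nodes`, `merge_graphs` and `properties` and the blank-node label `_:author` are hypothetical, and only the person triples from the two input files are modelled. Nodes identified by an IRI are shared between graphs, while blank nodes are renamed per graph so they stay distinct.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

ORCID = "https://orcid.org/0000-1234-5678-9101"


def scope_blank_nodes(graph: Set[Triple], prefix: str) -> Set[Triple]:
    """Rename blank nodes ('_:x') so they cannot collide with another graph's."""
    def rename(term: str) -> str:
        return f"_:{prefix}_{term[2:]}" if term.startswith("_:") else term
    return {(rename(s), p, rename(o)) for s, p, o in graph}


def merge_graphs(a: Set[Triple], b: Set[Triple]) -> Set[Triple]:
    """Toy RDF merge: union of all triples, blank nodes kept apart per graph."""
    return scope_blank_nodes(a, "g1") | scope_blank_nodes(b, "g2")


def properties(graph: Set[Triple], subject: str) -> Dict[str, str]:
    """Collect the predicate/object pairs of one subject."""
    return {p: o for s, p, o in graph if s == subject}


# Person from in1.json: identified by an ORCID IRI.
in1 = {
    (ORCID, "familyName", "Doe"),
    (ORCID, "givenName", "John"),
}

# Person from in2.json as posted: no @id, so it is a blank node.
in2_blank = {
    ("_:author", "email", "john.doe@example.com"),
    ("_:author", "familyName", "Doe"),
    ("_:author", "givenName", "John"),
}

# Same person, assuming in2.json also carried the ORCID as @id.
in2_orcid = {
    (ORCID, "email", "john.doe@example.com"),
    (ORCID, "familyName", "Doe"),
    (ORCID, "givenName", "John"),
}

merged_blank = merge_graphs(in1, in2_blank)   # two separate person nodes
merged_orcid = merge_graphs(in1, in2_orcid)   # one person node with email
```

In this model `merged_blank` still contains two person nodes, whereas `merged_orcid` has a single node carrying both the ORCID and the email. Whether codemetapy behaves the same way depends on how it handles blank nodes, which this sketch does not cover.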

proycon commented 11 months ago

The current compose in codemetapy is a simple overwrite on the triple level: triples that are not in the new graph are removed, and then there is an RDF merge. There is no entity resolution implemented in codemetapy, but this is also stated in the README.

Correct, it overwrites the entire triple. This behaviour is by design so you can compose a codemeta file from multiple input files, where the ordering determines which takes priority. This behaviour is used by codemeta-harvester.
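
As a rough illustration of ordering-based priority, the following sketch composes documents with a last-wins rule. It is a simplification that works on top-level JSON properties rather than RDF triples, and it is not codemetapy's actual code; the function `compose` is hypothetical and only mirrors the behaviour reported in this issue.

```python
from typing import Any, Dict, List


def compose(documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Toy last-wins composition: a later document replaces the whole
    value of any property that it also defines."""
    result: Dict[str, Any] = {}
    for doc in documents:
        result.update(doc)
    return result


# Reduced versions of in1.json and in2.json from the report.
in1_doc = {
    "name": "example",
    "author": [{
        "@id": "https://orcid.org/0000-1234-5678-9101",
        "@type": "Person",
        "familyName": "Doe",
        "givenName": "John",
    }],
}
in2_doc = {
    "name": "example",
    "author": [{
        "email": "john.doe@example.com",
        "@type": "Person",
        "familyName": "Doe",
        "givenName": "John",
    }],
}

forward = compose([in1_doc, in2_doc])    # in2.json takes priority
backward = compose([in2_doc, in1_doc])   # in1.json takes priority
```

With this rule `forward` keeps only the author with the email and `backward` keeps only the author with the ORCID, matching the actual result described above.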

A simple rdf merge could already be better (in some cases), but would not be enough, since it only works for objects with identifiers in both graphs.

Yes. If you want a merge, the only way to do so currently is to ensure the authors have the same @id. So if everything already has ORCIDs, it'll work fine. I realize it's sub-optimal and some better mechanism could be implemented.

However, merging multiple instances of persons is trickier than it might seem. Names are not always consistent (an extra middle name, a missing diacritic, etc.). Then which do you choose? We definitely don't want to end up with multiple givenName and familyName properties. Multiple emails or URLs may be ok.
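
One possible policy for this, sketched under stated assumptions: treat name properties as single-valued (the later record wins) and let other properties such as email accumulate. The function `merge_person` and the `SINGLE_VALUED` set are hypothetical and not part of codemetapy, and the sketch assumes it has already been decided that both records describe the same person, which is the hard part.

```python
from typing import Any, Dict, List

# Assumption: properties that must never end up with several values.
SINGLE_VALUED = {"@id", "@type", "givenName", "familyName"}


def as_list(value: Any) -> List[Any]:
    """Wrap a scalar in a list; leave lists untouched."""
    return value if isinstance(value, list) else [value]


def merge_person(first: Dict[str, Any], second: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two records of the same person; `second` takes priority
    for single-valued properties, other values are accumulated."""
    merged = dict(first)
    for key, value in second.items():
        if key not in merged or key in SINGLE_VALUED:
            merged[key] = value
        else:
            combined = as_list(merged[key])
            combined = combined + [v for v in as_list(value) if v not in combined]
            merged[key] = combined[0] if len(combined) == 1 else combined
    return merged


in1_author = {
    "@id": "https://orcid.org/0000-1234-5678-9101",
    "@type": "Person",
    "familyName": "Doe",
    "givenName": "John",
}
in2_author = {
    "email": "john.doe@example.com",
    "@type": "Person",
    "familyName": "Doe",
    "givenName": "John",
}
merged = merge_person(in1_author, in2_author)

# Hypothetical third record with an inconsistent name and another email.
variant = {
    "@type": "Person",
    "familyName": "Doe",
    "givenName": "John A.",
    "email": "jdoe@example.org",
}
merged_again = merge_person(merged, variant)
```

Here `merged` carries both the ORCID and the email, and `merged_again` keeps a single givenName (the later one) while collecting both emails in a list. Which name variant should win is a design choice this sketch settles only by input order.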

Another challenge arises with a graph of multiple SoftwareSourceCode instances (which codemetapy supports) where an author appears in multiple projects: what if he/she has a different affiliation in each of those contexts?

proycon commented 11 months ago

Closing as 'invalid' since it's not a bug but by design. But of course the question and discussion itself (feel free to continue here) is very valid, and a better solution may be devised.