opencitations / oc_meta

Duplicate Entities Detected in OpenCitations Meta #28

Open eliarizzetto opened 2 months ago

Issue Description:

We have detected duplicate entities in OpenCitations Meta across several entity types: Bibliographic Resources, Responsible Agents, and Identifiers. Below is a summary of the problem, with examples illustrating it. See also issue https://github.com/opencitations/oc_meta/issues/24.

Summary of the Problem:

  1. Bibliographic Resources (and related Identifier entities): In some cases, multiple Bibliographic Resource (BR) entities are linked to the same identifier value (e.g., DOI, ISSN). This amounts to duplication: the same resource (e.g., a journal article) ends up represented by distinct entities that are all associated with the same DOI. It arises in one of two ways:

    • Multiple Identifier entities connected to the same value (i.e. also the Identifier entities are duplicates).
    • A single Identifier entity being linked to multiple Bibliographic Resource entities.
  2. Responsible Agents: Similar duplication occurs with Responsible Agents, such as authors. In some cases, multiple entities represent the same real-world individual and are all associated with the same ORCID identifier.
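The two causes listed for Bibliographic Resources can be told apart mechanically. The following is a minimal Python sketch, not part of oc_meta: the function name `classify_duplicates`, the tuple layout, and all sample values are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict


def classify_duplicates(links):
    """Separate the two duplication patterns described above.

    `links` is an iterable of (br, identifier_entity, scheme, value) tuples.
    This input shape is an assumption made for the sketch.
    """
    brs_by_id = defaultdict(set)     # Identifier entity -> BRs pointing to it
    ids_by_value = defaultdict(set)  # (scheme, value) -> Identifier entities

    for br, id_entity, scheme, value in links:
        brs_by_id[id_entity].add(br)
        ids_by_value[(scheme, value)].add(id_entity)

    # Pattern: a single Identifier entity linked to multiple BRs.
    shared_id = {i: brs for i, brs in brs_by_id.items() if len(brs) > 1}
    # Pattern: multiple Identifier entities carrying the same value.
    duplicate_ids = {k: ids for k, ids in ids_by_value.items() if len(ids) > 1}
    return shared_id, duplicate_ids


# Placeholder data, not taken from OpenCitations Meta.
links = [
    ("br/A", "id/1", "doi", "10.0000/x"),
    ("br/B", "id/1", "doi", "10.0000/x"),  # same Identifier entity, two BRs
    ("br/C", "id/2", "doi", "10.0000/y"),
    ("br/D", "id/3", "doi", "10.0000/y"),  # two Identifier entities, same value
]
shared_id, duplicate_ids = classify_duplicates(links)
```

Keeping the two patterns separate may matter for any later cleanup, since the second one also implies that the Identifier entities themselves would need merging.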

Example SPARQL Query for ISSN Duplication:

The following SPARQL query retrieves examples of duplicate Bibliographic Resource entities connected to the same ISSN. It returns 10 pairs of distinct Bibliographic Resources that are linked to the same Identifier entity, and therefore to the same ISSN:

PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>

# Pairs of distinct Bibliographic Resources that share one ISSN Identifier entity.
SELECT DISTINCT ?id (?lit AS ?ISSN) ?br1 ?br2
WHERE {
  ?id datacite:usesIdentifierScheme datacite:issn ;
      literal:hasLiteralValue ?lit .

  ?br1 datacite:hasIdentifier ?id .
  ?br2 datacite:hasIdentifier ?id .

  # Keep only pairs made of two different resources.
  FILTER(?br1 != ?br2)
}
LIMIT 10

Current Results:

| id | ISSN | br1 | br2 |
|---|---|---|---|
| https://w3id.org/oc/meta/id/06302944976 | 2214-1766 | https://w3id.org/oc/meta/br/06303150256 | https://w3id.org/oc/meta/br/06380151022 |
| https://w3id.org/oc/meta/id/0616014 | 0162-8828 | https://w3id.org/oc/meta/br/062503701865 | https://w3id.org/oc/meta/br/062203701946 |
| https://w3id.org/oc/meta/id/0616014 | 0162-8828 | https://w3id.org/oc/meta/br/0603903711 | https://w3id.org/oc/meta/br/062503701865 |
| https://w3id.org/oc/meta/id/06170244 | 1178-203X | https://w3id.org/oc/meta/br/062103762230 | https://w3id.org/oc/meta/br/06280185247 |
| https://w3id.org/oc/meta/id/0616081 | 1555-6654 | https://w3id.org/oc/meta/br/061203826853 | https://w3id.org/oc/meta/br/061606048 |
| https://w3id.org/oc/meta/id/061402866970 | 1809-9246 | https://w3id.org/oc/meta/br/061203801536 | https://w3id.org/oc/meta/br/061403009914 |
| https://w3id.org/oc/meta/id/06201832116 | 2007-865X | https://w3id.org/oc/meta/br/06203959225 | https://w3id.org/oc/meta/br/06103883902 |
| https://w3id.org/oc/meta/id/061401171 | 1229-5949 | https://w3id.org/oc/meta/br/0614039607 | https://w3id.org/oc/meta/br/061503913707 |
| https://w3id.org/oc/meta/id/06301140758 | 2212-5043 | https://w3id.org/oc/meta/br/06301094885 | https://w3id.org/oc/meta/br/06804190681 |
| https://w3id.org/oc/meta/id/0626020980 | 0873-2159 | https://w3id.org/oc/meta/br/062503758069 | https://w3id.org/oc/meta/br/0603903778 |
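The two rows for ISSN 0162-8828 show that pairs can chain: three different Bibliographic Resources are linked to id/0616014. A hedged Python sketch of how such pairs could be grouped into clusters of mutually duplicated resources, using a small union-find, follows. The function name `cluster_pairs` is hypothetical; the sample pairs are shortened from the rows above.

```python
def cluster_pairs(pairs):
    """Group (br1, br2) pairs into clusters of connected resources."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


# Pairs shortened from the result rows (prefix https://w3id.org/oc/meta/ omitted).
pairs = [
    ("br/06303150256", "br/06380151022"),
    ("br/062503701865", "br/062203701946"),
    ("br/0603903711", "br/062503701865"),
]
clusters = cluster_pairs(pairs)
```

With these three pairs, the sketch yields one cluster of three resources for ISSN 0162-8828 and one cluster of two for ISSN 2214-1766.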

As part of a preliminary analysis, the scale of the problem was quantified. The following SPARQL query counts the affected Bibliographic Resources, i.e. those that share an identifier value with at least one other Bibliographic Resource:

PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>

SELECT (COUNT(DISTINCT ?br) AS ?count) WHERE {
  {
    SELECT ?br (COUNT(DISTINCT ?br_other) AS ?shared_br_count) WHERE {
      ?br datacite:hasIdentifier ?id.
      ?id datacite:usesIdentifierScheme datacite:issn; # change with datacite:doi for DOIs
        literal:hasLiteralValue ?lit.
      ?br_other datacite:hasIdentifier ?id_other.
      ?id_other datacite:usesIdentifierScheme datacite:issn; # change with datacite:doi for DOIs
        literal:hasLiteralValue ?lit.
      FILTER(?br != ?br_other)
    }
    GROUP BY ?br
  }
  FILTER(?shared_br_count > 0 )
}
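Since the count query differs between ISSNs and DOIs only in the identifier scheme, it can be generated from a template. This is a minimal Python sketch under that assumption; `build_count_query`, `ALLOWED_SCHEMES`, and the `__SCHEME__` placeholder are names introduced here for illustration and are not part of oc_meta.

```python
# Only the two schemes mentioned in the query comments are allowed here.
ALLOWED_SCHEMES = {"issn", "doi"}

COUNT_QUERY_TEMPLATE = """\
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>

SELECT (COUNT(DISTINCT ?br) AS ?count) WHERE {
  {
    SELECT ?br (COUNT(DISTINCT ?br_other) AS ?shared_br_count) WHERE {
      ?br datacite:hasIdentifier ?id.
      ?id datacite:usesIdentifierScheme datacite:__SCHEME__;
        literal:hasLiteralValue ?lit.
      ?br_other datacite:hasIdentifier ?id_other.
      ?id_other datacite:usesIdentifierScheme datacite:__SCHEME__;
        literal:hasLiteralValue ?lit.
      FILTER(?br != ?br_other)
    }
    GROUP BY ?br
  }
  FILTER(?shared_br_count > 0)
}
"""


def build_count_query(scheme: str) -> str:
    """Return the count query for the given identifier scheme."""
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported identifier scheme: {scheme!r}")
    # str.replace is used instead of str.format because SPARQL uses braces.
    return COUNT_QUERY_TEMPLATE.replace("__SCHEME__", scheme)


issn_query = build_count_query("issn")
doi_query = build_count_query("doi")
```

Restricting the scheme to a fixed set avoids pasting arbitrary text into the query string.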