We have detected duplicate entities in OpenCitations Meta across several entity types: Bibliographic Resources, Responsible Agents, and Identifiers. Below is a summary of the problem, with examples illustrating it.
See also issue https://github.com/opencitations/oc_meta/issues/24.
Summary of the Problem:
Bibliographic Resources (and related Identifier entities): In some cases, multiple Bibliographic Resource (BR) entities are linked to the same identifier value (e.g., DOI, ISSN). This amounts to duplication: for example, one journal article is represented by several distinct entities that all carry the same DOI. The issue arises in one of two ways:
Multiple Identifier entities carry the same value (i.e. the Identifier entities are themselves duplicated).
A single Identifier entity is linked to multiple Bibliographic Resource entities.
Responsible Agents: Similar duplication occurs with Responsible Agents, such as authors. In some cases, multiple entities represent the same real-world individual and are all associated with the same ORCID identifier.
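The two mechanisms described above can be told apart mechanically. The following is a minimal sketch in Python, not the code used for the analysis: it assumes the data has already been exported as plain pairs, and the function name `find_duplicates`, the short entity labels (`id/1`, `br/1`, `ra/1`), and the identifier values are all made up for illustration.

```python
from collections import defaultdict


def find_duplicates(identifiers, links):
    """Report identifier values attached to more than one entity.

    identifiers: dict mapping an Identifier entity to (scheme, literal value)
    links: iterable of (entity, Identifier entity) pairs (hasIdentifier)
    """
    # Group Identifier entities by the (scheme, value) they carry.
    ids_by_value = defaultdict(set)
    for id_entity, scheme_and_value in identifiers.items():
        ids_by_value[scheme_and_value].add(id_entity)

    # Group BR/RA entities by the Identifier entity they point to.
    entities_by_id = defaultdict(set)
    for entity, id_entity in links:
        entities_by_id[id_entity].add(entity)

    report = {}
    for key, id_entities in ids_by_value.items():
        entities = set()
        for id_entity in id_entities:
            entities |= entities_by_id.get(id_entity, set())
        if len(entities) < 2:
            continue  # value used by at most one entity: no duplication
        causes = []
        if len(id_entities) > 1:
            causes.append("duplicate identifier entities")
        if any(len(entities_by_id.get(i, ())) > 1 for i in id_entities):
            causes.append("shared identifier entity")
        report[key] = {
            "identifiers": sorted(id_entities),
            "entities": sorted(entities),
            "causes": causes,
        }
    return report


# Hypothetical sample data; none of these labels or values come from Meta.
identifiers = {
    "id/1": ("doi", "10.1234/example"),
    "id/2": ("doi", "10.1234/example"),   # second Identifier, same value
    "id/3": ("issn", "0000-0000"),
    "id/4": ("orcid", "0000-0000-0000-0000"),
}
links = [
    ("br/1", "id/1"), ("br/2", "id/2"),   # two BRs via duplicate Identifiers
    ("br/3", "id/3"), ("br/4", "id/3"),   # two BRs sharing one Identifier
    ("ra/1", "id/4"),                     # a single RA: not a duplicate
]

for key, info in find_duplicates(identifiers, links).items():
    print(key, info["causes"], info["entities"])
```

In this toy input the DOI is flagged because two Identifier entities carry it, the ISSN because one Identifier entity is shared by two resources, and the ORCID is not flagged at all. The same grouping applies unchanged to Responsible Agents, since only the entity labels differ.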
Example SPARQL Query for ISSN Duplication:
The following SPARQL query retrieves examples of duplicate Bibliographic Resource entities connected to the same ISSN. It returns up to 10 cases in which two distinct Bibliographic Resources are linked to the same Identifier entity, whose value is an ISSN:
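PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
SELECT DISTINCT ?id (?lit AS ?ISSN) ?br1 ?br2
WHERE {
  # Identifier entities whose scheme is ISSN, with their literal value
  ?id datacite:usesIdentifierScheme datacite:issn;
      literal:hasLiteralValue ?lit.
  # Two resources that point to the same Identifier entity
  ?br1 datacite:hasIdentifier ?id.
  ?br2 datacite:hasIdentifier ?id.
  # Ordered comparison: excludes ?br1 = ?br2 and lists each pair only once
  FILTER(STR(?br1) < STR(?br2))
}
LIMIT 10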
Current Results:
As part of a preliminary analysis, the scale of the problem was quantified:
The following SPARQL query was used to obtain the count of affected bibliographic resources: