BerkvensNick opened 2 months ago
Additionally, we need to decide how to implement a "minimal implementation" for the first-iteration demonstration.
For the first iteration, we focus on academic resources. The risk of conflicts is low, thanks to proper identification with DOIs. I expect more conflicts when we harvest government sources from INSPIRE / open data portals.
Discuss in sprint 3, once the relational database with the new setup is in place; implement in sprint 4.
In case of conflict it is important to go back to the point of truth, the originating platform, to capture the latest situation. In case of conflicting statements, it is important to store both statements.

Duplicates, and the content differences between them, can be classified as:
[ ] the same identifier occurring in the knowledge graph but harvested from two or more different repositories; this can probably be detected with a SPARQL query and flagged with a label in the graph
[ ] the same identifier occurring in the knowledge graph but harvested from two or more different repositories, with different content in the same fields (a duplicate of type "conflict") or additional content in other fields (a duplicate of type "extension")
[ ] different identifiers but similar datafield content (file name, file size, file type, owner, description, date created/modified); maybe start by identifying items with the same file name or title? These items will probably have to be flagged using NLP techniques (see the paragraph "Duplicates identification" in https://main.soilwise-documentation.pages.dev/technical_components/interlinker/)
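The classification above could be sketched as follows. This is a minimal illustration only, with an assumed record shape and hypothetical helper names (`classify_fields`, `similar_titles`): in the live system the first check would be a SPARQL query over the knowledge graph, and the title comparison would use proper NLP rather than a simple string ratio; here plain Python dicts stand in for harvested metadata records.

```python
from difflib import SequenceMatcher

def classify_fields(a: dict, b: dict) -> dict:
    """Compare two records sharing an identifier, field by field:
    same field with a different value -> "conflict",
    field present in only one record -> "extension"."""
    conflicts, extensions = [], []
    for field in sorted(set(a) | set(b)):
        if field in ("identifier", "source"):
            continue  # identity/provenance fields, not content
        if field in a and field in b:
            if a[field] != b[field]:
                conflicts.append(field)   # duplicate of type "conflict"
        else:
            extensions.append(field)      # duplicate of type "extension"
    return {"conflicts": conflicts, "extensions": extensions}

def similar_titles(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Candidate duplicates with *different* identifiers: crude fuzzy
    title match as a stand-in for the NLP techniques mentioned above."""
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold

records = [
    {"identifier": "doi:10.1/abc", "source": "repoA",
     "title": "Soil organic carbon map", "format": "GeoTIFF"},
    {"identifier": "doi:10.1/abc", "source": "repoB",
     "title": "Soil organic carbon map", "format": "COG",
     "license": "CC-BY"},
    {"identifier": "internal:42", "source": "repoC",
     "title": "Soil Organic Carbon Map"},
]

# Same identifier harvested from two sources -> flag and classify the differences
print(classify_fields(records[0], records[1]))
# -> {'conflicts': ['format'], 'extensions': ['license']}

# Different identifiers but near-identical titles -> candidate for NLP flagging
print(similar_titles(records[0], records[2]))
# -> True
```

Note that both statements of a conflicting field would be kept alongside their `source`, consistent with the point-of-truth approach described above.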
Definition of Done