BerkvensNick opened 2 months ago
Additionally, we need to decide how to implement a "minimal implementation" for the first-iteration demonstration.
For the first iteration, we focus on academic resources. The risk of conflicts is low, thanks to proper identification with DOIs. I expect more conflicts when we harvest government sources from INSPIRE / open data portals.
Discuss in sprint 3, once the relational database with the new setup is in place; implement in sprint 4.
In case of conflict it is important to go back to the point of truth, the originating platform, to capture the latest situation. In case of conflicting statements, it is important to store both statements.

Duplicates, and the content differences between them, can be classified as:
[ ] the same identifier occurring in the knowledge graph but harvested from two or more different repositories; this can probably be detected with a SPARQL query and flagged with a label in the graph
[ ] the same identifier occurring in the knowledge graph but harvested from two or more different repositories, with different content in the same fields (a duplicate of type "conflict") or additional content in other fields (a duplicate of type "extension")
[ ] different identifiers but similar datafield content (file name, file size, file type, owner, description, date created/modified); maybe start by identifying items with the same file name or title? These items will probably have to be flagged using NLP techniques (see the paragraph "Duplicates identification" in https://main.soilwise-documentation.pages.dev/technical_components/interlinker/)
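The classification above could be sketched as follows. This is a minimal illustration only, with an assumed record shape and hypothetical helper names (`classify_fields`, `similar_titles`): in the live system the first check would be a SPARQL query over the knowledge graph, and the title comparison would use proper NLP rather than a simple string ratio; here plain Python dicts stand in for harvested metadata records.

```python
from difflib import SequenceMatcher

def classify_fields(a: dict, b: dict) -> dict:
    """Compare two records sharing an identifier, field by field:
    same field with a different value -> "conflict",
    field present in only one record -> "extension"."""
    conflicts, extensions = [], []
    for field in sorted(set(a) | set(b)):
        if field in ("identifier", "source"):
            continue  # identity/provenance fields, not content
        if field in a and field in b:
            if a[field] != b[field]:
                conflicts.append(field)   # duplicate of type "conflict"
        else:
            extensions.append(field)      # duplicate of type "extension"
    return {"conflicts": conflicts, "extensions": extensions}

def similar_titles(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Candidate duplicates with *different* identifiers: crude fuzzy
    title match as a stand-in for the NLP techniques mentioned above."""
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold

records = [
    {"identifier": "doi:10.1/abc", "source": "repoA",
     "title": "Soil organic carbon map", "format": "GeoTIFF"},
    {"identifier": "doi:10.1/abc", "source": "repoB",
     "title": "Soil organic carbon map", "format": "COG",
     "license": "CC-BY"},
    {"identifier": "internal:42", "source": "repoC",
     "title": "Soil Organic Carbon Map"},
]

# Same identifier harvested from two sources -> flag and classify the differences
print(classify_fields(records[0], records[1]))
# -> {'conflicts': ['format'], 'extensions': ['license']}

# Different identifiers but near-identical titles -> candidate for NLP flagging
print(similar_titles(records[0], records[2]))
# -> True
```

Note that both statements of a conflicting field would be kept alongside their `source`, consistent with the point-of-truth approach described above.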
Definition of Done