BerkvensNick opened 6 months ago
This issue needs to be discussed: duplicates will occur, since a knowledge article may be available in Zenodo, OpenAire, and Cordis at the same time. However, each of these platforms captures extra information about the resource, and that information should be merged into a single set of statements about the resource. The knowledge graph will facilitate this process. Along the way we will encounter several challenges, for example a resource having different titles on different platforms; typical behaviour is that both titles are stored.
Maybe in this first iteration we can identify/flag duplicates based on DOI, title similarity, author, and date, and then further discuss with JRC how to tackle the duplicates, based on the actual "duplicate sources" we have found?
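As a first concrete illustration of the merging described above, here is a minimal Python sketch that combines per-platform metadata records into a single set of statements, keeping conflicting values (such as differing titles) side by side rather than discarding them. The field names and the example DOI are assumptions for illustration, not the actual SWR schema:

```python
from collections import defaultdict

def merge_metadata(records: list[dict]) -> dict:
    """Merge per-platform metadata dicts into one set of statements.

    Conflicting values (e.g. different titles on Zenodo and Cordis)
    are kept side by side rather than discarded, mirroring the
    'both titles are stored' behaviour described above.
    """
    merged: dict[str, list] = defaultdict(list)
    for record in records:
        for field, value in record.items():
            if value not in merged[field]:
                merged[field].append(value)
    # Collapse fields where all platforms agree to a single value.
    return {f: v[0] if len(v) == 1 else v for f, v in merged.items()}

zenodo = {"doi": "10.5281/zenodo.1234", "title": "Soil Health Indicators"}
cordis = {"doi": "10.5281/zenodo.1234", "title": "Soil health indicators (v2)"}
print(merge_metadata([zenodo, cordis]))
# {'doi': '10.5281/zenodo.1234',
#  'title': ['Soil Health Indicators', 'Soil health indicators (v2)']}
```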
The SWR will identify duplicates based on metadata. https://github.com/soilwise-he/Soilwise-userstories/issues/16
Origin: D1.3 Repository architecture
With more detailed tasks per requirement:
Define Strategy for Identification, Processing, and Storing of Duplicates for Iteration 1
[ ] Define a clear algorithm or methodology for identifying duplicates based on metadata attributes (e.g., title, author, publication date).
[ ] Ensure the strategy accounts for variations in metadata across different document types and repositories.
[ ] Outline the workflow for processing identified duplicates, including how to handle conflicting metadata and how to determine the primary document.
[ ] Specify whether duplicates should be merged, flagged, or excluded from search results (see the sketch after this list).
[ ] Define the storage mechanism for duplicates, ensuring efficient retrieval and management within the central database.
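A minimal sketch of a possible storage structure for identified duplicates, modelling the merge/flag/exclude options and the notion of a primary document. All names and fields are hypothetical, not the SWR schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Resolution(Enum):
    """How a confirmed duplicate cluster is handled (per the options above)."""
    MERGE = "merge"      # combine statements into the primary record
    FLAG = "flag"        # keep all records, mark them as duplicates
    EXCLUDE = "exclude"  # hide non-primary records from search results

@dataclass
class DuplicateCluster:
    """One group of records describing the same resource."""
    primary_id: str                                   # record chosen as primary
    member_ids: list[str] = field(default_factory=list)
    resolution: Resolution = Resolution.FLAG
    conflicts: dict[str, list] = field(default_factory=dict)  # field -> values

cluster = DuplicateCluster(
    primary_id="zenodo:1234",
    member_ids=["zenodo:1234", "cordis:5678"],
    conflicts={"title": ["Soil Health Indicators", "Soil health indicators (v2)"]},
)
```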
Adapt KG Structure to Support Duplicates (Link with Knowledge Graph)
[ ] Modify the knowledge graph (KG) schema to accommodate duplicate relationships between documents.
[ ] Define how duplicate relationships will be represented within the KG (e.g., as edges linking duplicate nodes; see the sketch after this list).
[ ] Ensure KG queries can retrieve duplicate-related information, allowing users to explore connections between duplicate documents.
[ ] Test KG queries to verify accurate retrieval of duplicate metadata and relationships.
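A sketch of how duplicate edges could be represented and queried, assuming rdflib is used for the KG; the swr:duplicateOf predicate, the namespace, and the record URIs are made up for illustration and are not the actual SWR vocabulary:

```python
from rdflib import Graph, Namespace, URIRef

# Hypothetical namespace and predicate; the actual SWR vocabulary may differ.
SWR = Namespace("https://soilwise-he.eu/ns#")

g = Graph()
g.bind("swr", SWR)

zenodo = URIRef("https://doi.org/10.5281/zenodo.1234")
cordis = URIRef("https://cordis.europa.eu/record/5678")

# Represent the duplicate relationship as an edge between the two nodes.
g.add((cordis, SWR.duplicateOf, zenodo))

# Query the KG for all records flagged as duplicates of the Zenodo record.
results = g.query(
    "SELECT ?dup WHERE { ?dup swr:duplicateOf ?primary }",
    initNs={"swr": SWR},
    initBindings={"primary": zenodo},
)
for row in results:
    print(row.dup)  # https://cordis.europa.eu/record/5678
```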
Processing in Metadata Store
[ ] Ensure metadata extraction is accurate and robust across different document formats and languages.
[ ] Enrich metadata with additional attributes that facilitate duplicate identification (e.g., normalized titles, standardized author names; see the sketch after this list).
[ ] Identify duplicates using the processing capabilities of the SWR metadata store.
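A minimal sketch of the kind of normalization meant above, using only the Python standard library; the exact rules (accent folding, reducing authors to a "surname + initials" key) are assumptions to be refined:

```python
import re
import unicodedata

def normalize_title(title: str) -> str:
    """Fold case, strip accents and punctuation, collapse whitespace."""
    title = unicodedata.normalize("NFKD", title)
    title = "".join(c for c in title if not unicodedata.combining(c))
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def normalize_author(name: str) -> str:
    """Map 'Surname, First' and 'First Surname' to the same 'surname f' key."""
    parts = [p for p in re.split(r"[\s,]+", name.lower()) if p]
    if "," in name:               # 'Berkvens, Nick' -> surname comes first
        surname, given = parts[0], parts[1:]
    else:                         # 'Nick Berkvens' -> surname comes last
        surname, given = parts[-1], parts[:-1]
    return " ".join([surname] + [p[0] for p in given])

print(normalize_title("Soil  Health: Indicators!"))  # 'soil health indicators'
print(normalize_author("Berkvens, Nick"))            # 'berkvens n'
print(normalize_author("Nick Berkvens"))             # 'berkvens n'
```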
Identify Duplicates Based on Metadata
[ ] Implement a duplicate detection algorithm based on metadata similarity metrics (e.g., Jaccard similarity, Levenshtein distance; see the sketch after this list).
[ ] Test the algorithm's performance on a diverse dataset to evaluate its accuracy and efficiency.
[ ] Define thresholds for similarity scores or metadata attributes to classify documents as duplicates.
[ ] Adjust thresholds based on the desired balance between precision and recall in duplicate identification.
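A self-contained sketch of such a detection algorithm: an exact DOI match short-circuits, otherwise a weighted combination of Levenshtein-based title similarity and Jaccard similarity over author sets is compared against a threshold. The 70/30 weighting and the 0.85 threshold are placeholder values to be tuned for precision/recall as described above:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def title_similarity(a: str, b: str) -> float:
    """Levenshtein distance scaled to a 0..1 similarity score."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

def jaccard(a: set, b: set) -> float:
    """Overlap of two sets (e.g., author names) on a 0..1 scale."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Exact DOI match wins; otherwise weigh title and author similarity.

    Titles are assumed to be pre-normalized (see the metadata store
    sketch above); weights and threshold are starting points for tuning.
    """
    if rec_a.get("doi") and rec_a.get("doi") == rec_b.get("doi"):
        return True
    score = (0.7 * title_similarity(rec_a["title"], rec_b["title"])
             + 0.3 * jaccard(set(rec_a.get("authors", [])),
                             set(rec_b.get("authors", []))))
    return score >= threshold
```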
Visualization of Duplicates in User Interface (Link with UI Component)
[ ] Design intuitive visualizations within the user interface (UI) to represent duplicate relationships (see the sketch after this list).
[ ] Ensure visualizations are accessible and informative for users of varying expertise levels.
[ ] Implement interactive features that allow users to explore duplicate relationships (e.g., clicking on a document to view its duplicates).
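A small sketch of how duplicate relationships could be handed to an interactive graph widget: the nodes/edges JSON payload below follows the shape consumed by front-end libraries such as vis-network, and the record ids and field names are illustrative only:

```python
import json

def duplicates_to_graph(records: list[dict], pairs: list[tuple[str, str]]) -> str:
    """Serialize records and duplicate links to a nodes/edges JSON payload."""
    nodes = [{"id": r["id"], "label": r["title"]} for r in records]
    edges = [{"from": a, "to": b, "label": "duplicateOf"} for a, b in pairs]
    return json.dumps({"nodes": nodes, "edges": edges}, indent=2)

records = [
    {"id": "zenodo:1234", "title": "Soil Health Indicators"},
    {"id": "cordis:5678", "title": "Soil health indicators (v2)"},
]
print(duplicates_to_graph(records, [("cordis:5678", "zenodo:1234")]))
```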