BerkvensNick opened 1 month ago
This issue needs to be discussed. Duplicates will occur: a knowledge article may be available in Zenodo, OpenAire, and Cordis, and each of these platforms captures extra information about the resource. This information should be merged into a single set of statements about the resource, a process the knowledge graph will facilitate. Along the way we will face several challenges, for example a resource having different titles on different platforms; typical behaviour today is that both titles are stored.
Maybe in this first iteration we can identify/flag duplicates based on DOI, title similarity, author, and date, and then further discuss with JRC how to tackle the duplicates, based on the actual "duplicate sources" we have found?
Thanks to an Interlinker component, powered by the SoilWise metadata store, the SWR will identify duplicates in data based on metadata. https://github.com/soilwise-he/Soilwise-userstories/issues/16
With more detailed tasks per requirement:
Define Strategy for Identification, Processing, and Storing of Duplicates for Iteration 1
[ ] Define a clear algorithm or methodology for identifying duplicates based on metadata attributes (e.g., title, author, publication date).
[ ] Ensure the strategy accounts for variations in metadata across different document types and repositories.
[ ] Outline the workflow for processing identified duplicates, including how to handle conflicting metadata and how to determine the primary document.
[ ] Specify whether duplicates should be merged, flagged, or excluded from search results.
[ ] Define the storage mechanism for duplicates, ensuring efficient retrieval and management within the central database.
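The workflow above (flag rather than delete, pick a primary document) could be sketched roughly as follows. The `Record` fields and the completeness-based rule for choosing the primary are assumptions for illustration, not an agreed design:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    source: str                     # e.g. "zenodo", "openaire", "cordis"
    title: str
    doi: Optional[str] = None
    authors: list = field(default_factory=list)

def pick_primary(group):
    """Choose the primary record of a duplicate group: prefer records
    with a DOI, then the most complete metadata (hypothetical rule)."""
    def completeness(r):
        return (r.doi is not None, len(r.authors), len(r.title))
    return max(group, key=completeness)

def process_group(group):
    """Flag non-primary records instead of deleting them, so every
    platform's statements stay retrievable for later merging."""
    primary = pick_primary(group)
    return {
        "primary": primary.source,
        "flagged_duplicates": [r.source for r in group if r is not primary],
    }

group = [
    Record("zenodo", "Soil health indicators", doi="10.5281/zenodo.1"),
    Record("cordis", "Soil Health Indicators (report)"),
]
print(process_group(group))
# {'primary': 'zenodo', 'flagged_duplicates': ['cordis']}
```

Keeping duplicates flagged rather than merged in iteration 1 would let the JRC discussion proceed on real examples before any destructive merge policy is fixed.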
Adapt KG Structure to Support Duplicates (Link with Knowledge Graph)
[ ] Modify the knowledge graph (KG) schema to accommodate duplicate relationships between documents.
[ ] Define how duplicate relationships will be represented within the KG (e.g., as edges linking duplicate nodes).
[ ] Ensure KG queries can retrieve duplicate-related information, allowing users to explore connections between duplicate documents.
[ ] Test KG queries to verify accurate retrieval of duplicate metadata and relationships.
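Representing duplicates as edges linking duplicate nodes, as suggested above, might look like the following minimal in-memory triple sketch. The node URIs and the predicate name `isDuplicateOf` are illustrative placeholders, not the actual SWR KG schema:

```python
# Minimal in-memory triple store; a real KG would use RDF/SPARQL.
triples = set()

def add_duplicate_link(a, b):
    # Store the relation symmetrically so either node can be the entry point.
    triples.add((a, "isDuplicateOf", b))
    triples.add((b, "isDuplicateOf", a))

def duplicates_of(node):
    """Traverse duplicate edges from a node, as a KG query would."""
    return sorted(o for s, p, o in triples if s == node and p == "isDuplicateOf")

add_duplicate_link("zenodo:123", "cordis:456")
print(duplicates_of("zenodo:123"))   # ['cordis:456']
```

Storing the edge symmetrically (or declaring the property symmetric in the ontology) keeps query logic simple, since users may land on either copy first.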
Processing in Metadata Store
[ ] Ensure metadata extraction is accurate and robust across different document formats and languages.
[ ] Enrich metadata with additional attributes that facilitate duplicate identification (e.g., normalized titles, standardized author names).
[ ] Identify duplicates using the processing capabilities of the SWR metadata store.
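The enrichment step above (normalized titles, standardized author names) could be approached along these lines; both helpers are simplified heuristics for illustration, not the metadata store's actual processing rules:

```python
import re
import unicodedata

def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c))
    t = re.sub(r"[^\w\s]", " ", t.lower())
    return re.sub(r"\s+", " ", t).strip()

def standardize_author(name: str) -> str:
    """Render names as 'Last, F.' so 'Jan de Vries' and
    'de Vries, Jan' compare equal (simplified heuristic)."""
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
    else:
        parts = name.split()
        if len(parts) == 1:
            return name
        last, first = " ".join(parts[1:]), parts[0]
    return f"{last}, {first[0].upper()}."

print(normalize_title("Soil-Health: Indicators!"))  # 'soil health indicators'
print(standardize_author("Jan de Vries"))           # 'de Vries, J.'
```

Normalizing before comparison reduces spurious mismatches between repositories that differ only in casing, punctuation, or name ordering.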
Identify Duplicates Based on Metadata
[ ] Implement a duplicate detection algorithm based on metadata similarity metrics (e.g., Jaccard similarity, Levenshtein distance).
[ ] Test the algorithm's performance on a diverse dataset to evaluate its accuracy and efficiency.
[ ] Define thresholds for similarity scores or metadata attributes to classify documents as duplicates.
[ ] Adjust thresholds based on the desired balance between precision and recall in duplicate identification.
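The similarity metrics and thresholds named above can be sketched as follows. The threshold values are placeholders to be tuned against the precision/recall trade-off, not agreed defaults:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over title word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_duplicate(t1, t2, jaccard_min=0.8, edit_max=5):
    # Placeholder thresholds; tightening them raises precision,
    # loosening them raises recall.
    return jaccard(t1, t2) >= jaccard_min or levenshtein(t1, t2) <= edit_max

print(is_duplicate("soil health indicators",
                   "Soil Health Indicators"))  # True
```

In practice the title comparison would run on the normalized titles, and DOI equality could short-circuit the similarity check entirely.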
Visualization of Duplicates in User Interface (Link with UI Component)
[ ] Design intuitive visualizations within the user interface (UI) to represent duplicate relationships.
[ ] Ensure visualizations are accessible and informative for users of varying expertise levels.
[ ] Implement interactive features that allow users to explore duplicate relationships (e.g., clicking on a document to view its duplicates).