thegraphnetwork-literev / es-journals

BSD 3-Clause "New" or "Revised" License

Enhance Elasticsearch indexing to exclude duplicate data entries #4

Closed by esloch 6 months ago

esloch commented 6 months ago

Pull Request Description

This pull request changes the Elasticsearch indexing process so that duplicate data entries are excluded. Only the latest data is indexed, which prevents duplication and keeps the Elasticsearch index consistent and efficient. The change matters most for the daily data updates from sources like BioRxiv and MedRxiv, where only new entries should be added to the index.

How to Test These Changes

To test these changes, follow these steps:

  1. Ensure that the Elasticsearch service is running.
  2. Run the data fetch script to download the latest data from BioRxiv or MedRxiv.
  3. Execute the indexing script with the enhanced logic to index only new data:

    bash scripts/run_index_arxiv_to_es.sh biorxiv
    bash scripts/run_index_arxiv_to_es.sh medrxiv

  4. Verify in the Elasticsearch index that only new entries were added and that no previously indexed data is duplicated.
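
The verification step can be partly automated. The sketch below assumes the indexed documents have already been exported into a list of Python dicts (for example, the `_source` of each search hit) and that a `doi` field identifies an entry. Both the helper name `find_duplicates` and the choice of `doi` as the key are assumptions for illustration, not part of this PR; a different field or combination of fields may be the right key for this index.

```python
from collections import Counter


def find_duplicates(docs, keys=("doi",)):
    """Return {key tuple: count} for key combinations seen more than once.

    `docs` is a list of dicts; `keys` names the fields that are assumed
    to identify an entry (hypothetical default: the `doi` field).
    """
    # Missing fields are treated as None via dict.get.
    counts = Counter(tuple(doc.get(k) for k in keys) for doc in docs)
    return {combo: n for combo, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Placeholder records, not real data.
    exported = [
        {"doi": "placeholder-1", "title": "First"},
        {"doi": "placeholder-2", "title": "Second"},
        {"doi": "placeholder-1", "title": "First"},
    ]
    print(find_duplicates(exported))
```

An empty result after running the indexing script twice on the same data would suggest that the second run added no duplicates.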

Pull Request Checklists

This PR is a:

About this PR:

Author's Checklist:

Additional Information

This enhancement addresses data duplication in the Elasticsearch index by generating a unique document ID from each document's content. Before an entry is added, this ID is used to check whether the entry already exists in the index, so duplicates are never written.
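
As a rough illustration of the idea, the sketch below derives an ID by hashing a canonical JSON serialization of the document and skips any document whose ID is already present. The function names, the use of SHA-256, and the plain dict standing in for the Elasticsearch index are all assumptions made to keep the example self-contained; the PR's actual hashing scheme and client calls are not shown in this description.

```python
import hashlib
import json


def content_id(doc: dict) -> str:
    """Return a deterministic ID derived from the document's content."""
    # Sort keys so that field order does not change the resulting hash.
    canonical = json.dumps(
        doc, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def index_new_docs(index: dict, docs: list) -> int:
    """Add only unseen documents to `index`; return how many were added.

    `index` is a dict used as a stand-in for the Elasticsearch index
    (an assumption for this sketch), mapping document ID to document.
    """
    added = 0
    for doc in docs:
        doc_id = content_id(doc)
        if doc_id in index:  # stand-in for an existence check in Elasticsearch
            continue
        index[doc_id] = doc
        added += 1
    return added


if __name__ == "__main__":
    index = {}
    # Placeholder records, not real data.
    batch = [
        {"doi": "placeholder-1", "title": "First"},
        {"title": "First", "doi": "placeholder-1"},  # same content, other order
        {"doi": "placeholder-2", "title": "Second"},
    ]
    print(index_new_docs(index, batch))  # first run adds the unseen documents
    print(index_new_docs(index, batch))  # second run finds nothing new
```

Against a real cluster, the dict lookup would presumably be replaced by an existence check on the document ID, or by indexing with the content hash as `_id` so that a repeated document maps to the same entry. Note that a content hash changes whenever any field changes, so a revised version of a record would be treated as a new document under this scheme.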

Reviewer's Checklist


- [ ] I managed to reproduce the problem locally from the `main` branch.
- [ ] I managed to test the new changes locally.
- [ ] I confirm that the issues mentioned were fixed/resolved.