## Pull Request Description
This pull request enhances the Elasticsearch indexing process with a mechanism that excludes duplicate entries. The goal is that each run indexes only new data, which prevents duplication and preserves the integrity and efficiency of the Elasticsearch index. The change matters most for the daily updates from sources such as BioRxiv and MedRxiv, where only new entries should be added to the index.
## How to Test These Changes
To test these changes, follow these steps:
1. Ensure that the Elasticsearch service is running.
2. Run the data fetch script to download the latest data from BioRxiv or MedRxiv.
3. Execute the indexing script with the enhanced logic to index only new data.
4. Verify in the Elasticsearch index that only new entries were added and that previously indexed data is not duplicated.
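For the final verification step, one possible check is to export the indexed records and count how often each identifying value appears. The sketch below is only an illustration: `find_duplicates` is a hypothetical helper, and the `doi` field name is an assumption about the record schema, not something this PR defines.

```python
from collections import Counter


def find_duplicates(records, key="doi"):
    """Return {value: count} for every key value that occurs more than once.

    `records` is any iterable of dicts, for example documents exported
    from the index. The default key "doi" is an assumed field name.
    """
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up sample records; an empty result means no duplicates were found.
sample = [{"doi": "a"}, {"doi": "b"}, {"doi": "a"}]
duplicates = find_duplicates(sample)
```

An empty dictionary after a second indexing run over the same data would suggest the duplicate exclusion works as intended.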
About this PR:
- [x] The tests generate log file(s) (path: `/tmp/elasticrxivx_{index_name}_{timestamp}.log`).
- [x] Pre-commit hooks were executed locally.
- [x] This PR requires a project documentation update.
Author's Checklist:
- [ ] I have reviewed the changes and they contain no misspellings.
- [ ] The code is well commented, especially in the more complex parts.
- [ ] New and old tests passed locally.
## Additional Information
This enhancement addresses data duplication in the Elasticsearch index by generating a unique document ID from each document's content. Before an entry is added, this ID is used to check whether the entry already exists in the index, which prevents duplicate entries.
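The idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `content_id` and `select_new` are hypothetical names, SHA-256 over a canonical JSON serialization is one possible way to derive an ID from content, and the sample records are made up. An in-memory set stands in for the index so the example runs without a cluster.

```python
import hashlib
import json


def content_id(doc: dict) -> str:
    """Derive a deterministic ID from a document's content (hypothetical helper)."""
    # sort_keys makes the serialization independent of key order,
    # so equal content always hashes to the same ID.
    canonical = json.dumps(
        doc, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_new(docs, id_exists):
    """Yield (doc_id, doc) pairs whose ID is not already indexed.

    `id_exists` is any callable that takes an ID and returns a bool.
    It also skips repeats within the same batch.
    """
    seen = set()
    for doc in docs:
        doc_id = content_id(doc)
        if doc_id in seen or id_exists(doc_id):
            continue
        seen.add(doc_id)
        yield doc_id, doc


# Usage with a set standing in for the index (sample data is made up).
indexed = set()
batch = [
    {"doi": "10.1101/0001", "title": "A"},
    {"title": "A", "doi": "10.1101/0001"},  # same content, different key order
    {"doi": "10.1101/0002", "title": "B"},
]
new_docs = list(select_new(batch, indexed.__contains__))
indexed.update(doc_id for doc_id, _ in new_docs)
```

Against a real cluster, `id_exists` could wrap the official Python client's `exists(index=..., id=...)` call, and each selected document could then be indexed under its `doc_id`. Because the ID is derived from content, re-running the same batch selects nothing new.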
## Reviewer's Checklist
- [ ] I managed to reproduce the problem locally from the `main` branch.
- [ ] I managed to test the new changes locally.
- [ ] I confirm that the issues mentioned were fixed/resolved.