This pull request adds unique document ID generation to the Elasticsearch indexing process used within the LiteRev platform. The goal is to prevent duplicate documents when new data from the MedRxiv and BioRxiv servers is indexed each day. Ensuring that each document indexed into Elasticsearch is unique maintains data integrity and improves the platform's overall search efficiency, since queries no longer return repeated entries.
Closes #3
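One way to obtain stable, unique IDs is to derive them deterministically from the fields that identify a preprint, so that indexing the same record twice targets the same `_id` instead of creating a second document. The sketch below is illustrative only and is not taken from this PR's code: the helper names `make_document_id` and `to_bulk_actions`, and the choice of `doi` and `version` as identifying fields, are assumptions.

```python
import hashlib
from typing import Iterable, Iterator


def make_document_id(record: dict, key_fields: tuple = ("doi", "version")) -> str:
    """Return a deterministic ID built from the record's identifying fields.

    The field names are assumptions for illustration; use whichever fields
    uniquely identify a document in the real data.
    """
    parts = [str(record.get(field, "")).strip().lower() for field in key_fields]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def to_bulk_actions(records: Iterable[dict], index_name: str) -> Iterator[dict]:
    """Yield action dicts in the shape accepted by elasticsearch.helpers.bulk."""
    for record in records:
        yield {
            "_op_type": "index",  # an existing _id is overwritten, not duplicated
            "_index": index_name,
            "_id": make_document_id(record),
            "_source": record,
        }
```

Because the ID depends only on the identifying fields, running the daily job twice over the same records yields the same `_id` values, and Elasticsearch replaces the existing documents rather than adding new ones.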
How to Test These Changes
To test these changes, follow the steps below:
1. Ensure the Elasticsearch service is running and accessible.
2. Execute the modified indexing script for either `medrxiv` or `biorxiv` data.
3. Verify that documents are indexed correctly without duplication by querying the Elasticsearch database.
4. Check the logs generated by the indexing script to ensure unique document IDs are generated and used during the indexing process.
Example script execution:
python scripts/index_rxivx_data.py medrxiv
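The duplication check in the steps above could be scripted along these lines. This is a hedged sketch rather than part of the PR: it assumes hits shaped like Elasticsearch search results (dicts with `_id` and `_source`), for example as yielded by `elasticsearch.helpers.scan`, and it assumes that `doi` and `version` together identify a document. The function name `find_duplicates` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def find_duplicates(hits: Iterable[dict], key_fields: tuple = ("doi", "version")) -> dict:
    """Map each identifying key to the list of _ids that share it,
    keeping only keys stored under more than one _id."""
    ids_by_key = defaultdict(list)
    for hit in hits:
        source = hit.get("_source", {})
        # The key fields are an assumption; adapt them to the real mapping.
        key = tuple(str(source.get(field, "")) for field in key_fields)
        ids_by_key[key].append(hit["_id"])
    return {key: ids for key, ids in ids_by_key.items() if len(ids) > 1}
```

An empty result after running the indexing script twice in a row suggests that the second run did not create duplicates.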
Pull Request Checklists
This PR is a:
- [x] bug-fix
- [ ] new feature
- [ ] maintenance
About this PR:
- [x] It includes tests.
- [x] The tests are executed on CI.
- [x] The tests generate log file(s) (path: `/tmp/elasticrxivx_{index_name}_{timestamp}.log`).
- [x] Pre-commit hooks were executed locally.
- [ ] This PR requires a project documentation update.
Author's Checklist:
- [x] I have reviewed the changes and they contain no misspellings.
- [x] The code is well commented, especially in the more complex parts.
- [x] New and old tests passed locally.
Additional Implementation
1. Secure Password Management for Elasticsearch
Adds a script that automatically resets and updates the password of the Elasticsearch `elastic` user. The script runs as part of the container startup process, so the credentials are updated whenever needed without manual intervention.
Example script execution:
containers/init-scripts/set_passwords.sh
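For readers unfamiliar with what such a script has to do, the following Python sketch shows one possible approach: generate a random password and build a request to Elasticsearch's change-password endpoint (`POST /_security/user/elastic/_password`). It does not reproduce `set_passwords.sh`; the function names, the base URL, and the way the current password is supplied are all assumptions, and the request is only constructed here, not sent.

```python
import base64
import json
import secrets
import urllib.request


def generate_password(n_bytes: int = 24) -> str:
    """Return a URL-safe random password from a cryptographically secure source."""
    return secrets.token_urlsafe(n_bytes)


def build_password_request(
    base_url: str, current_password: str, new_password: str
) -> urllib.request.Request:
    """Build (but do not send) the request that changes the 'elastic' user's password."""
    # Basic authentication with the current credentials (an assumption about the setup).
    credentials = base64.b64encode(
        f"elastic:{current_password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_security/user/elastic/_password",
        data=json.dumps({"password": new_password}).encode("utf-8"),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would apply the change. In a real deployment, TLS verification and storage of the new password need separate care.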
Reviewer's Checklist
Please use the following checklist for reviewing this PR:
## Reviewer's Checklist
- [ ] I managed to reproduce the problem locally from the `main` branch.
- [ ] I managed to test the new changes locally.
- [ ] I confirm that the issues mentioned were fixed/resolved.