thegraphnetwork-literev / es-journals

BSD 3-Clause "New" or "Revised" License

feat: Implement CRON tasks for Elasticsearch data fetching and indexing, and Add NGINX deployment services #1

Closed esloch closed 7 months ago

esloch commented 7 months ago

Pull Request Description:

This pull request implements CRON tasks to automate Elasticsearch indexing for the LiteRev platform. The tasks are scheduled to run at specific times daily, fetching data from MedRxiv and BioRxiv servers and then indexing it into Elasticsearch. Additionally, NGINX deployment services are introduced as Docker containers. Furthermore, automation for updating cronjobs for the devops user has been added to streamline operational workflows.

Changes made in this pull request:

1. Implemented scripts to fetch data from MedRxiv and BioRxiv servers and merge it into the Elasticsearch database.

2. Added NGINX docker services to enhance the application deployment process.

3. Automated CRON configuration and updates for the devops user to schedule tasks for fetching and indexing data from MedRxiv and BioRxiv servers.

Updates:

These updates improve the automation of fetching, merging, and indexing data from BioRxiv and MedRxiv servers into Elasticsearch. Additionally, NGINX deployment services have been added to externalize the Elasticsearch API, and automation for cronjob updates streamlines operational workflows for the devops user.
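As a rough illustration of the scheduling described above, the devops crontab could look like the sketch below. All paths, script names, and run times here are hypothetical, not the PR's actual configuration.

```shell
# Illustrative crontab for the devops user (paths and times are assumptions).
# Fetch new records from each preprint server, then index into Elasticsearch.
0 2 * * *  /home/devops/es-journals/scripts/fetch_medrxiv.sh  >> /var/log/es-journals/fetch_medrxiv.log 2>&1
30 2 * * * /home/devops/es-journals/scripts/fetch_biorxiv.sh  >> /var/log/es-journals/fetch_biorxiv.log 2>&1
0 3 * * *  /home/devops/es-journals/scripts/index_elasticsearch.sh >> /var/log/es-journals/index.log 2>&1
```

Staggering the indexing job after both fetch jobs avoids indexing a partially downloaded batch.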

xmnlab commented 7 months ago

@esloch do you have a time for a pair-review session?

esloch commented 7 months ago

> some general comments about the PR
>
> * data is inside the src folder ... it is not super common ... only when we need to distribute some data inside the package; in this case, we don't need to distribute that inside a package, so the data probably shouldn't be inside a src folder
>
> * this data looks like dummy data, so maybe it should be inside ./tests/data

I agree that the app will not be distributed as a package. However, these data files are not dummy data. They are necessary for downloading subsequent databases based on the last downloaded file, so I'm not sure the tests/data directory would be a good place for them.
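The role these files play can be sketched as follows: the next fetch resumes from the point recorded by the previous run. This is a minimal illustration; the file name, field names, and date handling are assumptions, not the project's actual code.

```python
import json
from pathlib import Path

def next_fetch_start(state_file: Path) -> str:
    """Return the date from which the next download should resume.

    The state file holds metadata for the last record fetched from the
    server. Field names here ("date", etc.) are illustrative only.
    """
    records = json.loads(state_file.read_text())
    # The last entry marks how far previous runs got.
    return records[-1]["date"]

# Example: a state file left behind by a hypothetical previous run.
state = Path("last_downloaded.json")
state.write_text(json.dumps(
    [{"title": "Example of title", "version": "0", "date": "2024-01-15"}]
))
print(next_fetch_start(state))  # 2024-01-15
```

Under this reading, deleting the files would break incremental fetching, which is why they are not pure test fixtures.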

xmnlab commented 7 months ago

@esloch could you explain a bit more about that?

[
  {
    "title": "Example of title",
    "version": "0"
  }
]

because it really looks like dummy data.

but anyway, do we need to keep this data in the repo? normally, if you need some kind of initial data for the project, you can just create a script that generates it, instead of adding it to the repo.

so maybe the way to go would be:

  1. move this from ./src/data to ./data
  2. create a script that generates the initial files if necessary ... but I think this "example" data shouldn't be used ... otherwise we will contaminate the database
esloch commented 7 months ago

> @esloch could you explain a bit more about that?
>
> [
>   {
>     "title": "Example of title",
>     "version": "0"
>   }
> ]
>
> because it really looks like dummy data.
>
> but anyway, do we need to keep this data in the repo? normally, if you need some kind of initial data for the project, you can just create a script that generates it, instead of adding it to the repo.
>
> so maybe the way to go would be:
>
>   1. move this from ./src/data to ./data
>   2. create a script that generates the initial files if necessary ... but I think this "example" data shouldn't be used ... otherwise we will contaminate the database

I got it, thanks for the tip. I'll download the oldest data from each of the servers (biorxiv and medrxiv) and set it as the initial data for download in both the 'downloaded' and 'final' directories. Additionally, I'll move ./data/ to the project root.
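A bootstrap script along these lines could generate the initial state files instead of committing them to the repo. This is only a sketch under stated assumptions: the seed records, directory layout, and file names are hypothetical (in practice the oldest records would be fetched from the bioRxiv/medRxiv APIs rather than hard-coded).

```python
import json
from pathlib import Path

# Hypothetical earliest records per server; real values would come from
# the preprint servers' APIs, not be hard-coded like this.
SEEDS = {
    "biorxiv": {"title": "Oldest bioRxiv record", "version": "1"},
    "medrxiv": {"title": "Oldest medRxiv record", "version": "1"},
}

def bootstrap(data_dir: Path) -> None:
    """Create the initial state files expected by the fetch scripts."""
    for subdir in ("downloaded", "final"):
        for server, seed in SEEDS.items():
            target = data_dir / subdir / f"{server}.json"
            target.parent.mkdir(parents=True, exist_ok=True)
            # Never overwrite state left behind by real runs.
            if not target.exists():
                target.write_text(json.dumps([seed], indent=2))

bootstrap(Path("data"))
```

Generating the files on demand keeps example data out of version control, which also addresses the concern about contaminating the database.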

xmnlab commented 7 months ago

also, remember to run `pre-commit install` locally so the pre-commit hook is installed there

xmnlab commented 7 months ago

checklist