thegraphnetwork-literev / es-journals

BSD 3-Clause "New" or "Revised" License

feat: Implement CRON tasks for Elasticsearch data fetching and indexing, and Add NGINX deployment services #1

Closed esloch closed 7 months ago

esloch commented 7 months ago

Pull Request Description:

This pull request implements CRON tasks to automate Elasticsearch indexing for the LiteRev platform. The tasks are scheduled to run at specific times daily, fetching data from MedRxiv and BioRxiv servers and then indexing it into Elasticsearch. Additionally, NGINX deployment services are introduced as Docker containers. Furthermore, automation for updating cronjobs for the devops user has been added to streamline operational workflows.

Changes made in this pull request:

1. Implemented scripts to fetch data from MedRxiv and BioRxiv servers and merge it into the Elasticsearch database.

2. Added NGINX docker services to enhance the application deployment process.

3. Automated CRON configuration and updates for the devops user to schedule tasks for fetching and indexing data from MedRxiv and BioRxiv servers.

Updates:

These updates improve the automation of fetching, merging, and indexing data from BioRxiv and MedRxiv servers into Elasticsearch. Additionally, NGINX deployment services have been added to externalize the Elasticsearch API, and automation for cronjob updates streamlines operational workflows for the devops user.
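As a rough illustration of the scheduling described above, the devops crontab could look like the sketch below. All paths, script names, and run times here are hypothetical, not the PR's actual configuration.

```shell
# Illustrative crontab for the devops user (paths and times are assumptions).
# Fetch new records from each preprint server, then index into Elasticsearch.
0 2 * * *  /home/devops/es-journals/scripts/fetch_medrxiv.sh  >> /var/log/es-journals/fetch_medrxiv.log 2>&1
30 2 * * * /home/devops/es-journals/scripts/fetch_biorxiv.sh  >> /var/log/es-journals/fetch_biorxiv.log 2>&1
0 3 * * *  /home/devops/es-journals/scripts/index_elasticsearch.sh >> /var/log/es-journals/index.log 2>&1
```

Staggering the indexing job after both fetch jobs avoids indexing a partially downloaded batch.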

xmnlab commented 7 months ago

@esloch do you have a time for a pair-review session?

esloch commented 7 months ago

> some general comments about the PR
>
> * data is inside the src folder ... it is not super common ... only when we need to distribute some data inside the package; in this case, we don't need to distribute that inside a package, so the data probably shouldn't be inside a src folder
>
> * this data looks like dummy data, so maybe it should be inside ./tests/data

I agree that the app will not be distributed as a package. However, these data files are not dummy data. They are necessary for downloading subsequent databases based on the last downloaded file, so I'm not sure the tests/data directory would be a good place for them.
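The role these files play can be sketched as follows: the next fetch resumes from the point recorded by the previous run. This is a minimal illustration; the file name, field names, and date handling are assumptions, not the project's actual code.

```python
import json
from pathlib import Path

def next_fetch_start(state_file: Path) -> str:
    """Return the date from which the next download should resume.

    The state file holds metadata for the last record fetched from the
    server. Field names here ("date", etc.) are illustrative only.
    """
    records = json.loads(state_file.read_text())
    # The last entry marks how far previous runs got.
    return records[-1]["date"]

# Example: a state file left behind by a hypothetical previous run.
state = Path("last_downloaded.json")
state.write_text(json.dumps(
    [{"title": "Example of title", "version": "0", "date": "2024-01-15"}]
))
print(next_fetch_start(state))  # 2024-01-15
```

Under this reading, deleting the files would break incremental fetching, which is why they are not pure test fixtures.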

xmnlab commented 7 months ago

@esloch could you explain a bit more about that?

[
  {
    "title": "Example of title",
    "version": "0"
  }
]

because it really looks like dummy data.

but anyway, do we need to keep this data in the repo? normally, if you need some kind of initial data for the project, you can just create a script that generates it, instead of adding it to the repo.

so maybe the way to go would be:

  1. move this from ./src/data to ./data
  2. create a script that generates the initial files if necessary ... but I think this "example" data shouldn't be used ... otherwise we will contaminate the database
esloch commented 7 months ago

> @esloch could you explain a bit more about that?
>
> [
>   {
>     "title": "Example of title",
>     "version": "0"
>   }
> ]
>
> because it really looks like dummy data.
>
> but anyway, do we need to keep this data in the repo? normally, if you need some kind of initial data for the project, you can just create a script that generates it, instead of adding it to the repo.
>
> so maybe the way to go would be:
>
>   1. move this from ./src/data to ./data
>   2. create a script that generates the initial files if necessary ... but I think this "example" data shouldn't be used ... otherwise we will contaminate the database

I got it, thanks for the tip. I'll download the oldest data from each of the servers (biorxiv and medrxiv) and set it as the initial data for download in both the 'downloaded' and 'final' directories. Additionally, I'll move ./data/ to the project root.
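A bootstrap script along these lines could generate the initial state files instead of committing them to the repo. This is only a sketch under stated assumptions: the seed records, directory layout, and file names are hypothetical (in practice the oldest records would be fetched from the bioRxiv/medRxiv APIs rather than hard-coded).

```python
import json
from pathlib import Path

# Hypothetical earliest records per server; real values would come from
# the preprint servers' APIs, not be hard-coded like this.
SEEDS = {
    "biorxiv": {"title": "Oldest bioRxiv record", "version": "1"},
    "medrxiv": {"title": "Oldest medRxiv record", "version": "1"},
}

def bootstrap(data_dir: Path) -> None:
    """Create the initial state files expected by the fetch scripts."""
    for subdir in ("downloaded", "final"):
        for server, seed in SEEDS.items():
            target = data_dir / subdir / f"{server}.json"
            target.parent.mkdir(parents=True, exist_ok=True)
            # Never overwrite state left behind by real runs.
            if not target.exists():
                target.write_text(json.dumps([seed], indent=2))

bootstrap(Path("data"))
```

Generating the files on demand keeps example data out of version control, which also addresses the concern about contaminating the database.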

xmnlab commented 7 months ago

also, remember to run `pre-commit install` locally so the pre-commit hook is installed there

xmnlab commented 7 months ago

checklist