Closed · esloch closed this pull request 8 months ago
@esloch do you have time for a pair-review session?
some general comments about the PR
* the data is inside the src folder ... that is not super common; we only do that when we need to distribute some data inside the package, and in this case we don't need to distribute it inside a package, so the data probably shouldn't be inside the src folder
* this data looks like dummy data, so maybe it should live inside ./tests/data
I agree that the app will not be distributed as a package. However, these data files are not dummy data: they are needed to download the subsequent databases, starting from the last downloaded file. So I'm not sure the tests/data directory would be a good place for them.
@esloch could you explain a bit more about that?
[
{
"title": "Example of title",
"version": "0"
}
]
because it really looks like dummy data.
but anyway, do we need to keep this data in the repo? normally, if you need some initial data for the project, you can just create a script that generates it, instead of adding it to the repo.
so maybe the way to go would be:
- move this from ./src/data to ./data
- create a script that generates the initial files if necessary (see the sketch below) ... but I think this "example" data shouldn't be used, otherwise we will contaminate the database
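For reference, a minimal sketch of such a generator script, assuming the data ends up under ./data/downloaded and ./data/final and that each server (biorxiv, medrxiv) gets one JSON state file; the names and layout here are illustrative only, not the project's actual structure:

```python
# generate_initial_data.py -- hypothetical helper, not part of this PR
import json
from pathlib import Path

SERVERS = ["biorxiv", "medrxiv"]   # server names as used by the Rxiv API
DATA_DIR = Path("data")            # assumed location at the project root
SUBDIRS = ["downloaded", "final"]  # assumed directory layout

def write_seed(server: str, subdir: str) -> None:
    """Create an empty JSON state file for one server if it does not exist yet."""
    target = DATA_DIR / subdir / f"{server}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_text(json.dumps([], indent=2) + "\n")

if __name__ == "__main__":
    for server in SERVERS:
        for subdir in SUBDIRS:
            write_seed(server, subdir)
```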
I got it, thanks for the tip. I'll download the oldest data from each of the servers (biorxiv and medrxiv) and set it as the initial data for download in both the 'downloaded' and 'final' directories. Additionally, I'll move ./data/ to the project root.
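A minimal sketch of that seeding step, assuming the public api.biorxiv.org details endpoint (`/details/{server}/{start}/{end}/{cursor}`) and its `collection` field; the date windows, file names, and directory layout are assumptions, not the actual implementation:

```python
# seed_oldest_rxiv.py -- hypothetical helper, not part of this PR
import json
from pathlib import Path

import requests

# Assumed public details endpoint; the real scripts may use a different client.
API = "https://api.biorxiv.org/details/{server}/{start}/{end}/0"

# Approximate launch months of each server, used as the earliest window.
EARLIEST = {
    "biorxiv": ("2013-11-01", "2013-11-30"),
    "medrxiv": ("2019-06-01", "2019-06-30"),
}

def fetch_oldest(server: str) -> list[dict]:
    """Fetch the first page of records from the server's earliest month."""
    start, end = EARLIEST[server]
    resp = requests.get(API.format(server=server, start=start, end=end), timeout=60)
    resp.raise_for_status()
    return resp.json().get("collection", [])

def seed(server: str) -> None:
    """Write the oldest records as initial data for both working directories."""
    records = fetch_oldest(server)
    for subdir in ("downloaded", "final"):  # assumed directory names
        target = Path("data") / subdir / f"{server}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(records, indent=2) + "\n")

if __name__ == "__main__":
    for srv in ("biorxiv", "medrxiv"):
        seed(srv)
```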
also, remember to run `pre-commit install` locally to have the pre-commit hooks installed there
Pull Request Description:

This pull request implements CRON tasks to automate Elasticsearch indexing for the LiteRev platform. The tasks are scheduled to run at specific times daily, fetching data from the MedRxiv and BioRxiv servers and then indexing it into Elasticsearch. Additionally, NGINX deployment services are introduced as Docker containers. Furthermore, automation for updating cronjobs for the `devops` user has been added to streamline operational workflows.

Changes made in this pull request:

1. Implemented scripts to fetch data from the MedRxiv and BioRxiv servers and merge it into the Elasticsearch database (see the indexing sketch after this description):
   - Download Data Script (`fetch_rxiv_data.sh`)
   - Merge Data Script (`merge_rxiv_data.py`)
   - Elasticsearch Indexing Runner Script (`run_index_rxiv_to_es.sh`)
   - Elasticsearch Indexing Script (`index_rxiv_to_es.py`)
2. Added NGINX Docker services to enhance the application deployment process:
   - NGINX Service Configuration
   - Certbot Service Configuration (based on the `certbot/certbot` image)
3. Automated CRON configuration and updates for the `devops` user to schedule tasks for fetching and indexing data from the MedRxiv and BioRxiv servers (see the cron sketch after this description):
   - `.makim` target `setup-cron` to automatically update cronjobs for the `devops` user

Updates:

- `.gitignore` to exclude NGINX settings and databases.
- `.makim.yaml` to include a target for downloading data from the BioRxiv/MedRxiv API and automating cronjob updates.
- `.sugar.yaml` to define default settings and services for development and production environments.
- `conda/dev.yaml` to `conda/base.yaml`, and added dependencies.
- `containers/Dockerfile` and `containers/compose.yaml`.
- `containers/compose.elasticsearch.yaml` and `containers/compose.nginx.yaml`.
- `containers/dockerfile.nginx`.
- `containers/download-data.R` and `containers/entrypoint.sh`.
- `containers/nginx/.gitkeep` and `data/.gitkeep`.

These updates improve the automation of fetching, merging, and indexing data from the BioRxiv and MedRxiv servers into Elasticsearch. Additionally, NGINX deployment services have been added to externalize the Elasticsearch API, and automation for cronjob updates streamlines operational workflows for the `devops` user.
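For item 1, a minimal sketch of what the indexing step could look like, assuming merged records are stored as a JSON array with a `doi` field and using the `elasticsearch` Python client; the endpoint, index name, file path, and ID choice are assumptions, not the code in `index_rxiv_to_es.py`:

```python
# index_rxiv_to_es_sketch.py -- illustrative only, not the PR's index_rxiv_to_es.py
import json
from pathlib import Path

from elasticsearch import Elasticsearch, helpers

ES_URL = "http://localhost:9200"             # assumed Elasticsearch endpoint
INDEX = "rxiv"                               # assumed index name
DATA_FILE = Path("data/final/biorxiv.json")  # assumed merged-data location

def actions(records, index):
    """Turn merged records into bulk-index actions, using the DOI as document ID."""
    for rec in records:
        yield {"_index": index, "_id": rec.get("doi"), "_source": rec}

def main() -> None:
    es = Elasticsearch(ES_URL)
    records = json.loads(DATA_FILE.read_text())
    ok, errors = helpers.bulk(es, actions(records, INDEX), raise_on_error=False)
    print(f"indexed {ok} documents, {len(errors)} errors")

if __name__ == "__main__":
    main()
```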
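For item 3, a minimal sketch of a cronjob updater in the same spirit as the `setup-cron` target, assuming a cron implementation whose `crontab` command accepts `-` to read from stdin; the schedules and script paths are placeholders, not the actual `.makim` target:

```python
# setup_cron_sketch.py -- illustrative only, not the PR's setup-cron target
import subprocess

# Placeholder schedules and paths; the real cronjobs run the PR's fetch and
# indexing scripts at whatever times the project actually configures.
CRON_LINES = [
    "0 2 * * * bash /home/devops/literev/scripts/fetch_rxiv_data.sh",
    "0 4 * * * bash /home/devops/literev/scripts/run_index_rxiv_to_es.sh",
]

def install_crontab(lines: list[str]) -> None:
    """Replace the current user's crontab with the given lines."""
    content = "\n".join(lines) + "\n"
    subprocess.run(["crontab", "-"], input=content, text=True, check=True)

if __name__ == "__main__":
    install_crontab(CRON_LINES)
```

Note that this replaces the current user's crontab wholesale; a real updater would likely merge with the existing `crontab -l` output first.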