nimh-dsst / osm

OpenSciMetrics (OSM) applies NLP and LLM-based metrics and indicators related to transparency, data sharing, rigor, and open science to biomedical publications.
Creative Commons Zero v1.0 Universal

Configure GHA workflow to deploy infrastructure and application to each environment #64

Open marcelovilla opened 3 weeks ago

marcelovilla commented 3 weeks ago

There's an existing GHA workflow to deploy the infrastructure. However, it seems it has not been run yet and that there are some modifications that need to be made:

leej3 commented 3 weeks ago

Sounds good. Some notes.

it seems it has not been run yet

Correct. A placeholder more than anything. You can safely ignore it.

We should make sure we build and push the Docker image

There is now a base image to reduce redundant layers. That should be pushed too.

but also for the dashboard

At the moment I resorted to a hack and copied some data into the dashboard container. Some strategy for data caching should be used here to avoid downloading from the DB on each redeployment. The data is not sensitive, so I think a reasonable strategy might be to use a GitHub cache, copy it to the deployed instance, and then mount the data into the dashboard container using a Docker bind mount. Issues that arise when the data schema changes in a backward-incompatible way should be addressed. It might be that the data could be populated in a non-blocking way.
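To make the caching idea concrete, here is a minimal Python sketch of the two pieces it implies: deciding whether a cached copy is still compatible with the current schema, and composing a `docker run` call that bind-mounts it. Everything named here is an assumption for illustration and not taken from the OSM repository: the `meta.json` sidecar, the `dashboard_data.parquet` file name, the `EXPECTED_SCHEMA_VERSION` constant, and the `/opt/data` target path.

```python
import json
from pathlib import Path
from typing import List

# Hypothetical: bump this whenever the dashboard expects a new data layout.
EXPECTED_SCHEMA_VERSION = 2


def cache_is_usable(cache_dir: Path) -> bool:
    """True if a cached dataset exists and matches the expected schema version."""
    meta = cache_dir / "meta.json"  # hypothetical sidecar written with the cache
    data = cache_dir / "dashboard_data.parquet"  # hypothetical data file name
    if not (meta.is_file() and data.is_file()):
        return False
    try:
        version = json.loads(meta.read_text()).get("schema_version")
    except (json.JSONDecodeError, AttributeError):
        # Unreadable or unexpected metadata: treat the cache as stale.
        return False
    return version == EXPECTED_SCHEMA_VERSION


def dashboard_run_command(cache_dir: Path, image: str) -> List[str]:
    """Build a `docker run` invocation that bind-mounts the cache read-only."""
    # /opt/data is an assumed location; the dashboard would need to read from it.
    mount = f"type=bind,source={cache_dir.resolve()},target=/opt/data,readonly"
    return ["docker", "run", "--detach", "--mount", mount, image]
```

If `cache_is_usable` returns `False`, the deployment could fall back to downloading from the DB and rewriting the cache, which would be one way to handle a backward-incompatible schema change without manual cleanup.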

marcelovilla commented 3 weeks ago

@leej3 thanks for the notes.

There is now a base image to reduce redundant layers. That should be pushed too.

It seems the base image is used to build both the API and dashboard images. Instead of pushing it too, I suggest we write our deploy workflow so that we build it locally, build the API and dashboard images on top of it, and then push the latter two.
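As a rough sketch of that ordering, the Python snippet below only assembles the `docker` commands the workflow would run, so the sequence is easy to review. The registry, image names, Dockerfile paths, and the `BASE_IMAGE` build argument are all placeholders I made up for illustration; it also assumes each application Dockerfile starts with `ARG BASE_IMAGE` followed by `FROM ${BASE_IMAGE}`.

```python
import shlex
from typing import List

# Hypothetical names: not taken from the OSM repository.
REGISTRY = "registry.example.com/osm"
BASE_TAG = "osm_base:local"


def deploy_commands(version: str) -> List[List[str]]:
    """Return docker CLI invocations in the order the workflow would run them."""
    commands = [
        # The base image is built and tagged locally only; it is never pushed.
        ["docker", "build", "-t", BASE_TAG, "-f", "Dockerfile.base", "."],
    ]
    app_tags = []
    for name in ("api", "dashboard"):
        tag = f"{REGISTRY}/{name}:{version}"
        # Each application image is built on top of the local base image.
        commands.append([
            "docker", "build",
            "--build-arg", f"BASE_IMAGE={BASE_TAG}",
            "-t", tag,
            "-f", f"{name}/Dockerfile",
            ".",
        ])
        app_tags.append(tag)
    # Push only the two application images.
    commands.extend(["docker", "push", tag] for tag in app_tags)
    return commands


if __name__ == "__main__":
    for cmd in deploy_commands("v0.0.1"):
        print(shlex.join(cmd))
```

One caveat: this relies on the builder being able to resolve the locally tagged base image, which I believe holds for the default Docker driver but may not for a separate buildx container builder.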

At the moment I resorted to a hack and copied some data into the dashboard container. Some strategy for data caching should be used here to avoid downloading from the DB on each redeployment. The data is not sensitive, so I think a reasonable strategy might be to use a GitHub cache, copy it to the deployed instance, and then mount the data into the dashboard container using a Docker bind mount.

We'll explore what a good way of accomplishing this would be but I think your suggestion should work fine. Out of curiosity, why do we need to have data available in the container if it's already stored in the DB? Couldn't we query the DB on the fly from the application itself? Is it because it's a lot of data?

leej3 commented 3 weeks ago

Couldn't we query the DB on the fly from the application itself? Is it because it's a lot of data?

Good point. For local development the download was taking about 3 minutes and using a lot of unnecessary bandwidth. But if the download happens on an EC2 instance, it might be quick enough that caching is not worth worrying about. Try it first with no cache...
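One way to check this on the instance is to time the load once before deciding. A minimal Python sketch follows; the `load` callable stands in for whatever function fetches the dashboard data, and the 30-second budget is an arbitrary assumption, not a measured requirement.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(load: Callable[[], T]) -> Tuple[T, float]:
    """Run `load` once and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = load()
    return result, time.perf_counter() - start


def needs_cache(elapsed_seconds: float, budget_seconds: float = 30.0) -> bool:
    """Hypothetical rule of thumb: cache only if the load exceeds the budget."""
    return elapsed_seconds > budget_seconds
```

Logging the elapsed time from a first uncached deployment would show whether the 3 minutes seen locally carries over to EC2.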

I suggest we write our deploy workflow so that we build it locally, build the API and dashboard images on top of it, and then push the latter two.

Sounds good.