openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

MySQL access #16

Closed abravo84 closed 6 years ago

abravo84 commented 6 years ago

Hi guys! As mentioned before, we are developing a GATE component. This component uses multiple word embeddings (such as PubMed, Wikipedia and Google). The word embeddings are stored in our MySQL database because each one contains more than 7 GB of data.

We can enable access to this database, but we do not know whether that would cause any problems on the OpenMinTeD platform.


reckart commented 6 years ago

As long as the DB is inside the Docker image and as long as it is read-only (i.e. no state must be persisted across component executions), IMHO it should not be a problem.

reckart commented 6 years ago

Mind, the component would not access your DB but rather a MySQL DB running in the Docker image that you submit which contains your embeddings data.
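A minimal sketch of what read-only access to a database bundled with the image could look like. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server; the `embeddings` table, its columns, and the helpers `build_demo_db`, `open_read_only` and `lookup` are hypothetical and not taken from the component under discussion.

```python
import json
import os
import sqlite3
import tempfile


def build_demo_db(path):
    """Create a tiny embeddings table (hypothetical schema), as an image build step might."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE embeddings (word TEXT PRIMARY KEY, vector TEXT)")
    conn.execute(
        "INSERT INTO embeddings VALUES (?, ?)",
        ("protein", json.dumps([0.1, 0.2, 0.3])),
    )
    conn.commit()
    conn.close()


def open_read_only(path):
    """Open the database read-only, so no state can persist across executions."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def lookup(conn, word):
    """Return the vector for a word, or None if the word is unknown."""
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE word = ?", (word,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Demo: build the file once, then only ever read from it.
db_path = os.path.join(tempfile.mkdtemp(), "embeddings.db")
build_demo_db(db_path)
conn = open_read_only(db_path)
vector = lookup(conn, "protein")
missing = lookup(conn, "not-a-word")

# A write attempt on the read-only connection is rejected.
try:
    conn.execute("INSERT INTO embeddings VALUES ('x', '[]')")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
```

With MySQL the same idea would apply, assuming the server inside the container is configured so that the component's account can only run `SELECT`.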

galanisd commented 6 years ago

I agree with @reckart. If the DB is running in the container that is created from the image that you provide, you will not have any issues now or in the future. If it is installed and running somewhere else, for example on an external server, then a firewall will likely block access.

antleb commented 6 years ago

We are talking about several gigabytes of disk space (7 GB per embedding, with multiple embeddings), which would require too much disk space in the OpenMinTeD infrastructures (test and production). My suggestion is to leave the data in the original MySQL database and allow access to it from the running component, even as an exception to the general "no network access" rule.

@saxtouri is that possible?

galanisd commented 6 years ago

@abravo84

Is it possible to estimate the size of a Docker image that contains the DB with all required data (word embeddings)?
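A rough back-of-the-envelope sketch of such an estimate. The only figure taken from this thread is "more than 7 GB" per embedding, so the result is a lower bound; the count of three embeddings (the thread names them only as examples), the 1 GB allowance for the base image and MySQL itself, and the function name `estimate_image_size_gb` are assumptions for illustration.

```python
def estimate_image_size_gb(embedding_sizes_gb, base_image_gb=1.0):
    """Lower-bound image size: raw embedding data plus an assumed base image size."""
    return sum(embedding_sizes_gb.values()) + base_image_gb


# Sizes are lower bounds from the thread ("more than 7 GB" each);
# the set of three embeddings is an assumption.
embedding_sizes_gb = {"PubMed": 7.0, "Wikipedia": 7.0, "Google": 7.0}

total_gb = estimate_image_size_gb(embedding_sizes_gb)
print(f"Estimated image size: at least {total_gb:.0f} GB")
```

Index and storage-engine overhead in the database are not included, so the real image would likely be larger.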

reckart commented 6 years ago

@antleb accessing an external DB is fragile, potentially very slow due to network latency depending on where the hosts are located, and also impedes the reproducibility of workflows built using the component. I am surprised that a couple of gigabytes are an issue these days.