ml4ai / delphi

Framework for assembling causal probabilistic models from text and software.
http://ml4ai.github.io/delphi
Apache License 2.0

Set up DB for storing parameterization data. #65

Closed adarshp closed 5 years ago

adarshp commented 6 years ago

Right now, Delphi uses data tables stored as plain text files to parameterize its models. However, this will not scale as the amount of data increases. Another concern is minimizing git repo bloat. For these reasons, it might be good to have an online database (hosted on vision or a SISTA server) that Delphi can query programmatically. I'm leaning towards Neo4j since that's the DB system I have the most experience with.

cl4yton commented 6 years ago

This is a good point. My only concern with hosting data through an online DB is the situation where there is no internet connection: (1) during dev (on a plane or generally somewhere without internet); (2) in a demo situation (e.g., @DARPA or another secured facility); (3) maintaining functionality so that others are still able to easily deploy and use delphi without a lot of setup/overhead and additional system interaction/coordination requirements. For this reason, I think we should ensure there is always a way to have a local data-serving DB -- something that might be lightweight relative to the "main" server -- so that a clone of the repo still yields a "completely functional" delphi instance. We do already handle a version of this with our dependency on Indra (when Indra is not present, we can still do other things). I don't have a strong opinion about the DB; my only thoughts here are (1) to keep as few dependencies as possible and (2) to keep things as close to Python 3 as possible.

adarshp commented 6 years ago

Completely agree about having the option of a local copy of the DB. However, I'm not sure it should live in the same repo as the Delphi source code - the data might undergo lots of changes in the future, leading to very large diffs and a repo that takes forever to clone.

Another option that I was thinking about is to have a two-step installation process - first install Delphi via pip, then download the DB from a separate URL (I could put it on vision) and place it in the delphi/data directory. This is sort of what Eidos does for grounding - the 5 GB GloVe embeddings are not included in the repo, but are pointed to in the repo's README. The causal relation extraction part of Eidos works without the embeddings, but the grounding doesn't happen.
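A minimal sketch of what that second download step could look like (the URL and target path below are assumptions, not settled choices):

import urllib.request
from pathlib import Path

# Hypothetical URL -- wherever the DB ends up being hosted (e.g., on vision).
DB_URL = "http://example.com/delphi/delphi.db"

# Download the parameterization DB into the local data directory.
target = Path("delphi/data")
target.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(DB_URL, target / "delphi.db")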

In terms of DB choice, good point about keeping the number of dependencies small - perhaps SQLite might be a better choice, since it's supported by the Python standard library.
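For reference, a minimal sketch of what that could look like with the standard library's sqlite3 module (the table layout, path, and values below are placeholders, not the actual schema):

import sqlite3

# Open (or create) a local DB file; the path is illustrative.
conn = sqlite3.connect("delphi.db")
conn.execute("CREATE TABLE IF NOT EXISTS indicator (country TEXT, variable TEXT, year INTEGER, value REAL)")
# Placeholder row -- real rows would come from the FAO/WDI/FEWS tables.
conn.execute("INSERT INTO indicator VALUES (?, ?, ?, ?)", ("South Sudan", "example indicator", 2017, 1.0))
conn.commit()
print(conn.execute("SELECT * FROM indicator WHERE country = ?", ("South Sudan",)).fetchall())
conn.close()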

Actually, just thought of something - perhaps the Jataware folks can help? @jgawrilo - do you have any advice on how to distribute potentially large amounts of data (that can and will change) along with source code while not blowing up the size of the git repository?

cl4yton commented 6 years ago

Just to capture in this thread what I think we (@adarshp, Clay) arrived at yesterday, we are considering three options:

1. A lightweight local DB (likely sqlite) with just enough data to exercise the entire functionality of delphi (perhaps this is implicated in unit testing?), so that given a local clone, one has a complete version of delphi to dev against.
2. A separate hosted DB for full delphi-related storage (likely hosted on vision).
3. A downloadable full copy of (2), so that the full DB can also be run locally (as @adarshp notes is done in eidos).

@adarshp: Did I capture everything?
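As a sketch of how options (1) and (2) might coexist in code: an environment variable (the name DELPHI_DB below is purely an assumption) could select the connection string, defaulting to the lightweight local SQLite file.

import os
from sqlalchemy import create_engine

# DELPHI_DB is a hypothetical environment variable; if it is unset, fall back
# to the lightweight SQLite file that ships with (or is downloaded into) the repo.
db_url = os.environ.get("DELPHI_DB", "sqlite:///delphi/data/delphi.db")
engine = create_engine(db_url, echo=False)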

adarshp commented 6 years ago

@cl4yton - yep! That covers everything.

adarshp commented 6 years ago

This could (and I think potentially should) be the same as the database that we will use to store Delphi models to be called by Uncharted.

jgawrilo commented 6 years ago

Very sorry I missed this! Not sure what happened. Have you landed on a solution? How much data are you talking about? It may make sense to have a program-wide db instance and/or to have procedures detailed on how to set up and populate the instance.

adarshp commented 6 years ago

No worries! We are going with SQLite for a couple of reasons:

It's not a huge amount of data right now if we restrict ourselves to South Sudan and a couple of data sources - less than 15 MB - but we expect that going forward, as we expand to other countries and multiple data sources, the data required for parameterization could quickly grow and become infeasible to keep in memory.

I agree it would make sense to have a program-wide db instance - I know that @bgyori would also like to use this data to parameterize ODE models assembled by INDRA. Is this something that Jataware would set up?

bgyori commented 6 years ago

I am also wondering if the data catalog group at ISI is already working on the infrastructure to host data like this. Would that be the right place to store FAO, WDI, FEWS, etc. data? Or maybe I misunderstand what the data catalog folks are aiming to do!

adarshp commented 6 years ago

As of the time of the WC hackathon, my understanding is that the data catalog was meant to provide links to download the data from external sources, rather than ISI hosting the data themselves. That being said, Jay mentioned that perhaps they might do that for a few high-value datasets such as FAO/WDI datasets. I'll follow up with him to see what they are up to.

bgyori commented 6 years ago

In any case, I think SQLite is nice and lightweight, and getting the data into a DB is as easy as this:

import pandas
from sqlalchemy import create_engine
# In-memory SQLite engine; nothing is written to disk.
engine = create_engine('sqlite://', echo=False)
# Load the WDI CSV and write it to a 'wdi' table in the DB.
df = pandas.read_csv('WDIData.csv')
df.to_sql('wdi', con=engine)

Then you can do e.g.:

>>> engine.execute("SELECT * FROM wdi WHERE `Country Name` LIKE 'South Sudan'").fetchone()

(360000, 'South Sudan', 'SSD', '2005 PPP conversion factor, GDP (LCU per international $)', 'PA.NUS.PPP.05', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)

adarshp commented 6 years ago

That's awesome, thanks for the code snippet! I just learned about sqlalchemy over the past couple of weeks while implementing the Uncharted API - I will try out your method and ingest the south_sudan_data.csv table with pandas, outputting it to sqlite.
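A sketch of that ingestion, assuming a file-backed DB under delphi/data (the path and table name are assumptions); reading the CSV in chunks keeps memory bounded as the data grows:

import pandas as pd
from sqlalchemy import create_engine

# File-backed SQLite engine (path is an assumption), rather than the in-memory
# 'sqlite://' engine in the snippet above, so the result persists on disk.
engine = create_engine("sqlite:///delphi/data/delphi.db", echo=False)

# Read the CSV in chunks so the whole table never has to sit in memory at once.
for chunk in pd.read_csv("south_sudan_data.csv", chunksize=10000):
    chunk.to_sql("indicator", con=engine, if_exists="append", index=False)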

adarshp commented 5 years ago

Closed by #149