Persist stage output to remote database in addition to local CSVs

fiendish commented 4 years ago

Configuration for database location/creds
Persist intermediate storage in warehouse
1. Drop extract and transform tables for the current study from the warehouse
2. Load new extract and transform tables into the warehouse
  - Each study will have its own database
  - Inside a study's database, each stage's output tables will be persisted to its own schema

https://github.com/d3b-center/clinical-data-distribution/issues/2

znatty22 commented 4 years ago

cc @fiendish

Including @liberaliscomputing's suggestion here so we remember:

Ability to turn on/off loading into warehouse via CLI param or ingest config param
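One way the toggle could work, as a sketch only: the `--warehouse/--no-warehouse` flag and the `load_to_warehouse` config key are made-up names, and `argparse` is used here purely for illustration rather than whatever the ingest CLI actually uses. An explicit CLI flag wins; otherwise the ingest config value applies; otherwise loading defaults to on (also an assumption).

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ingest-sketch")
    # BooleanOptionalAction (Python 3.9+) provides --warehouse and --no-warehouse.
    # default=None lets us tell "flag not given" apart from an explicit choice.
    parser.add_argument(
        "--warehouse",
        action=argparse.BooleanOptionalAction,
        default=None,
        help="turn loading into the warehouse on or off",
    )
    return parser


def warehouse_enabled(cli_value: Optional[bool], config: dict) -> bool:
    """Resolve the toggle: CLI flag first, then config, then default to on."""
    if cli_value is not None:
        return cli_value
    return bool(config.get("load_to_warehouse", True))


args = build_parser().parse_args(["--no-warehouse"])
print(warehouse_enabled(args.warehouse, {"load_to_warehouse": True}))
```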

liberaliscomputing commented 4 years ago

I am moving #414 here as a comment:

The current warehouse persistence is done by:

~~https://github.com/kids-first/kf-lib-data-ingest/blob/optimize-accounting/kf_lib_data_ingest/etl/extract/extract.py#L136-L142~~

~~https://github.com/kids-first/kf-lib-data-ingest/blob/optimize-accounting/kf_lib_data_ingest/etl/transform/guided.py#L368-L374~~
https://github.com/kids-first/kf-lib-data-ingest/blob/do_not_mainline_this/kf_lib_data_ingest/etl/extract/extract.py#L130-L135
https://github.com/kids-first/kf-lib-data-ingest/blob/do_not_mainline_this/kf_lib_data_ingest/etl/transform/guided.py#L364-L370

PCGC's default transform output has more than 200K rows. Since `to_sql` defaults to inserting all rows one by one, insertion is pretty slow. This behavior can be adjusted with the `chunksize` and `method` args (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html) to achieve faster insertion.
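A minimal sketch of passing `chunksize` and `method` to `to_sql`. It writes to an in-memory SQLite database as a stand-in for the warehouse, and the table and column names are made up. Whether this is actually faster depends on the database and driver, so it should be measured against the real warehouse.

```python
import sqlite3

import pandas as pd

# Small stand-in for a large transform output (hypothetical columns)
df = pd.DataFrame(
    {
        "participant_id": range(1000),
        "value": [i * 2 for i in range(1000)],
    }
)

# In-memory SQLite instead of the real warehouse connection
con = sqlite3.connect(":memory:")

df.to_sql(
    "transform_output",
    con,
    if_exists="replace",  # drop and recreate the table if it already exists
    index=False,
    chunksize=200,   # write rows in batches of 200
    method="multi",  # pass multiple rows in a single INSERT statement
)

row_count = con.execute("SELECT COUNT(*) FROM transform_output").fetchone()[0]
print(row_count)
```

With `method="multi"` each batch becomes one statement with `chunksize * number_of_columns` bound parameters, so the batch size has to stay under the database's parameter limit.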

fiendish commented 4 years ago

> Since `to_sql` defaults to inserting all rows one by one

That's not what the documentation says. It says the default is to send them "all at once" which is the opposite of one by one. Someone should tell them that their documentation is misleading.

Persist stage output to remote database in addition to local CSVs #408