kids-first / kf-lib-data-ingest

🏭 Kids First Data Ingest Library
https://kids-first.github.io/kf-lib-data-ingest/
Apache License 2.0

Persist stage output to remote database in addition to local CSVs #408

Closed fiendish closed 4 years ago

fiendish commented 4 years ago

https://github.com/d3b-center/clinical-data-distribution/issues/2

znatty22 commented 4 years ago

cc @fiendish

Including @liberaliscomputing's suggestion here so we remember:

liberaliscomputing commented 4 years ago

I am moving #414 here as a comment:

The current warehouse persistence is done by:

PCGC's default transform output has more than 200K rows. Since to_sql defaults to inserting all rows one by one, insertion is pretty slow. This behavior can be adjusted with the chunksize and method args (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html) to achieve faster insertion.
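A minimal sketch of what adjusting those two arguments could look like. The table name, column names, and generated rows are hypothetical, and an in-memory SQLite connection stands in for the remote warehouse; this is not the library's actual persistence code. Whether `method="multi"` is actually faster depends on the database and driver, so it should be measured against the real target.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for a stage's output DataFrame.
df = pd.DataFrame(
    {
        "participant_id": [f"P{i}" for i in range(1000)],
        "age_days": list(range(1000)),
    }
)

# In-memory SQLite is used here only so the sketch is self-contained;
# a real warehouse would use its own connection or SQLAlchemy engine.
conn = sqlite3.connect(":memory:")

# chunksize=100 writes the rows in batches of 100.
# method="multi" packs each batch into a single multi-row INSERT
# instead of issuing one INSERT statement per row.
df.to_sql(
    "transform_output",  # hypothetical table name
    conn,
    if_exists="replace",
    index=False,
    chunksize=100,
    method="multi",
)

row_count = conn.execute("SELECT COUNT(*) FROM transform_output").fetchone()[0]
print(row_count)
```

Keeping `chunksize` modest also keeps the number of bound parameters per statement (rows times columns) under whatever limit the target database imposes.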

fiendish commented 4 years ago

> Since to_sql defaults to inserting all rows one by one

That's not what the documentation says. It says the default is to send them "all at once", which is the opposite of one by one. Someone should tell them that their documentation is misleading.