Closed by fiendish 4 years ago
cc @fiendish
Including @liberaliscomputing's suggestion here so we remember:
I am moving #414 here as a comment:
The current warehouse persistence is done by:
PCGC's default transform output has more than 200K rows. Since `to_sql` defaults to inserting rows one by one, insertion is quite slow. This behavior can be adjusted via the `chunksize` and `method` arguments (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html) to achieve faster insertion.
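A minimal sketch of the suggested tuning, using an in-memory SQLite database and a small stand-in DataFrame (the table name `warehouse_table` and the column names are illustrative, not from the actual pipeline):

```python
import pandas as pd
from sqlalchemy import create_engine

# Toy stand-in for the PCGC transform output
# (the real output has 200K+ rows; 1,000 keeps the demo fast).
df = pd.DataFrame({"id": range(1000), "value": [i * 2 for i in range(1000)]})

engine = create_engine("sqlite://")  # in-memory database for the demo

# method="multi" packs multiple rows into a single INSERT statement,
# and chunksize bounds how many rows are written per batch.
df.to_sql(
    "warehouse_table",
    engine,
    if_exists="replace",
    index=False,
    chunksize=500,
    method="multi",
)

count = pd.read_sql("SELECT COUNT(*) AS n FROM warehouse_table", engine)["n"][0]
print(count)
```

Note that `method="multi"` helps most on backends where per-statement round trips dominate; some drivers (e.g. psycopg2) also offer their own bulk-load paths via a custom callable passed to `method`.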
> Since to_sql defaults to inserting all rows one by one
That's not what the documentation says. It says the default is to write them "all at once," which is the opposite of one by one. Someone should tell them that their documentation is misleading.
https://github.com/d3b-center/clinical-data-distribution/issues/2