All of the changes here are required for `326_nih`, which concerns the (re)collection of `nih` data, with various improvements to data quality and pipeline running time, whilst factoring `nih` out of the `health-mosaic` project for use in `eurito`.
A few utils for:
Merging data by upsertion. With raw SQL there can be some ambiguity about which parts are being updated and how, so instead I opt for:

0. "manually" creating the updated rows in Python, based on the most recent non-null data;
1. opening a transaction;
2. dropping the rows which are going to be updated;
3. inserting the new rows.

A sketch of this pattern is given below.
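A minimal sketch, assuming SQLAlchemy 1.4+ (where `insert()` accepts an ORM class directly, as the `insert(_class)` call above suggests) and a single-column primary key; the name `upsert_rows` and the pre-merged `updated_rows` list of dicts are illustrative, not the actual util:

```python
from sqlalchemy import insert

def upsert_rows(session, _class, updated_rows, pkey="id"):
    """Upsert by deleting stale rows and re-inserting fresh ones,
    all inside a single transaction.

    `updated_rows` is assumed to be a list of dicts, already merged
    in Python from the most recent non-null values (step 0).
    """
    pkeys = [row[pkey] for row in updated_rows]
    try:
        # Steps 1-2: within the open transaction, drop the rows
        # which are about to be updated
        (session.query(_class)
                .filter(getattr(_class, pkey).in_(pkeys))
                .delete(synchronize_session=False))
        # Step 3: insert the new rows in one multi-values INSERT
        session.execute(insert(_class).values(updated_rows))
        session.commit()
    except Exception:
        session.rollback()
        raise
```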
This approach can be fairly slow, but it turns out that `insert(_class).values(chunk)` is 5x more performant than `session.bulk_save_objects(objs)` in the least impressive case, and potentially 10-100x more performant depending on the number of rows being inserted. In the past this was causing a bottleneck for the NiH collection, but it looks to be solved now.
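For illustration, a sketch of the faster path next to the ORM call it replaces; the function name and chunk size are hypothetical:

```python
from sqlalchemy import insert

def insert_chunks(session, _class, rows, chunksize=10_000):
    """Insert `rows` (a list of dicts) via one multi-values INSERT
    per chunk, avoiding per-object ORM bookkeeping."""
    for i in range(0, len(rows), chunksize):
        chunk = rows[i:i + chunksize]
        session.execute(insert(_class).values(chunk))
    session.commit()

# The slower ORM equivalent being replaced:
#   objs = [_class(**row) for row in rows]
#   session.bulk_save_objects(objs)
#   session.commit()
```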
Additionally:
`bucket_keys` is introduced as a util for retrieving all of the keys in a given bucket. It's a standard piece of boilerplate that I'm sick of writing.
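A plausible implementation, assuming the bucket lives on S3 and boto3 is available (the actual util may differ):

```python
import boto3

def bucket_keys(bucket_name):
    """Return all object keys in an S3 bucket, transparently
    handling the 1000-key pagination limit of list_objects_v2."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```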
A full set of tests has been added to cover the new features; plus, you'll have to take my word that I've run the `arxiv`, `crunchbase`, `cordis`, `gtr` and `patstat` collections in `dev` mode and the new features haven't killed anything... (this is why we need end-to-end tests in DAPS2...). These runs uncovered the need for a small number of very minor fixes to data before it is inserted into the database, in some of the oldest parts of the pipelines (`crunchbase`, `arxiv`, `gtr`).
Developed to facilitate #326