monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Panther: refactor for efficiency and correctness #881

Closed TomConlin closed 4 years ago

TomConlin commented 4 years ago

I had wanted to make an incremental improvement to the longest-running timed ingest, but test runs indicate that although more orthologs are processed in the same time,
many more orthologs are being found to process. Spot checks (on a sample run) indicate the new orthologs should have been there all along. The missed genes seem to preferentially involve Ensembl identifiers.

I expect this ingest will be a wash for time (~16 hours) and will output 300% of the previous run (including 100% of the previous output).

I will also note that the additional genes/orthology/relations are not of a different kind, just a haphazard sprinkle of more of the same.

Update after full run

6 hours 17 minutes runtime and virtually identical output (9k fewer statements out of 52.7M). Things lost include URIs incorrectly expressed as literals. Things gained (all 22 of them) are limited to Dataset metadata.
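For anyone wanting to repeat this kind of before/after check, here is a minimal sketch of how the lost and gained statements could be tallied. It assumes both runs are serialized as N-Triples with one statement per line. The names `compare_runs`, `split_triple`, and `URI_LIKE_LITERAL` are hypothetical and not part of dipper, and the "URI expressed as a literal" test is only a heuristic (a quoted value that starts with an http, https, or ftp scheme).

```python
import re

# Heuristic (an assumption, not dipper's rule): a quoted literal whose lexical
# form looks like an IRI, optionally followed by a datatype or language tag.
URI_LIKE_LITERAL = re.compile(
    r'^"(?:https?|ftp)://[^"\s]*"(?:\^\^<[^>]*>|@[A-Za-z-]+)?$'
)


def split_triple(line):
    """Split one N-Triples line into (subject, predicate, object)."""
    body = line.strip()
    if body.endswith('.'):
        # Drop the statement terminator and any whitespace before it.
        body = body[:-1].rstrip()
    subject, predicate, obj = body.split(None, 2)
    return subject, predicate, obj


def statements(lines):
    """Return the set of non-blank, non-comment statements."""
    cleaned = (line.strip() for line in lines)
    return {line for line in cleaned if line and not line.startswith('#')}


def compare_runs(old_lines, new_lines):
    """Return (lost, gained, lost statements with a URI-like literal object)."""
    old, new = statements(old_lines), statements(new_lines)
    lost = old - new
    gained = new - old
    uri_literals = sorted(
        s for s in lost if URI_LIKE_LITERAL.match(split_triple(s)[2])
    )
    return lost, gained, uri_literals


# Tiny made-up example: one object changes from a literal to a proper IRI.
old_run = [
    '<http://ex.org/g1> <http://ex.org/p> <http://ex.org/g2> .',
    '<http://ex.org/g1> <http://ex.org/page> "http://ex.org/page1" .',
    '<http://ex.org/g1> <http://ex.org/label> "gene one" .',
]
new_run = [
    '<http://ex.org/g1> <http://ex.org/p> <http://ex.org/g2> .',
    '<http://ex.org/g1> <http://ex.org/page> <http://ex.org/page1> .',
    '<http://ex.org/g1> <http://ex.org/label> "gene one" .',
]
lost, gained, uri_literals = compare_runs(old_run, new_run)
```

Holding both runs in memory as sets is fine for samples, but for outputs in the tens of millions of statements a comparison over sorted files would likely be more practical.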

Plausible explanation: running on data sets truncated to the first few tens of thousands of rows incurred a higher fixed cost, but the new version does a better job of extracting.

WormBase may be overrepresented in the 9k statements dropped. Reworked the gene cleaning logic; it should be improved.


Rework the Jenkinsfile a bit. Find a way of validating it before use.