I had wanted to make an incremental improvement in the longest-running timed ingest,
but test runs indicate that although more orthologs are processed in the same time,
many more orthologs are being found to process.
Spot checks (on a sample run) indicate the new orthologs should have been there all along.
Missed genes seem to preferentially involve Ensembl identifiers.
I expect this ingest will be a wash for time (~16 hours) and will output 300% of the previous volume (which includes 100% of the previous output).
I will also note that the additional genes/orthology/relations are not of a different kind,
just a haphazard sprinkle of more of the same.
Update after full run:
6 hours 17 minutes runtime and virtually identical output (9k fewer statements out of 52.7M).
Things lost include URIs incorrectly expressed as literals.
Things gained (all 22 of them) are limited to Dataset metadata.
Plausible explanation: running on data sets truncated to the first few tens of thousands of rows incurred a higher fixed cost, but the new version does a better job of extracting.
WormBase may be overrepresented in the 9k statements dropped.
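
The lost/gained comparison above can be reproduced with a small set difference over the two outputs. This is only a sketch: it assumes the ingest output is N-Triples with one statement per line (the notes do not say which serialization is used), and `load_statements`, `object_term`, and `diff_outputs` are hypothetical helper names, not part of the ingest code.

```python
import re

# Assumption: a "URI expressed as a literal" is a plain quoted string whose
# entire content looks like an http(s) URI, e.g. "http://example.org/x".
LITERAL_URI = re.compile(r'"(https?://[^"\s]+)"')


def load_statements(lines):
    """Return the set of non-empty, non-comment statements."""
    return {
        ln.strip()
        for ln in lines
        if ln.strip() and not ln.lstrip().startswith("#")
    }


def object_term(statement):
    """Return the object term of a simple N-Triples statement."""
    body = statement.rstrip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    # Subject and predicate contain no whitespace, so split at most twice;
    # whatever remains is the object (which may itself contain spaces).
    parts = body.split(None, 2)
    return parts[2] if len(parts) == 3 else ""


def diff_outputs(old_lines, new_lines):
    """Return (lost, gained, lost statements whose object is a URI-like literal)."""
    old, new = load_statements(old_lines), load_statements(new_lines)
    lost, gained = old - new, new - old
    uri_literals = {s for s in lost if LITERAL_URI.fullmatch(object_term(s))}
    return lost, gained, uri_literals


# Tiny made-up example; the real inputs would be the previous and new output files.
old = [
    '<http://ex.org/g1> <http://ex.org/p> "http://ex.org/g2" .',
    '<http://ex.org/g1> <http://ex.org/p> <http://ex.org/g3> .',
]
new = [
    '<http://ex.org/g1> <http://ex.org/p> <http://ex.org/g3> .',
    '<http://ex.org/d> <http://ex.org/version> "2" .',
]
lost, gained, uri_literals = diff_outputs(old, new)
print(len(lost), len(gained), len(uri_literals))
```

Holding two sets of roughly 52.7M statements in memory is expensive, so for the full output a sorted-file comparison may be more practical; the sketch is meant to show the classification step rather than to scale.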
Reworked the gene cleaning logic; it should be improved.
Rework the Jenkins file a bit.
Find a way of validating it before use.