monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
57 stars 26 forks source link

rewrite of Dataset.py to improve metadata emitted by Dipper #809

Closed justaddcoffee closed 5 years ago

justaddcoffee commented 5 years ago

To address #792, #753, #428, #545, and https://github.com/monarch-initiative/monarch-ui/issues/58

This PR will completely rewrite the Dataset.py to produce metadata that complies with the dataset HCLS spec

justaddcoffee commented 5 years ago

@TomConlin per our discussion, have a look at these commits

TomConlin commented 5 years ago

General Note: Tools I use to generate code which aspires to be PEP compliant impede my ability to review code which is less compliant.

- "Too many arguments"  happened too much to mention individually.(not much to do)
- line too long happens to much to mention individually (please address)

- We may not be able to stop ourselves from stuffing too many arguments 
into every function, but we can avoid putting them all on the same line.

- line limit 88 chars, favour leftmost continuations.

- break super long constants into globals & assemble then if need be 
- constants between ingests belong in super class
- i.e. source logo could be a function of 'name' & constants

- fix PEP issues introduced by this pr as found via static analysis.

- many thanks for the renaming of legacy variables and function e.g. 'gu'
- thanks for catching static fx w/self in panther    

Time. I have not done extensive timing tests and as far as I can see neither has anyone else I did hear mentioned "it adds a few minutes" but I would not be shocked to learn it added a few hours as sparql is over kill for plain counting and within python is only suitable for toy scale problems. Justin is going to add in flags to skip including the counts but that adds yet another path of not testing what goes to production.

I will advocate that dipper lives up to is name "data ingest pipeline"and perform the counts at a later stage in the pipeline as we move towards breaking apart the monolithic catch all we are calling a pipeline.


some notes on individual files

justaddcoffee commented 5 years ago

Time. I have not done extensive timing tests and as far as I can see neither has anyone else I did hear mentioned "it adds a few minutes" but I would not be shocked to learn it added a few hours as sparql is over kill for plain counting and within python is only suitable for toy scale problems. Justin is going to add in flags to skip including the counts but that adds yet another path of not testing what goes to production.

We perhaps should just remove these SPARQL queries I added - they arguably don't really add much to the metadata and inarguably add time to the ingests.

  • fix PEP issues introduced by this pr as found via static analysis.

Can you elaborate on this item?

justaddcoffee commented 5 years ago

OMIA are we sure ncbigene call changes are correct? looks regressive.

@TomConlin can you clarify this one? I don't see a change to the ncbigene call in OMIA

TomConlin commented 5 years ago

The last several (non category-pr) dipper pipeline runs panther finished in 12 - 13 hours. with 'dataset pr' and sparql counts on,
panther finished in 22.5 hours.

The sparql counts code have already been removed from the PR but it is worthwhile to record the results of the experiment.

will make another note when the results of the dataset-pr without sparql counts have finished and expect any time increase to be negligible compared with the great improvements in meta data.

justaddcoffee commented 5 years ago

Cheers @TomConlin

TomConlin commented 5 years ago

with omia & ncbi I am comparing what is in the mainline now (a couple of days ago) and your branch... oh I see my fork has uncommited changes leftover from looking for the 200k omim call issue. I need to commit (not likely) or revert. sorry for the false alarm.

TomConlin commented 5 years ago

re your last two commits they are not part of this PR there is a mechanism to employ when data changes and it does not involve hard coding numbers of columns.

justaddcoffee commented 5 years ago

re your last two commits they are not part of this PR

Agreed, that was a little lame to hard code column numbers. (See last commit.) Just trying to fix the failing ingests - it's a little hard to make sure the PR is working when the ingests from master don't run to completion.

there is a mechanism to employ when data changes and it does not involve hard coding numbers of columns.

What is the mechanism?

TomConlin commented 5 years ago

Looking at the dataset files from the Jenkins pipeline run 2019 Sep 16

justaddcoffee commented 5 years ago

re your last two commits they are not part of this PR

Made a separate PR for these, see #825. I've also added a fix for MMRRC ingest, which was failing.

TomConlin commented 5 years ago

the mechanism, if you have not found it yet, is the columns list in the files dict in each ingest. There is a fx Source.check_fileheader() which provide a shared behaviour for what is done when headers change.

Okay, thanks. This makes sense, although there was/is no column info in AnimalQTLdb.

In any case, I've reverted these fixes with e0f53643bdd073126f92e7df9dc6c87b26b76947 and cccc7cf73e3c8eb93ebca09c24460f445f095d5c, since they are a bit out of scope - I've made a separate PR instead.

justaddcoffee commented 5 years ago

• we discussed a code/feature freeze for this PR last week

Sorry for the late edits

• the @ prefix section does not have Monarch prefixes (please include any prefix used)

Changed dataset to use MonarchArchive prefix, and to write out CURIEs (e.g. MonarchArchive:rgd) instead of IRIs. (commit 3693fde16dfbb7d32af4e6cbb3fe9f38c583d5d4)

• the dcterms:identifier IRI object is written as a string literal.

Thanks for catching that - fixed. (commit 805e7e25f20a400414ba51e871919097fb82e9b9)

• the format of the the monarch archive IRI is: ◦ MonarchArchive/ingest_datestamp which is never a valid url ◦ we had discussed changing to the format ◦ MonarchArchive/datestamp/rdf/ingest which can be a valid url if 'ttl' is changed to 'rdf'

Yep, I like the new URL pattern MonarchArchive/datestamp/rdf/ingest

The downloadURL I’m writing in this PR conforms to what we discussed I think. It’s pointing to: http://archive.monarchinitiative.org/[version]/rdf/[source].[distribution file type] E.g. Here’s the downloadURL triple I’m writing for RGD: dcterms:downloadURL https://archive.monarchinitiative.org/201908/rdf/rgd.ttl ;

The summary, version, and distribution level IRIs I'm writing are identifiers that I think are coherent, but obviously are not URLs currently (i.e. they don’t point to a file or directory that exists). Summary level IRI: https://archive.monarchinitiative.org/[source] Version level IRI: https://archive.monarchinitiative.org/[source]_[version] Distribution level IRI: <https://archive.monarchinitiative.org/[source]_[version].[distribution file type]>

For example for RGD these are: https://archive.monarchinitiative.org/rgd https://archive.monarchinitiative.org/rgd_201908 https://archive.monarchinitiative.org/rgd_201908.ttl

Could you let me know what you’d prefer this for the IRIs? The following conform more to what you are talking about, but only the distribution level IRI points to anything real (i.e. only the last is an actual URL): https://archive.monarchinitiative.org/rgd # don’t think there is a way to make this point to a file or directory that actually exists https://archive.monarchinitiative.org/201908/rdf/rgd # also not pointing to anything real, but pointing to a directory that exists https://archive.monarchinitiative.org/201908/rdf/rgd.ttl # pointing to something downloadable I think these are a little less meaningful than the present IRIs I’m writing in this PR, but glad to change them if you like.

• we had discussed changing the directory ttl to rdf because not all output can be turtle ◦ ttl could remain a symlink of rdf for backwards compatibility

Yes, agreed - it’s good if TTLs and NTs live in the same directory, and also good if we keep this backward compatible.

• having exact same title three times is excessive ◦ please omit two or rewrite as something that justifies existence.

I’ve appended the version number to the version level title, and the file type to the distribution level title. I don’t think we can omit two - all three titles must be present according to the spec. (commit 13a79c76975435cf348a5016eecca1e2339a0121)

justaddcoffee commented 5 years ago

For my own info, two other tickets that are possibly addressed by this PR: https://github.com/monarch-initiative/dipper/issues/428 https://github.com/monarch-initiative/dipper/issues/545

TomConlin commented 5 years ago

Okay, thanks. This makes sense, although there was/is no column info in AnimalQTLdb.

Thanks, there was/is no column info in any ingest till very recently which is why we add it in when we have the file open.

E.g. Here’s the downloadURL triple I’m writing for RGD: dcterms:downloadURL https://archive.monarchinitiative.org/201908/rdf/rgd.ttl ;

That URL now resolves! (as will any of the .ttl or .nt)


More helpful would be better, we do not have version and summary for specific ingests, because they are not what we are producing.

Ingest rdf are a necessary subset of what we do produce which is an episodic dipper release.

I expect the only legitimate summary and version data for the ingest would have to be from the source, about the source, and although some ingests will have more than enough metadata to support these dataset object others will not. The only certainty is all will be different and out of our control. (read more perpetual maintenance)

For this I would argue our Version and Summary metadata URLs should be to Dipper release products (despite being repeated in each ingest)

Distribution level URL resolving to the download file I think is great.

Version level URL https://archive.monarchinitiative.org/[version] Could have a static or generated index.html to provide helpful wordage about our insufficiently frequent release process.

Summary level URL https://archive.monarchinitiative.org Should have a static index.html to provide helpful wordage about the datasets provided.

justaddcoffee commented 5 years ago

More helpful would be better, we do not have version and summary for specific ingests, because they are not what we are producing.

I guess I disagree with this. Each ingest is run separately and is emitted as a separate TTL|NT file, so I think it makes sense to have separate version and summary IRIs for each separate ingest of each source. This distinction seems useful/appropriate. For example, if we use the same version IRI for all ingests in a given Dipper data release version (say, https://archive.monarchinitiative.org/201909/), each of the 36 or so TTL files we emit in each version will claim to be this generic 201909 version of the dataset, but they'll all be different.

Also, if we conflate the version and summary IRIs for all ingests, we remove the ability to refer to the individual ingests separately further down the stack. It seems possible that this ability will be useful downstream when we use this metadata for provenance.

@cmungall thoughts?

TomConlin commented 5 years ago

We only loose the ability to refer to the individual ingests separately if we choose not to use the distribution level URL, the other identidiers do not refer to anything that does or will exists.

Appending the ingest-name to the intermediate level URL for Summary and Version is fine. Especially if written as an link to an anchor tag

i.e. https://archive.monarchinitiative.org/201908/rdf/rgd.ttl https://archive.monarchinitiative.org/201908/#rgd https://archive.monarchinitiative.org/#rgd

clearly not worse than more deliberate 404s even while there is no specific page to land on and they can be made useful some day by putting up index pages with matching anchors.

cmungall commented 5 years ago

Sorry, will try and give this some thought next week - is this a blocker?

On Tue, Sep 24, 2019 at 3:35 PM Justin Reese notifications@github.com wrote:

More helpful would be better, we do not have version and summary for specific ingests, because they are not what we are producing.

I guess I disagree with this. Each ingest is run separately and is emitted as a separate TTL|NT file, so I think it makes sense to have separate version and summary IRIs for each separate ingest of each source. This distinction seems useful/appropriate. For example, if we use the same version IRI for all ingests in a given Dipper data release version (say, https://archive.monarchinitiative.org/201909/), each of the 36 or so TTL files we emit in each version will claim to be this generic version of the Monarch dataset, but they'll all be different.

Also, if we conflate the version and summary IRIs for all ingests, we remove the ability to refer to the individual ingests separately further down the stack. It seems possible that this ability will be useful downstream when we use this metadata for provenance.

@cmungall https://github.com/cmungall thoughts?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/monarch-initiative/dipper/pull/809?email_source=notifications&email_token=AAAMMOMBQEQU47DKJO6OUUTQLKI3DA5CNFSM4IKPJNOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7QAJCY#issuecomment-534774923, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAMMONIF3D4GMTAHI4FXYLQLKI3DANCNFSM4IKPJNOA .

justaddcoffee commented 5 years ago

Sorry, will try and give this some thought next week - is this a blocker?

No worries - not really a blocker, more of a request for a philosophical second opinion.

TomConlin commented 5 years ago

+1