monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

add data source Flybase #8

Closed nlwashington closed 6 years ago

nlwashington commented 9 years ago

Add Flybase as a datasource.

We can connect to the remote Postgres database directly as documented here: http://gmod.org/wiki/Public_Chado_Databases

We currently access the following tables to generate the genotype and phenotype views: pub, feature_pub, pub_dbxref, feature_dbxref, feature_relationship, genotype, feature_genotype, cvterm, stock_genotype, stock, organism, environment, phenotype, phenstatement, dbxref, db, phenotype_cvterm, phendesc, strain.

nlwashington commented 9 years ago

This should be able to follow a similar model as we do for MGI... connect to the remote source, pull the tables that we need locally, then parse. If necessary, we can perform remote SQL queries, and save those locally too. But since the chado schema is already nearly triples, it should be a fairly straightforward transformation.
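To illustrate why Chado rows are "nearly triples", here is a minimal sketch, assuming the feature_relationship table has already been dumped locally as CSV with a header row. The columns subject_id, type_id and object_id are Chado's own; the function name and the sample rows are made up for illustration, and this is not dipper's actual code.

```python
import csv
import io


def rows_to_triples(csv_text):
    """Yield (subject_id, type_id, object_id) tuples from a CSV dump
    of Chado's feature_relationship table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # The relationship type plays the role of the predicate.
        yield (row["subject_id"], row["type_id"], row["object_id"])


# Hypothetical two-row dump; real ids would come from the remote database.
sample_dump = (
    "feature_relationship_id,subject_id,object_id,type_id\n"
    "1,101,202,27\n"
    "2,103,202,27\n"
)
triples = list(rows_to_triples(sample_dump))
```

In a real ingest the numeric ids would still need to be resolved to public identifiers through feature, dbxref and cvterm before they are useful outside the database.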

nlwashington commented 9 years ago

Note that previously we filtered the feature table to remove the "analysis" features (keeping only rows with is_analysis = false) and dropped the residues. This saved space and processing time. It may or may not be important here.

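  -- Full feature table minus analysis features.
  -- Originally run from a shell with the output redirected to feature_noanalysis.csv;
  -- inside a double-quoted shell string the escape is written E'\\\\'.
  copy (
    select *
    from feature
    where feature.is_analysis = false
  ) to stdout
  with csv header escape as E'\\';

  -- Same filter, with the residues column nulled out to save space.
  copy (
    select
      feature_id, dbxref_id, organism_id, name, uniquename, null as residues,
      seqlen, md5checksum, type_id, is_analysis, timeaccessioned, timelastmodified, is_obsolete
    from feature
    where is_analysis = false
  ) to stdout
  with csv header escape as E'\\';

nlwashington commented 9 years ago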

@cmungall the "phenotypes" in flybase are often composite terms. for example:

phenotype 9854 = "tergite | cell autonomous | somatic clone"

where each of the parts is mapped to an external identifier, like:

FBbt:00004476 ! tergite
FBcv:0000153 ! cell autonomous
FBcv:0000336 ! somatic clone

in the previous iteration we simply made a new class using the anatomical part that was affected, like:

FBbt:00004476PHENOTYPE

shall we continue this strategy? or would you want to make new phenotypic classes in uberonpheno based on these composite phenotypic definitions?
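For reference, the earlier strategy can be sketched roughly as below. This is a minimal illustration, assuming the composite label is pipe-delimited and the mapped term ids are already known. Both function names are hypothetical, and picking the first FBbt term as "the anatomical part" is an assumption.

```python
def split_composite_phenotype(label):
    """Split a composite label such as 'a | b | c' into its parts."""
    return [part.strip() for part in label.split("|")]


def anatomy_phenotype_class(term_ids):
    """Return '<FBbt id>PHENOTYPE' for the first anatomy term,
    or None if no FBbt term is present."""
    for term_id in term_ids:
        if term_id.startswith("FBbt:"):
            return term_id + "PHENOTYPE"
    return None


parts = split_composite_phenotype("tergite | cell autonomous | somatic clone")
class_id = anatomy_phenotype_class(
    ["FBbt:00004476", "FBcv:0000153", "FBcv:0000336"]
)
```

Note that this drops the FBcv qualifiers entirely, which is the information a composite class in uberonpheno would retain.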

nlwashington commented 9 years ago

Flybase also now has "human health" associations: curated links between alleles and the diseases those alleles are considered to model.

They don't seem to be available in Chado yet, but they do provide a dump here: ftp://ftp.flybase.net/releases/current/precomputed_files/human_disease/allele_human_disease_model_data_fb_2015_03.tsv.gz

Info on this curated data: http://flybase.org/static_pages/feature/previous/articles/2014_03/HumanDisease.html

nlwashington commented 9 years ago

also, need to get the db version from here: ftp://ftp.flybase.net/releases/current/

listed like FB2015_03
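A minimal sketch of pulling the release tag out of a directory entry or file name, assuming the tag always has the shape FB<year>_<two-digit release>, with an optional underscore after the prefix as in the precomputed file names. The pattern and function name are illustrative only.

```python
import re

# Matches 'FB2015_03' as well as the 'fb_2015_03' form used in file names.
RELEASE_PATTERN = re.compile(r"FB_?(\d{4})_(\d{2})", re.IGNORECASE)


def find_release(text):
    """Return the release tag normalised to 'FByyyy_nn', or None."""
    match = RELEASE_PATTERN.search(text)
    if match is None:
        return None
    return "FB{}_{}".format(match.group(1), match.group(2))


from_listing = find_release("FB2015_03")
from_filename = find_release(
    "allele_human_disease_model_data_fb_2015_03.tsv.gz"
)
```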

nlwashington commented 9 years ago

currently runs out of memory with 25M nodes (when writing the graph, not when building it).

nlwashington commented 9 years ago

fyi, now by adding the flag --format nt we can dump raw triples (ntriples) instead without running out of memory, and it dumps a 3.3G file.
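The reason a raw triple dump avoids the memory problem can be sketched as follows: each statement is written as soon as it is produced, so nothing has to be held in a graph object for serialization. This is a simplified illustration, not dipper's writer. It handles IRIs only (no literals, blank nodes or escaping), and the example.org IRIs are placeholders.

```python
import io


def write_ntriples(triples, out):
    """Write (subject, predicate, object) IRI triples to `out`,
    one N-Triples line at a time, and return the number written."""
    count = 0
    for subject, predicate, obj in triples:
        out.write("<{}> <{}> <{}> .\n".format(subject, predicate, obj))
        count += 1
    return count


def example_triples():
    """Generator standing in for the parser; yields triples lazily."""
    base = "http://example.org/"
    for i in range(3):
        yield (base + "s{}".format(i), base + "p", base + "o{}".format(i))


buffer = io.StringIO()
written = write_ntriples(example_triples(), buffer)
```

With a real file handle in place of the StringIO buffer, memory use stays flat regardless of how many triples the generator produces.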

nlwashington commented 8 years ago

@jmcmurry do you have an opinion on whether we should mint an identifier for a flybase genotype? they have internal numeric ids that they don't expose to the world. i was making these bnodes, but i think we need to materialize them. their other ids are like: FBbt:1234567, FBdv:1234567, FlyBase:FBgn1234567 (though wondering if we should actually make these like FB:gn1234567 instead)

some proposals, using the internal numeric id: FBgenoInternal:12345, FBgeno:12345, FBgenotype:12345, MONARCH:FBgenotype12345

any suggestions? cc @cmungall

jmcmurry commented 8 years ago

> though wondering if we should actually make these like FB:gn1234567 instead

Funny; I just asked @dosumis last week about this. I don't have a strong preference. Summary of trade-offs below. Flybase should decide. This has implications for RRID and, hopefully, NIH Commons as well.

FlyBase:FBgn1234567
Pros: This is the way it is presently used at GO. The LRI exists exactly as it has always been without modification, thus reducing the likelihood of false negatives when text mining for these LRIs.
Cons: Adds length and an awkward impression of a 'double prefix'.

FB:gn1234567
Pros: This is the version that is most visually similar to the current LRIs in existence; doesn't add any length.
Main con: Creates an alternate version of identifiers already in existence.
Minor con: Compared to 'FlyBase', 'FB' is perhaps slightly less recognizable to someone outside the community? Maybe irrelevant.

jmcmurry commented 8 years ago

@dosumis verdict (for existing resolvable identifiers) is FlyBase:FBgn1234567 which is convenient for all since this is consistent with what we are doing already.

jmcmurry commented 8 years ago

We may want to advocate a common pattern that all data integrators can use when minting/maintaining others' internal identifiers for them; let's call these surrogate identifiers:

Ergo, if FlyBase does materialize these IDs later, we can:

Note that this approach only addresses resolvability and transparency of the LRI on behalf of the provider. If the persistence/uniqueness of that LRI is something that the provider cannot commit to at this time, we should figure something else out altogether whereby we mint the LRIs as well; this would be a huge responsibility I would prefer to avoid.


In the Monarch App, we currently hyperlink CURIEs either to application pages, or externally to their URIs of record, depending on what we want the user to do. Thus, we should consider both implications of our surrogacy.

Should we:

The first option is more work than the second but more hygienic and less confusing for the user.

I've advocated for type-agnostic URLs for our application pages, but that is a long way off and somewhat unrelated to the question at hand.

jmcmurry commented 8 years ago

@dosumis is FBgeno1234567 OK as a pattern for surrogate genotype identifiers?

dosumis commented 8 years ago

It fits with the FlyBase pattern. But you'd certainly need to make it clear this comes from Monarch & not FlyBase ( => Monarch:FBgeno1234567). Given how similar it looks to a FlyBase ID, you might want to check with them too. (Andy Schroeder might be a good person to talk to).

cmungall commented 8 years ago

Yes, the CURIE would be Monarch:FBgeno1234567
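A minimal sketch of building such a surrogate CURIE from FlyBase's internal numeric genotype id. The function name is hypothetical, and the seven-digit zero padding is an assumption made only to mirror the shape of FBgeno1234567; nothing in this thread fixes the width.

```python
def surrogate_genotype_curie(internal_id, prefix="Monarch"):
    """Build a surrogate CURIE such as 'Monarch:FBgeno0012345'
    from an internal numeric genotype id."""
    if internal_id < 0:
        raise ValueError("internal_id must be non-negative")
    # Assumption: zero-pad to seven digits, like other FlyBase-style ids.
    return "{}:FBgeno{:07d}".format(prefix, internal_id)


padded = surrogate_genotype_curie(12345)
full_width = surrogate_genotype_curie(1234567)
```

Keeping the prefix as a parameter makes it cheap to switch to another form later if the prefix question is settled differently.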

jmcmurry commented 8 years ago

Hmm @dosumis ...The fact that it 'comes from Monarch' is an artifact of our surrogacy. While the identifier is surrogately owned by us (currently), the entity will always be owned by FlyBase.

Is your fear that people will copy the LRI and paste it into FlyBase search box? Does that possibility go away if you use the MONARCH: prefix instead? I don't feel adamant about it, but I'd prefer to see the CURIE as FlyBase-s:FBgeno1234567 or some minimally differentiated form.

This FlyBase-s approach has the following advantages:

Happy to be convinced this is a bad idea ... @cmungall? @nlwashington?

dosumis commented 8 years ago

> Is your fear that people will copy the LRI and paste it into FlyBase search box?

That's my fear. CURIEs are primarily for resolution of IDs. If you make it look like they come from FlyBase, people will try to find entities in FlyBase (website/DB, or an API if they ever make one) via these IDs.

You can make sure that the entity resolved at Monarch makes the origins clear, acknowledging FB properly.

TomConlin commented 6 years ago

With skolemized blank nodes working as designed I think we should stop pretending we are minting identifiers and just accept what they are without the lipstick.

Perhaps up the priority of a blank node page in the Monarch app
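For readers unfamiliar with the term, skolemization replaces a blank node with an IRI under the /.well-known/genid/ path described in RDF 1.1 Concepts, so the node can be referenced and recognised as a former blank node. A minimal sketch follows; the base URL is a placeholder assumption, and the function names are illustrative rather than dipper's.

```python
# Assumption: placeholder authority; a real deployment would use its own domain.
SKOLEM_BASE = "https://example.org"
GENID_PATH = "/.well-known/genid/"


def skolemize(bnode_label, base=SKOLEM_BASE):
    """Turn a blank node label such as '_:b42' into a Skolem IRI."""
    label = bnode_label[2:] if bnode_label.startswith("_:") else bnode_label
    return base.rstrip("/") + GENID_PATH + label


def is_skolem_iri(iri):
    """Report whether an IRI follows the well-known genid pattern."""
    return GENID_PATH in iri


skolem_iri = skolemize("_:b42")
```

A check like is_skolem_iri is also what would make it possible to count how many statements in the graph involve blank nodes.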

cmungall commented 6 years ago

Or avoid blank nodes in the first place: http://bit.ly/monarch-kg-modeling

Represent simple links in the graph, treat as shortcut relations that can expand to granular OWL when required

TomConlin commented 6 years ago

Baby steps: first we have to admit we have a problem and have been papering over it. Laying blank nodes, in their many forms, bare and getting metrics on the percentage of statements that are in fact structure, not data, will hopefully provide literal goals which incentivise convergent harmonization of milestones.