monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

add data source zfin #12

Closed nlwashington closed 8 years ago

nlwashington commented 9 years ago

add zfin geno-pheno data: http://www.zfin.org/downloads

we currently take in the following:

http://zfin.org/downloads/Morpholinos.txt
http://zfin.org/Downloads/phenotype.txt
http://zfin.org/Downloads/pheno_environment.txt
http://zfin.org/Downloads/stage_ontology.txt
http://zfin.org/Downloads/anatomy_item.txt
http://zfin.org/Downloads/wildtype-expression.txt
http://zfin.org/downloads/mappings.txt
http://zfin.org/downloads/genotype_backgrounds.txt
http://zfin.org/downloads/genbank.txt
http://zfin.org/downloads/uniprot.txt
http://zfin.org/downloads/gene.txt
http://zfin.org/downloads/wildtypes.txt
http://zfin.org/downloads/genotype_features.txt
http://zfin.org/downloads/human_orthos.txt
http://zfin.org/zf_info/dbase/cont.html

nlwashington commented 9 years ago

I got a start on this with c046ea1d2d69f27ed2d1a9bad50dd0731b052cc2 and a90871862356a38cf29995caac9e485a191679ae. this takes care of the basic genotype (and its parts) and G2P associations. note that this only takes care of the intrinsic genotypes, not the backgrounds or the extrinsic/effective phenotypes.

nlwashington commented 9 years ago

it might be worth exploring these as well:

http://zfin.org/downloads/features.txt
http://zfin.org/downloads/features-affected-genes.txt
http://zfin.org/downloads/gene_marker_relationship.txt (the "relationships" need some refactoring/mapping to GENO)

and there's also more info on genotype-related things: http://zfin.org/downloads/CRISPR.txt http://zfin.org/downloads/TALEN.txt

also, there's a zfish pub to pmid translation, which is different than the one i currently use: http://zfin.org/downloads/pub_to_pubmed_id_translation.txt

nlwashington commented 9 years ago

@cmungall and @drseb what shall we do about the ZP ids in this pipeline. we are using an old static file with the 6-column mapping. it is about 9 months old now. i did a check, and it seems that we are presently missing ~3K/9K terms actually used in recent data.

i think the proper mapping file is this one: http://compbio.charite.de/hudson/job/hpo.ontology.uberpheno/lastSuccessfulBuild/artifact/data/dataForUberpheno/other/zp.annot_sourceinfo

shall we use this? is it actively updated?

nlwashington commented 9 years ago

or this one? https://phenotype-ontologies.googlecode.com/svn/trunk/src/ontology/zp/zp-mapping.txt

nlwashington commented 9 years ago

well, i guess i have been using the right one all along; however, there are ~3K classes that are missing. how shall i get you that report? added ticket here: https://code.google.com/p/phenotype-ontologies/issues/detail?id=117
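A minimal sketch of how a report of the missing classes could be produced. The column layout is an assumption (ZP id in the first column, the remaining columns forming the lookup key), and `missing_phenotype_keys` is a hypothetical helper, not part of dipper or the ZP tooling:

```python
import csv


def missing_phenotype_keys(mapping_lines, used_keys):
    """Return the keys used in the data that have no row in the mapping.

    mapping_lines: iterable of tab-separated lines from the mapping file.
    used_keys: set of tuples built from the phenotype data, in the same
    column order as the mapping key (assumed layout, see above).
    """
    mapped = set()
    for row in csv.reader(mapping_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Assumption: column 0 is the ZP id, the rest is the key.
        mapped.add(tuple(row[1:]))
    # Sort so that repeated runs give a stable, diffable report.
    return sorted(key for key in used_keys if key not in mapped)
```

The sorted list can be written one key per line and attached to the ticket.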

drseb commented 9 years ago

Hi,

> @cmungall and @drseb what shall we do about the ZP ids in this pipeline. we are using an old static file with the 6-column mapping. it is about 9 months old now. i did a check, and it seems that we are presently missing ~3K/9K terms actually used in recent data.
>
> i think the proper mapping file is this one: http://compbio.charite.de/hudson/job/hpo.ontology.uberpheno/lastSuccessfulBuild/artifact/data/dataForUberpheno/other/zp.annot_sourceinfo
>
> shall we use this? is it actively updated?

I was always wondering which file you might use! Also, it is funny that you are writing right now, because just yesterday I started to completely rework this whole thing. I will send you a separate email today or next week with all the updates. I have to speak with Chris about some details before.

Anyway, none of the files you mentioned is being updated anymore!!

Seb

cmungall commented 9 years ago

Spoke to @drseb this morning, we'll wait til next week to vet the new zp job a bit more and pick up after that

bryanlaraway commented 9 years ago

Here's a list of issues I've run into so far with this resource not included in #93:

- The morpholino data includes the morpholino sequence, but no target sequence. Is it valid to compute the reverse complement of the morpholino sequence as the target sequence, or is the target sequence not the full reverse complement due to incomplete/imperfect binding between the morpholino and the potential target sequence?
- Wildtypes as backgrounds: Should I add code to declare all wild types as backgrounds, or is a background only a background in relation to a mutant phenotype? Seems like the latter, so left that declaration out. It does result in a few of the wild types not being labeled as a background if there aren’t any mutant genotypes with that background (6 aren’t included so far, I believe).
- Pseudogenes: Would these also be included as a class, similar to genes? Do we want to keep them or filter them out?
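To make the first question concrete, this is the computation being asked about, as a minimal sketch. It only handles plain A/C/G/T (no ambiguity codes), and whether its output is a valid target sequence is exactly the open question:

```python
# Translation table for the four unambiguous DNA bases, both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]
```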

nlwashington commented 9 years ago

> Is it valid to compute the reverse complement of the morpholino sequence as the target sequence...?

No. I would leave out sequence information completely for now.

> Should I add code to declare all wild types as backgrounds...?

well, i think that all of the items in http://zfin.org/downloads/genotype_backgrounds.txt are backgrounds; not sure where else they would be coming from?

> Pseudogenes: Would these also be included as a class, similar to genes?

Yes, you would subclass them as pseudogenes SO:0000336.
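A small sketch of that typing decision, using plain tuples instead of a graph library. How a row is recognized as a pseudogene is left to the caller here (the `is_pseudogene` flag is an assumption), and `gene_class_triple` is a hypothetical name:

```python
SO_GENE = "SO:0000704"        # Sequence Ontology: gene
SO_PSEUDOGENE = "SO:0000336"  # Sequence Ontology: pseudogene


def gene_class_triple(gene_id, is_pseudogene):
    """Return one (subject, predicate, object) subclass triple."""
    parent = SO_PSEUDOGENE if is_pseudogene else SO_GENE
    return (gene_id, "rdfs:subClassOf", parent)
```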

bryanlaraway commented 9 years ago

> > Is it valid to compute the reverse complement of the morpholino sequence as the target sequence...?
>
> No. I would leave out sequence information completely for now.

Done!

> > Should I add code to declare all wild types as backgrounds...?
>
> well, i think that all of the items in http://zfin.org/downloads/genotype_backgrounds.txt are backgrounds; not sure where else they would be coming from?

The genotype_backgrounds.txt maps genotype IDs to the genotype background IDs, with the genotype backgrounds also being genotypes (e.g. ZDB-GENO-030619-2). These backgrounds correspond to entries in the http://zfin.org/downloads/wildtypes.txt file. The question was more in regards to whether all of the genotypes in the wild types file should also be marked as backgrounds, even if there are no mutants/variants with that wild type as a background in the genotype_backgrounds.txt file. If my assumption is correct, that a genotype background is only a background when referring to a mutant genotype constructed using that background, then I shouldn't automatically declare all wild types as backgrounds. Conversely, I could go ahead and mark all wild types as backgrounds, they just wouldn't have any genotypes that indicate them as being backgrounds until new mutant genotypes are added that use those wild types as backgrounds.
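The two options can be compared with a short sketch. It assumes the two files have already been parsed into a collection of wild type genotype IDs and a genotype-to-background mapping; `split_wildtypes` is a hypothetical helper:

```python
def split_wildtypes(wildtype_ids, genotype_to_background):
    """Separate wild types that are used as a background from the rest.

    wildtype_ids: genotype IDs from wildtypes.txt.
    genotype_to_background: dict of genotype ID -> background genotype ID,
    built from genotype_backgrounds.txt.
    """
    used = set(genotype_to_background.values())
    as_background = sorted(w for w in wildtype_ids if w in used)
    never_used = sorted(w for w in wildtype_ids if w not in used)
    return as_background, never_used
```

Under the first option only `as_background` gets the background declaration; under the second, both lists do, and `never_used` simply has no genotypes pointing at it yet.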

> > Pseudogenes: Would these also be included as a class, similar to genes?

> Yes, you would subclass them as pseudogenes SO:0000336.

Done!

drseb commented 9 years ago

@nlwashington @cmungall Here is the current status:

1) The code for the ZP generation has been moved from googlecode to github and is now available at https://github.com/sba1/bio-ontology-zp . I have updated some parts of the code:

2) The executable jar for the code mentioned above is generated at http://compbio.charite.de/hudson/job/zp-owl-jar/ The jar that should be used is http://compbio.charite.de/hudson/job/zp-owl-jar/lastSuccessfulBuild/artifact/trunk/zpgen/target/zp-0.1-SNAPSHOT-jar-with-dependencies.jar

3) The zp.owl and related files are created by a different job: http://compbio.charite.de/hudson/job/zp-owl/ (Note that there is a purl /obo/hp/uberpheno/ which, I guess, was used to access zp.owl, because zp.owl was just a side-product of uberpheno.obo-generation)

Please let me know if you identify problems or have suggestions to improve this pipeline.

I will be working on zp.owl in the next days, as I think the current way of representation is not optimal.

bryanlaraway commented 9 years ago

This may overlap a little with #78 , but I'm working on modeling the stages of zebrafish development for this resource. So far I'm creating the following triples:

stage_id is an individual, has_label stage_name, has_type stage_obo_id (ZFS:1234567)
stage_id uberon:existence_starts_at begin_hour_id
stage_id uberon:existence_ends_at end_hour_id

To get the beginning and end hour IDs, I created IDs and added the hour to the graph as an individual of type UO:0000032 (hours), with a label reflecting the hours value and units.
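Roughly, as a sketch with plain tuples rather than the real graph code. The `_:hour-...` identifier scheme and the label format are placeholders for illustration, not the IDs actually generated:

```python
HOURS = "UO:0000032"  # Units Ontology: hour


def hour_individual(value):
    """Return a placeholder ID plus the triples describing one hour value."""
    hour_id = f"_:hour-{value}"  # hypothetical ID scheme
    return hour_id, [
        (hour_id, "rdf:type", HOURS),
        (hour_id, "rdfs:label", f"{value} hours"),
    ]


def stage_triples(stage_id, stage_name, stage_obo_id, begin_hour, end_hour):
    """Build the stage triples listed above as (s, p, o) tuples."""
    begin_id, begin_triples = hour_individual(begin_hour)
    end_id, end_triples = hour_individual(end_hour)
    triples = [
        (stage_id, "rdf:type", stage_obo_id),
        (stage_id, "rdfs:label", stage_name),
        (stage_id, "uberon:existence_starts_at", begin_id),
        (stage_id, "uberon:existence_ends_at", end_id),
    ]
    return triples + begin_triples + end_triples
```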

I believe this is correct, but if anyone wants to chime in ( @mbrush ?) on whether I'm using the correct terms, that would be appreciated.

nlwashington commented 9 years ago

this should already be in the ontology, and shouldn't need to be built in our system. @cmungall can comment.

bryanlaraway commented 9 years ago

Ah, I see, so we only need a link between the ZFIN stage ID and the ZFS term ID, the rest will be taken care of by ZFS.

Likewise for anatomy with ZFA, just need to link the ZFA term ID to the starting and ending ZFIN stage IDs.

cmungall commented 9 years ago

No, everything is taken care of by ZFA/ZFS. Just focus on the phenotypes.

Looks like ZFS doesn't yet have timings in a computable form as we do for the mouse and human in https://github.com/obophenotype/developmental-stage-ontologies - I expect @ANiknejad and @ybradford may add these at some point, but we don't have an immediate use case for this in monarch yet

drseb commented 9 years ago

Please wait before using zp.owl. There is currently a bug in detecting the old zp-ids. (https://github.com/sba1/bio-ontology-zp/issues/7)

drseb commented 9 years ago

I think this bug is fixed now. Please test before using the new zp.owl and zp.annot.

nlwashington commented 9 years ago

@mbrush if there is a large scale deletion, does it make sense to make a variant_locus? there is no locus at all... does its absence count as a variation? i ask because we're trying to figure out if it makes sense to add a variant locus for deficiencies.

for example: Df(Chr23:acvr1ba,sp5l,wnt1,wnt10b)w5 is a deletion of the four genes listed. would we really create a variant locus for each one of those genes? or just add that the alteration is a sequence_variant_instance_of the gene without going through a variant_locus?
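A sketch of the two modeling options for that example. The regular expression only covers labels shaped like the one above, and every predicate name other than `sequence_variant_instance_of` is a placeholder rather than an actual GENO term:

```python
import re

# Matches labels shaped like "Df(Chr23:geneA,geneB)allele" only.
_DEFICIENCY = re.compile(
    r"^Df\((?P<chrom>Chr[^:]+):(?P<genes>[^)]+)\)(?P<allele>\S*)$"
)


def parse_deficiency(label):
    """Return (chromosome, [genes], allele suffix), or None if no match."""
    match = _DEFICIENCY.match(label)
    if match is None:
        return None
    genes = [g.strip() for g in match.group("genes").split(",")]
    return match.group("chrom"), genes, match.group("allele")


def direct_triples(alteration_id, genes):
    """Option without variant loci: link the alteration to each gene."""
    return [(alteration_id, "sequence_variant_instance_of", g) for g in genes]


def via_locus_triples(alteration_id, genes):
    """Option with one variant locus per deleted gene."""
    triples = []
    for gene in genes:
        locus_id = f"_:variant-locus-{gene}"  # hypothetical ID scheme
        triples.append((locus_id, "variant_locus_of", gene))  # placeholder
        triples.append((locus_id, "has_part", alteration_id))  # placeholder
    return triples
```

For the four-gene deletion, the direct option yields four triples and the per-locus option eight plus four extra nodes, which is the cost being weighed.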

nlwashington commented 9 years ago

@cmungall @mellybelly how many identifiers might we want to mint? we need to make all sorts of parts of the zfin genotype partonomy, and each of the pieces could come with an identifier. i can materialize any/all/none of them. do you have a preference? these include:

- effective genotype (including morphant+intrinsic)
- extrinsic genotype (all the morpholinos with their applied concentrations)
- the genomic variation complement
- variant loci
- targeted gene variants (like variant loci, but using morpholinos)

at first i was materializing them all with monarch identifiers, but now i am switching to BNodes. do you have a preference?
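One way to keep both options open is to derive the identifier from the part's components and only vary the prefix. This is a sketch under assumptions: the `MONARCH:` prefix, the hashing scheme, and `make_part_id` are hypothetical, not what the pipeline does:

```python
import hashlib


def make_part_id(kind, components, materialize):
    """Derive a stable ID for a genotype part from its components.

    kind: e.g. "extrinsic_genotype" or "variant_locus".
    components: IDs of the pieces that make up this part.
    materialize: True for a public-looking ID, False for a blank node label.
    """
    # Sort so the same set of components always hashes to the same ID.
    key = "|".join([kind, *sorted(components)])
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    prefix = "MONARCH:" if materialize else "_:"
    return prefix + digest
```

Because the digest is deterministic, switching a part from a BNode to a minted identifier later does not change which node it refers to across rebuilds.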

nlwashington commented 9 years ago

@drseb or @cmungall where do i get the new zp.annot? do i use output from a jenkins run as before? do i need to run the java commands myself?

drseb commented 9 years ago

I suggest using the artifacts from here: http://compbio.charite.de/hudson/job/zp-owl/

Not sure if parallel pipelines exist?!

Sent from my mobile

On 12.05.2015, at 22:42, Nicole Washington wrote:

> @drseb or @cmungall where do i get the new zp.annot? do i use output from a jenkins run as before? do i need to run the java commands myself?

nlwashington commented 9 years ago

@drseb there seem to be missing items in the zp.annot_sourceinfo mappings, reported here https://github.com/sba1/bio-ontology-zp/issues/8

nlwashington commented 9 years ago

~3K phenotype mappings still missing in https://github.com/sba1/bio-ontology-zp/issues/10

nlwashington commented 9 years ago

~250 still missing, reported in https://github.com/sba1/bio-ontology-zp/issues/11.

nlwashington commented 9 years ago

We are down to just a few that are off, with all but one due to the use of "absent" instead of normal/abnormal in the qualifier column.
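A quick way to spot such rows, as a sketch. The position of the qualifier column is an assumption passed in by the caller, and `unexpected_qualifiers` is a hypothetical helper:

```python
from collections import Counter

EXPECTED_QUALIFIERS = {"normal", "abnormal"}


def unexpected_qualifiers(rows, qualifier_index):
    """Count qualifier values other than normal/abnormal.

    rows: parsed rows (lists of strings) from the phenotype data.
    qualifier_index: position of the qualifier column (assumed known).
    """
    return Counter(
        row[qualifier_index]
        for row in rows
        if row[qualifier_index] not in EXPECTED_QUALIFIERS
    )
```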