monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

add wormbase data #166

Closed nlwashington closed 9 years ago

nlwashington commented 9 years ago

we need to pull in the wormbase geno-pheno data:

ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/

rnai phenotypes:
ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/rnai_phenotypes.WS249.wb
ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/rnai_phenotypes_quick.WS249.wb

other phenotypes: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/phenotype_association.WS249.wb

genes in development: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/development_association.WS249.wb or in anatomical location: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/anatomy_association.WS249.wb

feature locations: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS249.annotations.gff3.gz

xrefs: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS249.xrefs.txt.gz

papers we had to get elsewhere.

gene ids: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS249.geneIDs.txt.gz

really nice (prose) descriptions of genes: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS249.functional_descriptions.txt.gz

orthologs: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS249.orthologs.txt.gz

gene interactions: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS249.gene_interactions.txt.gz
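All of the URLs above follow the same pattern (release directory, species, BioProject, and a `WS249` version tag), so they can be generated rather than hard-coded. A minimal sketch, assuming the layout stays as listed in this ticket; the function name `wormbase_urls` and the dictionary keys are made up for illustration and are not part of dipper:

```python
# Hypothetical helper: rebuilds the FTP URLs listed in this ticket
# from a release tag. Names and keys are illustrative assumptions.
BASE = "ftp://ftp.wormbase.org/pub/wormbase/releases"

# Files under ONTOLOGY/, named <name>.<version>.wb
ONTOLOGY_FILES = (
    "rnai_phenotypes",
    "rnai_phenotypes_quick",
    "phenotype_association",
    "development_association",
    "anatomy_association",
)
# Files directly under species/<species>/<bioproject>/
SPECIES_FILES = ("annotations.gff3.gz", "xrefs.txt.gz")
# Files under species/<species>/<bioproject>/annotation/
ANNOTATION_FILES = (
    "geneIDs.txt.gz",
    "functional_descriptions.txt.gz",
    "orthologs.txt.gz",
    "gene_interactions.txt.gz",
)


def wormbase_urls(version="WS249",
                  release_dir="current-development-release",
                  species="c_elegans",
                  bioproject="PRJNA13758"):
    """Return a dict mapping a short file key to its FTP URL."""
    root = f"{BASE}/{release_dir}"
    species_root = f"{root}/species/{species}/{bioproject}"
    prefix = f"{species}.{bioproject}.{version}"

    urls = {}
    for name in ONTOLOGY_FILES:
        urls[name] = f"{root}/ONTOLOGY/{name}.{version}.wb"
    for suffix in SPECIES_FILES:
        urls[suffix] = f"{species_root}/{prefix}.{suffix}"
    for suffix in ANNOTATION_FILES:
        urls[suffix] = f"{species_root}/annotation/{prefix}.{suffix}"
    return urls


if __name__ == "__main__":
    for key, url in wormbase_urls().items():
        print(key, url, sep="\t")
```

Note that `current-development-release` is a moving target, so the version tag has to match whatever release that directory currently holds.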

nlwashington commented 9 years ago

also, wbpaper xrefs: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/generic.cgi?action=WpaXref

nlwashington commented 9 years ago

note that previously (in disco) we got additional phenotype annotation files from wormmine, which only partially overlapped with the static files above:

ANNOTATION_FILE="http://www.wormbase.org/tools/wormmine/service/query/results?format=tab&start=0&query=%3Cquery+model%3D%22genomic%22+view%3D%22Allele.primaryIdentifier+Allele.symbol+Allele.naturalVariant+Allele.method+Allele.gene.primaryIdentifier+Allele.gene.secondaryIdentifier+Allele.gene.symbol+Allele.gene.chromosome.primaryIdentifier%22+sortOrder%3D%22Allele.primaryIdentifier+ASC%22+%3E%3Cjoin+path%3D%22Allele.gene%22+style%3D%22OUTER%22%2F%3E%3Cjoin+path%3D%22Allele.gene.chromosome%22+style%3D%22OUTER%22%2F%3E%3C%2Fquery%3E"
wget -nv -O wb_allele.txt "$ANNOTATION_FILE"
check_errs $? "wget error"
log "Downloading annotation file"
ANNOTATION_FILE="http://www.wormbase.org/tools/wormmine/service/query/results?format=tab&start=0&query=%3Cquery+model%3D%22genomic%22+view%3D%22BioEntity.primaryIdentifier+BioEntity.symbol+BioEntity.phenotypesObserved.identifier+BioEntity.phenotypesObserved.name%22+sortOrder%3D%22BioEntity.primaryIdentifier+ASC%22+%3E%3C%2Fquery%3E"
wget -nv -O wb_extra_variant_phenotypes.txt "$ANNOTATION_FILE"
check_errs $? "wget error"
log "Downloading annotation file"
ANNOTATION_FILE="http://www.wormbase.org/tools/wormmine/service/query/results?format=tab&start=0&query=%3Cquery+model%3D%22genomic%22+view%3D%22BioEntity.primaryIdentifier+BioEntity.symbol+BioEntity.phenotypesNotObserved.identifier+BioEntity.phenotypesNotObserved.name%22+sortOrder%3D%22BioEntity.primaryIdentifier+ASC%22+%3E%3C%2Fquery%3E"
wget -nv -O wb_phenotypes_not_observed.txt "$ANNOTATION_FILE"
check_errs $? "wget error"

note that there is also a wormmine python api
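For readability: each `ANNOTATION_FILE` value above is a small XML query, URL-encoded into the `query` parameter. The sketch below rebuilds those URLs from a list of view columns, using only the standard library (it does not use the wormmine python api). The function name `wormmine_url` is hypothetical, and whether this endpoint still responds is not something this sketch checks:

```python
from urllib.parse import urlencode

# Endpoint copied from the wget commands in this comment.
WORMMINE_RESULTS = "http://www.wormbase.org/tools/wormmine/service/query/results"


def wormmine_url(view, sort_field, outer_joins=()):
    """Build a tab-format results URL like the ones used with wget above.

    view        -- list of column paths, e.g. "BioEntity.symbol"
    sort_field  -- column path to sort ascending on
    outer_joins -- optional paths to add as OUTER joins
    """
    xml = '<query model="genomic" view="{}" sortOrder="{} ASC" >'.format(
        " ".join(view), sort_field
    )
    for path in outer_joins:
        xml += '<join path="{}" style="OUTER"/>'.format(path)
    xml += "</query>"
    # urlencode uses quote_plus, so spaces become "+" and <, >, =, ", /
    # become percent escapes, matching the hand-encoded URLs above.
    params = urlencode({"format": "tab", "start": 0, "query": xml})
    return WORMMINE_RESULTS + "?" + params


if __name__ == "__main__":
    print(wormmine_url(
        view=[
            "BioEntity.primaryIdentifier",
            "BioEntity.symbol",
            "BioEntity.phenotypesObserved.identifier",
            "BioEntity.phenotypesObserved.name",
        ],
        sort_field="BioEntity.primaryIdentifier",
    ))
```

Called with the view columns shown, this should reproduce the second URL above (the `wb_extra_variant_phenotypes.txt` download); the allele query additionally needs `outer_joins=["Allele.gene", "Allele.gene.chromosome"]`.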

nlwashington commented 9 years ago

also, the KO alleles: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/$BIOPROJECT/annotation/c_elegans*.knockout_consortium_alleles.xml.gz

nlwashington commented 9 years ago

FWIW, we also scrubbed the following from gff3:

for col2 in Allele Mos_insertion_allele ; do grep -P "\t$col2\t" c_elegans.annotations.gff3 ; done > allele_dump.gff3
for col2 in Coding_transcript Genomic_canonical Non_coding_transcript Orfeome Promoterome Pseudogene RNAi_primary RNAi_secondary Reference Transposon Transposon_CDS cDNA_for_RNAi miRanda ncRNA operon polyA_signal_sequence polyA_site snlRNA ; do grep -P "\t$col2\t" c_elegans.annotations.gff3 ; done > genomic_feat_dump.gff3
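The same filtering can be sketched in Python for use inside an ingest script. This is an approximation of the grep loops, not a drop-in replacement: it tests column 2 (the GFF3 source column) exactly, whereas `grep -P "\t$col2\t"` matches the name between tabs anywhere on the line, and it keeps rows in file order instead of grouping them by source. The function name `filter_gff3_by_source` is made up for illustration:

```python
def filter_gff3_by_source(lines, sources):
    """Yield GFF3 rows whose second column (source) is in `sources`.

    lines   -- any iterable of text lines (an open file works)
    sources -- names to keep, e.g. ["Allele", "Mos_insertion_allele"]
    """
    wanted = set(sources)
    for line in lines:
        if line.startswith("#"):
            # skip GFF3 directives and comments
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 1 and cols[1] in wanted:
            yield line


if __name__ == "__main__":
    # File names follow the shell commands above.
    with open("c_elegans.annotations.gff3") as src, \
            open("allele_dump.gff3", "w") as out:
        out.writelines(
            filter_gff3_by_source(src, ["Allele", "Mos_insertion_allele"])
        )
```

A single pass over the file also avoids re-reading the full gff3 once per source name, which the for/grep loop does.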

nlwashington commented 9 years ago

first pass on this is done using just the data from ftp (not from wormmine), but not including the orthology, interaction, or expression data.

nlwashington commented 9 years ago

the data in wormmine appears to be 2+ years old; i am hesitant to add the data from that page.

nlwashington commented 9 years ago

the task of adding interaction data is moved to ticket #214 .

nlwashington commented 9 years ago

the bulk of wormbase is finished, and includes:

what is to be done in the future (when we can deal with it) is: genes in development: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/development_association.WS249.wb or in anatomical location: ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/anatomy_association.WS249.wb

jmcmurry commented 9 years ago

Wooooot!

mellybelly commented 9 years ago

congrats!