monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Investigate dropping g2p data from flybase #716

Closed by kshefchek 5 years ago

kshefchek commented 5 years ago

The dipper build on the FB2018_06 release (ftp://ftp.flybase.net/releases/FB2018_06/) resulted in a large decrease in gene-to-phenotype data:

Drosophila melanogaster: 88953 (-126592)

For the FB2019_01 build, the only phenotype left is FBcv_0001347, which is not a meaningful association:

Drosophila melanogaster: 11374 (-204171)

grep -oP 'FBcv_\d+' flybase.nt | sort -u
FBcv_0001347
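
The grep check above lists which FBcv terms appear but not how often each one does. A minimal Python sketch of the same check with per-term counts follows. The function name `count_fbcv_terms` and the sample triples are made up for illustration and are not taken from dipper or from the actual flybase.nt output.

```python
import re
from collections import Counter

# Same pattern as the grep call: the literal prefix "FBcv_" followed by digits.
FBCV_PATTERN = re.compile(r"FBcv_\d+")


def count_fbcv_terms(lines):
    """Count how often each FBcv term ID occurs across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(FBCV_PATTERN.findall(line))
    return counts


# Hypothetical N-Triples lines, invented for this example only.
sample = [
    "<http://flybase.org/reports/FBal0000001> "
    "<http://purl.obolibrary.org/obo/RO_0002200> "
    "<http://purl.obolibrary.org/obo/FBcv_0001347> .",
    "<http://flybase.org/reports/FBal0000002> "
    "<http://purl.obolibrary.org/obo/RO_0002200> "
    "<http://purl.obolibrary.org/obo/FBcv_0001347> .",
    "<http://flybase.org/reports/FBal0000003> "
    "<http://purl.obolibrary.org/obo/RO_0002200> "
    "<http://purl.obolibrary.org/obo/FBcv_0000000> .",
]

counts = count_fbcv_terms(sample)
for term in sorted(counts):
    print(term, counts[term])
```

To run it against a real build, pass an open file handle instead of the sample list, for example `with open("flybase.nt", encoding="utf-8") as fh: counts = count_fbcv_terms(fh)`. A single key in the result would match the grep output above.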

TomConlin commented 5 years ago

I'm away now but would start with: https://github.com/monarch-initiative/dipper/blob/master/dipper/sources/FlyBase.py#L855

kshefchek commented 5 years ago

I ran it at commit 41a5e2d and the data looked better, so I suspect a translation table or bnode pruning update may have broken something. In any case, this ingest needs a rewrite or overhaul.

kshefchek commented 5 years ago

Fixed with https://github.com/monarch-initiative/dipper/pull/760