monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
56 stars 26 forks source link

Fix issue parsing gene ids from interactors in BIOGRID mi.txt file #925

Closed justaddcoffee closed 4 years ago

justaddcoffee commented 4 years ago

Some interactors (col1 and col2 of BIOGRID-ALL-3.5.184.mitab.txt) are now biogrid CURIEs (e.g. biogrid:1221377) instead of gene IDs (e.g. entrez gene/locuslink:3638344):

(venv) ~/PycharmProjects/dipper *fix_biogrid_id_parsing_issue $ cut -f2 raw/biogrid/BIOGRID-ALL-3.5.184.mitab.txt | grep -v "ID" | grep -v locuslink | head -5
biogrid:1220045
biogrid:1226198
biogrid:1222768
biogrid:1222768
biogrid:1228539

This PR changes the code to accept these biogrid CURIEs as gene IDs (and also breaks out the logic for this decision into a method, and tests the method in tests/test_biogrid.py

TomConlin commented 4 years ago

looks very good.

aside: If they internally use the lowercase curie-prefix then that is what we would expect to find in the wild then at some point we should consider doing what they do too.

TomConlin commented 4 years ago

notes: It looks as this ingest has not seen many of the standard refactors most other have So thanks for fixing the extra stuff you spotted on the way

around line 233; the global logging instance for this ingest is 'LOG.' for log messages, pep depreciates .format() strings in flavor of printf % syntax.

Does running make as we do to test, regenerate Julie's alternative curie-prefix yaml file? If so please include it in this PR.

On the import typing... there is no point. shiny always wins.

justaddcoffee commented 4 years ago

notes: It looks as this ingest has not seen many of the standard refactors most other have So thanks for fixing the extra stuff you spotted on the way

around line 233; the global logging instance for this ingest is 'LOG.' for log messages, pep depreciates .format() strings in flavor of printf % syntax.

Does running make as we do to test, regenerate Julie's alternative curie-prefix yaml file? If so please include it in this PR.

On the import typing... there is no point. shiny always wins.

Does running make as we do to test, regenerate Julie's alternative curie-prefix yaml file? If so please include it in this PR.

All done, except the make/curie-prefix yaml thing - not sure what this, or how this relates to the BioGrid fix?

TomConlin commented 4 years ago

bottom of the Makefile https://github.com/monarch-initiative/dipper/blob/master/Makefile#L133

Part of Julies identifier work includes collecting various alternative curie-prefixes so when a non canonical one is used (say in a search)
it can be translated to a canonical one.

She began with just a manual list but we augment it with any we find in the local translation tables.

You added a local translation for an alternative curie-prefix you found in the wild "UNIPROT_PRO": "UniprotKB" and it should be automatically added to the list by running make then committing any regenerated files that show up under translationtable/generated/ That alternative might already be there from some previous contact in which case nothing would happen.

just checked : https://github.com/monarch-initiative/dipper/blob/master/translationtable/generated/prefix_equivalents.yaml

The alternative you found should trigger an update

justaddcoffee commented 4 years ago

That Makefile no longer runs on my machine, but when I manually run this command: https://github.com/monarch-initiative/dipper/blob/master/Makefile#L133 no files change.

This command does actually change a file (prefix_equivalents.yaml): https://github.com/monarch-initiative/dipper/blob/master/Makefile#L140 so I've committed this change.

TomConlin commented 4 years ago

I would suspect it never really did, but it was never intended to and no one has the resources to address it, so the best I can do is assure you that if you ran on any other dipper developer machine, any monarch server, our Travis or Jenkins instances, it would/should.

for example when on tartini if I type make -n

the botton of the result is

awk -F '"' '/^"[^"]+": "[^":]+".*/\
    {if($2 != $4 && ! match($2, /[0-9]+/))\
        print "\"" $4 "\"\t\"" $2 "\""}' \
            translationtable/wormbase.yaml translationtable/mpd.yaml translationtable/monochrom.yaml translationtable/animalqtldb.yaml translationtable/flybase.yaml translationtable/string.yaml translationtable/orphanet.yaml translationtable/mmrrc.yaml translationtable/rgd.yaml translationtable/ensembl.yaml translationtable/bgee.yaml translationtable/zfinslim.yaml translationtable/mgislim.yaml translationtable/hgnc.yaml translationtable/impc.yaml translationtable/go.yaml translationtable/ebi_g2p.yaml translationtable/kegg.yaml translationtable/coriell.yaml translationtable/panther.yaml translationtable/clinvar.yaml translationtable/omia.yaml translationtable/gwascatalog.yaml translationtable/eom.yaml translationtable/mychem.yaml translationtable/hpoa.yaml translationtable/omim.yaml translationtable/mmrc.yaml translationtable/ctd.yaml translationtable/reactome.yaml translationtable/mgi.yaml translationtable/zfin.yaml translationtable/udp.yaml translationtable/genereviews.yaml translationtable/monarch.yaml translationtable/biogrid.yaml translationtable/alternate_curie_prefix.yaml translationtable/ucscbands.yaml translationtable/ncbigene.yaml translationtable/sgd.yaml translationtable/decipher.yaml | LC_ALL=C sort -u > /tmp/local_inverse.tab
echo "---\n# prefix_equivalents.yaml" > translationtable/generated/prefix_equivalents.yaml;
LC_ALL=C join translationtable/generated/curiemap_prefix.txt /tmp/local_inverse.tab|\
awk '{v=$1;$1="";print substr($0,2) ": " v}' | sort -u >> translationtable/generated/prefix_equivalents.yaml

which at least appears would generate translationtable/generated/prefix_equivalents.yaml

Thanks for making the effort to generate and commit the file, it is greatly appreciated.

... especially since it brought in more than you changed ... which means something is amiss ... which means I better check it

So thanks again.

ahh, not a make/script problem. not missing new mappings, just existing mappings rearranged.

They will get corrected next time anyone else runs it.

TomConlin commented 4 years ago

However the alternative mapping is still not there.

The reason is the discrepancy between what is in curie_map.yaml https://github.com/monarch-initiative/dipper/blob/master/dipper/curie_map.yaml#L231 and what has been added.

justaddcoffee commented 4 years ago

Okay, not sure I can help any more with this Biogrid fix. I'm going to close this PR, open it on Monarch's dipper repo, and you can push whatever fixes are necessary to the YAML files. Does that suit you?

TomConlin commented 4 years ago

addressing uniprotkb curie-prefix is minimal. leaving regenerating the prefix file is probably not worth your effort.

justaddcoffee commented 4 years ago

addressing uniprotkb curie-prefix is minimal. leaving regenerating the prefix file is probably not worth your effort.

Okay, push whatever changes you think are necessary to this or the other PR (#927)

justaddcoffee commented 4 years ago

Handing this off to you @TomConlin