monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Odd looking genes in GO Annotation ingest #191

Closed kevinschaper closed 2 years ago

kevinschaper commented 2 years ago

I was just looking at what's happening with mapping in the GO Annotation ingest, and I don’t understand what I’m seeing - it seems like maybe we’re doing something very weird.

UniProtKB       A0A2R8QCY5      si:rp71-68n21.9 part_of GO:0031463      PMID:21873635   IBA     PANTHER:PTN002900184|UniProtKB:Q9P2J3|UniProtKB:Q9P2N7  C       Uncharacterized protein UniProtKB:A0A2R8QCY5|PTN008142927        protein taxon:7955      20211208        GO_Central 

is getting turned into

uuid:ce5c0bdc-a491-11ec-8b95-22a9ff2458b8       NCBIGene:UniRef100_A0A2R8QCY5   biolink:part_of GO:0031463      biolink:FunctionalAssociation|biolink:Association|biolink:MacromolecularMachineToCellularComponentAssociation   BFO:0000050            infores:goa

and I’m not sure what NCBIGene:UniRef100_A0A2R8QCY5 is, but it seems like we’re mapping wrong?

The big mapping file has:

A0A2R8QCY5      A0A2R8QCY5_DANRE                                                UniRef100_A0A2R8QCY5    UniRef90_A0A7J6CP25     UniRef50_Q9P2N7 UPI000D19406A           7955                    23594743        CU151884        -     ENSDARG00000116080       ENSDART00000180086      ENSDARP00000151458

kevinschaper commented 2 years ago

This is the command that makes the small version of the mapping file:

gzcat ./data/goa/uniprot_2_gene.tab.gz | awk 'BEGIN {OFS="\t"} ($7==10090 || $7==10116 || $7==162425 || $7==44689 || $7==6239 || $7==7227 || $7==7955 || $7==9031 || $7==9606 || $7==9615 || $7==9823 || $7==9913) {print $1,$3}' | pigz > ./data/goa/uniprot_2_entrez.tab.gz

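I think I messed up by not specifying the input field separator for awk. With the default whitespace splitting, runs of empty tab-separated columns collapse, so for a row like the one above $3 is the UniRef100 ID rather than a gene ID, and $7 only lines up with the taxon because the earlier columns are empty.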
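
For anyone hitting the same thing, here is a minimal sketch of the difference between the two splitting modes. It uses two synthetic rows: the first mirrors the shape of the mapping row above (no gene ID), and the second is entirely made up to show a fully populated row. It assumes that, once the file is split on tabs, the gene ID is in column 3 and the taxon in column 13; that is my reading of the sample row, not something checked against the full file.

```shell
# Row 1: same shape as the sample row (gene ID and several other columns empty).
# Row 2: hypothetical, made-up values with every column populated.
rows=$(printf '%s\n%s' \
  'A0A2R8QCY5	A0A2R8QCY5_DANRE						UniRef100_A0A2R8QCY5	UniRef90_A0A7J6CP25	UniRef50_Q9P2N7	UPI000D19406A		7955' \
  'X00001	X00001_DANRE	111111	NP_1	222	1ABC	GO:0000001	UniRef100_X00001	UniRef90_X00001	UniRef50_X00001	UPI0000000001	PIR1	7955')

# Default splitting: runs of whitespace collapse, so in row 1 the empty
# columns vanish and $3/$7 land on the UniRef100 ID and the taxon.
# Row 2 is dropped because its $7 is the GO column, not the taxon.
default_split=$(printf '%s\n' "$rows" |
  awk 'BEGIN {OFS="\t"} $7==7955 {print $1,$3}')

# Tab splitting: empty columns are preserved. The taxon is assumed to be
# in $13, and rows with an empty gene ID in $3 are skipped.
tab_split=$(printf '%s\n' "$rows" |
  awk -F'\t' 'BEGIN {OFS="\t"} $13==7955 && $3!="" {print $1,$3}')

printf 'default FS: %s\n' "$default_split"
printf 'tab FS:     %s\n' "$tab_split"
```

With the default separator the only row that survives is the one without a gene ID, which is how a UniRef100 ID ends up behind the NCBIGene prefix. With `-F'\t'` that row is skipped and only the row with a real gene ID is kept.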

kevinschaper commented 2 years ago

Nice to know what was going on, but with #192 we won't actually need this fix!