pantherdb / fullgo_paint_update

Update of Panther and PAINT DBs with monthly GO release data

Output UniProt identifiers in EcoCyc IBA GAF #29

Closed dustine32 closed 5 years ago

dustine32 commented 5 years ago

Background in this ticket https://github.com/geneontology/go-site/issues/969.

We are switching the ID we output in the IBA GAFs to match the one EcoCyc uses in its export GAF.

Example of current IBA:

EcoGene EG10366 galT        GO:0005737  PMID:21873635   IBA EcoGene:EG10366|PANTHER:PTN000235304|SGD:S000000222|UniProtKB:P07902    C   Galactose-1-phosphate uridylyltransferase   UniProtKB:P09148|PTN001378447   protein taxon:83333 20181018    GO_Central

EcoGene:EG10366 is pulled from the Panther long ID ECOLI|EcoGene=EG10366|UniProtKB=P09148. We will change scripts/createGAF.pl to output UniProtKB:P09148 from that same long ID instead.
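
As a rough illustration of the ID switch described above, here is a minimal Python sketch. It is not the Perl code in scripts/createGAF.pl; the function names `parse_long_id` and `gaf_object_columns` are hypothetical, and the sketch assumes a long ID is always a species code followed by `|`-separated `DB=accession` fields, as in the example above.

```python
def parse_long_id(long_id):
    """Split a Panther long ID into (species, {database: accession})."""
    species, *fields = long_id.split("|")
    xrefs = {}
    for field in fields:
        # Each field is assumed to look like "EcoGene=EG10366".
        db, _, accession = field.partition("=")
        xrefs[db] = accession
    return species, xrefs


def gaf_object_columns(long_id, db="UniProtKB"):
    """Return the values for GAF columns 1 (DB) and 2 (DB Object ID)."""
    _, xrefs = parse_long_id(long_id)
    if db not in xrefs:
        raise ValueError(f"no {db} accession in {long_id!r}")
    return db, xrefs[db]


print(gaf_object_columns("ECOLI|EcoGene=EG10366|UniProtKB=P09148"))
```

With the long ID from the example row, this selects UniProtKB and P09148 for the first two GAF columns rather than EcoGene and EG10366.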

dustine32 commented 5 years ago

Tagging @jimhu-tamu to make sure he's aware we're taking this action.

jimhu-tamu commented 5 years ago

Thanks.

An issue that this will make moot: EcoGene is not EcoCyc. There are some EcoCyc IDs that look like EcoGeneIDs but it should not be assumed that they're the same.

dustine32 commented 5 years ago

Ah, OK @jimhu-tamu thanks for pointing this out. These EcoGene identifiers are already mapped in Panther to the UniProtKB ID and, checking some examples in PantherDB as well as the current GO release ecocyc.gaf, the mappings look to be OK. Example from an EcoliWiki annotation:

UniProtKB       P37351  rpiB            GO:0019316      PMID:10559180   IMP             P                       gene    taxon:83333     20081017        EcoliWiki

UniProtKB:P37351 points to EcoGene:EG11827 and this mapping is observed in the Panther long ID for UniProtKB:P37351:

ECOLI|EcoGene=EG11827|UniProtKB=P37351

Are you saying, for a given EcoGene identifier, there may be an EcoCyc identifier that represents the same thing? I can't find any mappings in Panther to EcoCyc IDs.

jimhu-tamu commented 5 years ago

I don't know what the history is of the PANTHER mappings, but EcoCyc has its own set of identifiers for genes, proteins, complexes, and many other object types. EcoCyc gene ids are a mix of formats. It appears that at some time in the past there may have been an attempt to have EcoCyc and EcoGene share unique ids for genes. For example, see the page for gltJ:

https://ecocyc.org/gene?orgid=ECOLI&id=EG12661

The EcoCyc and EcoGene IDs for gltJ are both EG12661

But in other cases they don't match. See frlC:

https://ecocyc.org/gene?orgid=ECOLI&id=G7724

Here, EcoCyc uses G7724 while EcoGene uses EG12910.

I just picked a couple of entries from an old version of the EcoCyc text file we have with a listing of all the fields for genes. I guess you could check whether there are accessions in PANTHER that are labelled EcoGene but start with G instead of EG.
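
A check along those lines could look like the Python sketch below. It is only an illustration: the function name `unexpected_ecogene_accessions` is hypothetical, the `EG` plus digits pattern is an assumption about what a genuine EcoGene accession looks like, and the third sample record is made up for the demonstration (it is not a real Panther entry).

```python
import re

# Assumption: genuine EcoGene accessions are "EG" followed by digits.
ECOGENE_PATTERN = re.compile(r"^EG\d+$")


def unexpected_ecogene_accessions(long_ids):
    """Return accessions labelled EcoGene that do not look like EG#####."""
    hits = []
    for long_id in long_ids:
        # Skip the leading species code; the rest are "DB=accession" fields.
        for field in long_id.split("|")[1:]:
            db, _, accession = field.partition("=")
            if db == "EcoGene" and not ECOGENE_PATTERN.match(accession):
                hits.append(accession)
    return hits


sample = [
    "ECOLI|EcoGene=EG11827|UniProtKB=P37351",
    "ECOLI|EcoGene=EG10366|UniProtKB=P09148",
    # Hypothetical record, constructed only to show a hit.
    "ECOLI|EcoGene=G7724|UniProtKB=X00000",
]
print(unexpected_ecogene_accessions(sample))
```

An empty result over the full set of long IDs would suggest that no EcoCyc-style G accessions are hiding under the EcoGene label.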

The workflow for generation of the .ecocyc gaf on our end has always been in collaboration with EcoCyc. EcoGene has been unreachable for almost a year, fwiw.

cc: @thomaspd

dustine32 commented 5 years ago

Thanks for clarifying! If I understand correctly, this means that using the UniProt ID in the GAFs is more correct than using the EcoGene ID?

And it looks like we don't have any EcoGene:G### IDs in Panther, they're all EcoGene:EG###. We will also look again at possibly mapping to the EcoCyc ID for the next version of Panther. It'll be nice to get @thomaspd 's thoughts on this but I won't be able to effect this change (i.e. upload corrected IBA GAFs) until next week when we get back access to our HPC, so we have some time.

cmungall commented 5 years ago

E vs EG is now moot. @jimhu-tamu has switched the ecocyc releases to go all uniprot. PAINT needs to switch to do the same. Let's do this ASAP!!!

dustine32 commented 5 years ago

@cmungall Merged the change and pushed the new EcoCyc IBA GAF for snapshot to pick up tonight. Only UniProt IDs now:

$ curl -L -s ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_ecocyc.gaf.gz | gzip -dc | grep -v ^UniProtKB
[no output (except GAF headers)]
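
The same check can be sketched in Python for anyone who prefers it to the shell pipeline. This is a minimal sketch with hypothetical function names (`non_uniprot_rows`, `check_gaf_gz`); it assumes header lines start with `!` and that the first tab-separated column is the DB name. Unlike the `grep -v` above, it skips the headers, so an empty result means every annotation row uses UniProtKB.

```python
import gzip


def non_uniprot_rows(lines):
    """Return annotation rows whose first column is not UniProtKB."""
    bad = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and GAF header/comment lines.
        if not line or line.startswith("!"):
            continue
        if line.split("\t", 1)[0] != "UniProtKB":
            bad.append(line)
    return bad


def check_gaf_gz(path):
    """Run the check on a locally downloaded, gzip-compressed GAF file."""
    with gzip.open(path, "rt") as handle:
        return non_uniprot_rows(handle)


# Shortened rows for illustration only; real GAF rows have more columns.
demo = [
    "!gaf-version: 2.1\n",
    "UniProtKB\tP09148\tgalT\n",
    "EcoGene\tEG10366\tgalT\n",
]
print(non_uniprot_rows(demo))
```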

cmungall commented 5 years ago

Thanks so much! Looks like this issue can be closed