Closed dustine32 closed 5 years ago
Tagging @jimhu-tamu to make sure he's aware we're taking this action.
Thanks.
An issue that this will make moot: EcoGene is not EcoCyc. There are some EcoCyc IDs that look like EcoGeneIDs but it should not be assumed that they're the same.
Ah, OK @jimhu-tamu thanks for pointing this out. These EcoGene identifiers are already mapped in Panther to the UniProtKB ID and, checking some examples in PantherDB as well as the current GO release ecocyc.gaf, the mappings look to be OK. Example from an EcoliWiki annotation:
UniProtKB P37351 rpiB GO:0019316 PMID:10559180 IMP P gene taxon:83333 20081017 EcoliWiki
UniProtKB:P37351 points to EcoGene:EG11827, and this mapping is observed in the Panther long ID for UniProtKB:P37351:
ECOLI|EcoGene=EG11827|UniProtKB=P37351
Are you saying, for a given EcoGene identifier, there may be an EcoCyc identifier that represents the same thing? I can't find any mappings in Panther to EcoCyc IDs.
I don't know what the history is of the PANTHER mappings, but EcoCyc has its own set of identifiers for genes, proteins, complexes, and many other object types. EcoCyc gene ids are a mix of formats. It appears that at some time in the past there may have been an attempt to have EcoCyc and EcoGene share unique ids for genes. For example, see the page for gltJ:
https://ecocyc.org/gene?orgid=ECOLI&id=EG12661
The EcoCyc and EcoGene IDs for gltJ are both EG12661
But in other cases they don't match. See frlC:
https://ecocyc.org/gene?orgid=ECOLI&id=G7724
Here, EcoCyc uses G7724 while EcoGene uses EG12910.
I just picked a couple of entries from an old version of the EcoCyc text file we have with a listing of all the fields for genes. I guess you could check whether there are accessions in PANTHER that are labelled EcoGene but start with G instead of EG.
The workflow for generation of the .ecocyc gaf on our end has always been in collaboration with EcoCyc. EcoGene has been unreachable for almost a year, fwiw.
cc: @thomaspd
Thanks for clarifying! From what I got out of that, this means that using the UniProt ID in the GAFs is more correct than using the EcoGene ID?
And it looks like we don't have any EcoGene:G### IDs in Panther; they're all EcoGene:EG###. We will also look again at possibly mapping to the EcoCyc ID for the next version of Panther. It'll be nice to get @thomaspd's thoughts on this, but I won't be able to effect this change (i.e. upload corrected IBA GAFs) until next week when we get back access to our HPC, so we have some time.
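For anyone wanting to repeat that check, here is a minimal Python sketch, assuming the long IDs are available as plain strings in the format shown above. The helper names and the second test ID (with a placeholder UniProtKB accession) are made up for illustration; this is not how Panther itself stores or checks the mappings.

```python
import re

# Assumption: an EcoGene accession appears in a Panther long ID as "EcoGene=<acc>".
ECOGENE_FIELD = re.compile(r"EcoGene=([^|]+)")


def ecogene_accession(long_id):
    """Return the EcoGene-labelled accession in a long ID, or None if absent."""
    match = ECOGENE_FIELD.search(long_id)
    return match.group(1) if match else None


def non_eg_accessions(long_ids):
    """Collect EcoGene-labelled accessions that are not 'EG' followed by digits."""
    suspicious = []
    for long_id in long_ids:
        acc = ecogene_accession(long_id)
        if acc is not None and not re.fullmatch(r"EG\d+", acc):
            suspicious.append(acc)
    return suspicious


if __name__ == "__main__":
    ids = [
        "ECOLI|EcoGene=EG11827|UniProtKB=P37351",
        # Hypothetical entry: an EcoCyc-style G id mislabelled as EcoGene.
        "ECOLI|EcoGene=G7724|UniProtKB=P00000",
    ]
    print(non_eg_accessions(ids))
```

An empty result over the full set of long IDs would match what is reported above, i.e. every EcoGene-labelled accession has the EG prefix.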
E vs EG is now moot. @jimhu-tamu has switched the ecocyc releases to go all uniprot. PAINT needs to switch to do the same. Let's do this ASAP!!!
@cmungall Merged the change and pushed the new EcoCyc IBA GAF for snapshot to pick up tonight. Only UniProt IDs now:
$ curl -L -s ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_ecocyc.gaf.gz | gzip -dc | grep -v ^UniProtKB
[no output (except GAF headers)]
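The same check can be sketched in Python against a local copy of the file, in case curl/grep is not handy. This assumes the GAF has already been downloaded and that header lines start with "!", as in the GAF format; the function name and the file path are placeholders.

```python
import gzip


def non_uniprot_lines(path):
    """Return annotation lines in a gzipped GAF that do not start with 'UniProtKB'.

    Header/comment lines (starting with '!') are skipped.
    """
    leftovers = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("!"):
                continue  # GAF header line
            if not line.startswith("UniProtKB"):
                leftovers.append(line.rstrip("\n"))
    return leftovers


if __name__ == "__main__":
    # Placeholder path: point this at a downloaded copy of the IBA GAF.
    remaining = non_uniprot_lines("gene_association.paint_ecocyc.gaf.gz")
    print(len(remaining), "non-UniProtKB annotation lines")
```

An empty list corresponds to the "no output (except GAF headers)" result from the shell command.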
Thanks so much! Looks like this issue can be closed
Background in this ticket https://github.com/geneontology/go-site/issues/969.
We are switching the ID we output in the IBA GAFs to match the one EcoCyc uses in their export GAF.
Example of current IBA: EcoGene:EG10366 is pulled from the Panther long ID ECOLI|EcoGene=EG10366|UniProtKB=P09148. We will change scripts/createGAF.pl to output UniProtKB:P09148 instead.
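To illustrate the intended change, here is a rough Python sketch of picking the UniProtKB field out of a long ID instead of the EcoGene one. It is not the actual logic in scripts/createGAF.pl (which is Perl); the function name and the `prefer` parameter are assumptions for illustration only.

```python
def long_id_to_gaf_id(long_id, prefer="UniProtKB"):
    """Split a Panther long ID and return (db, accession) for the preferred DB.

    Assumes the format 'ORGANISM|DB1=ACC1|DB2=ACC2', as in the example above.
    Raises KeyError if the preferred DB is not present in the long ID.
    """
    # Drop the leading organism code, then map each DB name to its accession.
    fields = dict(part.split("=", 1) for part in long_id.split("|")[1:])
    return prefer, fields[prefer]


if __name__ == "__main__":
    db, acc = long_id_to_gaf_id("ECOLI|EcoGene=EG10366|UniProtKB=P09148")
    print(f"{db}:{acc}")  # UniProtKB:P09148
```

Passing prefer="EcoGene" returns the old EcoGene:EG10366 form, so the switch amounts to changing which field is selected.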