ValWood closed 6 years ago
this is the annotation set I used
I'm not sure why UniProt proteins get such detailed descriptions but no GO terms....
That's what I meant in a previous email where I said that "I'm sure that some of the unknowns are known, just poorly annotated"
I know this might seem odd, but when I said use HGNC names, I did mean names (symbols), not IDs. The reason for this is that the names are unique, and these are now used as the primary identifiers in GO, e.g.
A3GALT2 http://amigo.geneontology.org/amigo/gene_product/UniProtKB:U3KPV4
If we use these, troubleshooting will be much easier, as we can switch between the different ID set options.
Could you repeat using symbols, searching against the GOA database? Some of the issues reported above might go away......
And we need to find out exactly what those options are searching behind the scenes. My impression was that they search the same dataset but allow the use of different ID sets. Anyway, it isn't clear.....
The pull-down isn't specifying the "annotation set". The annotation set is what is in the GO database. The pull-down is specifying the slim plus which ID set you are using. We are overriding the slim with our own slim, so all this is doing (or should be doing) is allowing us to input a specific set of IDs.
So, if everything is working properly, I would expect each pull-down option to give the same results.
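One way to check that expectation is to run the same gene list through each pull-down option and diff the slimmed results. This is only a sketch: it assumes the results from each option have already been parsed into a dict keyed by a shared identifier (e.g. the gene symbol), and the function name and input shape are hypothetical, not anything GO Term Mapper provides.

```python
def compare_id_set_results(results_by_option):
    """Report genes whose slim terms differ between pull-down options.

    results_by_option: dict mapping an option name to a dict of
    gene -> set of slim term names (hypothetical, pre-parsed input).
    The alphabetically first option is used as the baseline.
    """
    options = sorted(results_by_option)
    baseline = results_by_option[options[0]]
    differences = {}
    for name in options[1:]:
        other = results_by_option[name]
        for gene in sorted(set(baseline) | set(other)):
            in_baseline = baseline.get(gene, set())
            in_other = other.get(gene, set())
            if in_baseline != in_other:
                differences.setdefault(gene, {})[name] = {
                    "only_in_baseline": in_baseline - in_other,
                    "only_in_other": in_other - in_baseline,
                }
    return differences
```

An empty result would mean every option gave the same slim terms for every gene; anything else points at the options searching different datasets.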
HGNC:14352 https://www.uniprot.org/uniprot/Q96GR2
long-chain fatty acid metabolic process (Source: HGNC)
long-chain fatty-acyl-CoA biosynthetic process
should slim? yes (at least) to
sulfur compound metabolic process
cofactor metabolic process
Yeah, there is something really weird. It's like when things didn't slim over "transmembrane transport" but they did slim over "transport".
Were they definitely, definitely, definitely annotated to TM transport? Maybe they just had a "transmembrane transporter" MF term and a "transport" BP term..... If not, this should be reported as a bug.....
ok I'll have a stab at gene symbols
HGNC:14352 https://www.uniprot.org/uniprot/Q96GR2 long-chain fatty acid metabolic process Source: HGNC long-chain fatty-acyl-CoA biosynthetic process should slim? yes (at least) to
that's odd then. That implies that the different ID sets are searching different datasets. We really need to know what they are searching, so we need to ask Mark about that....
(I was searching with the HGNC:ID and the goa_human generic GO slim)
Also, I think QuickGO uses GOA when they mean GO. I will report that.....
were they definitely, definitely definitely annotated to TM transport?
see my comment above
For some reason, lots of child terms of "transmembrane transport" don't slim... e.g. L-glutamate transmembrane transport shows up... it shouldn't! HGNC:16703 and HGNC:10945 are annotated to this AND the term has parentage to transmembrane transport... lots of other examples too
but they seem to slim over "transport"
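To spell out the expected behaviour: a term should slim to every slim term that is the term itself or one of its ancestors. Below is a minimal sketch of that rule, using a toy parent map with direct links that are a simplifying assumption (the real ontology has intermediate terms and other relation types). The function names are hypothetical and this is not GO Term Mapper's code.

```python
def ancestors(term, parents):
    """Return all ancestors of `term` by walking the parent map upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def slim_terms_for(term, parents, slim):
    """Slim terms that `term` maps to: itself or any ancestor in the slim."""
    return ({term} | ancestors(term, parents)) & set(slim)


# Toy parentage (assumed, simplified) for the example discussed above.
TOY_PARENTS = {
    "L-glutamate transmembrane transport": ["transmembrane transport"],
    "transmembrane transport": ["transport"],
    "transport": ["biological_process"],
}
```

With this parentage and a slim containing both "transmembrane transport" and "transport", `slim_terms_for("L-glutamate transmembrane transport", TOY_PARENTS, slim)` returns both slim terms. A tool that returns only "transport" is behaving as if the link to "transmembrane transport" were missing from its copy of the ontology.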
Also I think quickgo use GOA when they mean GO, I will report that.....
I don't know if this has anything to do with it, but UniProt annotation is often very different from QuickGO annotation
e.g.
QuickGO shows annotations to spermatid and oocyte development (these annotations are also in the GO GAF)
but these annotations are not shown by UniProt https://www.uniprot.org/uniprot/Q9H6S0
I don't know if this has anything to do with it, but UniProt annotation is often very different from QuickGO annotation
Can you give me an example where they are different? Do you mean the annotation is different, or the results (this will be because of the time lag)?
The annotation we are searching against using GO Term Mapper is what is in the GO database right now, or should be.
The transmembrane transporter issue could be a time lag as well. But the parentage was fixed in February, so I would like to look into that too.
Can you give me an example where they are different? Do you mean the annotation is different, or the results (this will be because of the time lag)?
see my comment just above your comment :-)
When I use gene symbols I get:
These 15 identifiers were found to be ambiguous: ATP6AP2 CALM1 EIF3F GABARAP HIST1H2AI HOXD4 IDS KLK9 MED17 MUC21 NSG1 PI4K2B SUPT3H TMSB15B TRAPPC2L
These 2416 identifiers were found to be unannotated: UNANNOTATED USING GENE SYMBOLS AND 102 slim terms.txt
These 523 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim: WITH NON ROOT ANNOTATIONS USING GENE SYMBOLS AND 102 slim terms.txt
These 190 identifiers had no non-root annotations: NO NON-ROOT USING GENE SYMBOLS AND 102 slim terms.txt
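For reference, here is one reading of how an input identifier could end up in each of those four lists. It is a sketch based on the wording of the report only; the actual rules GO Term Mapper applies are an assumption here, and the function name, labels, and inputs are hypothetical.

```python
ROOT_TERMS = {"biological_process", "molecular_function", "cellular_component"}


def classify_identifier(products, annotations, slim_hits):
    """Assign one input identifier to a report category.

    products: gene product IDs that the input identifier resolved to.
    annotations: dict of gene product ID -> set of annotated term names.
    slim_hits: function taking a term name and returning the set of
        slim terms it maps to (empty set if none).
    """
    if len(products) > 1:
        return "ambiguous"
    if not products:
        return "unannotated"
    terms = annotations.get(products[0], set())
    if not terms:
        return "unannotated"
    non_root = terms - ROOT_TERMS
    if not non_root:
        return "no non-root annotations"
    if any(slim_hits(term) for term in non_root):
        return "slimmed"
    return "non-root annotations not in the slim"
```

Under this reading, the 523 identifiers in the third list are the ones worth checking by hand, since each has at least one informative annotation that no slim term covers.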
e.g.
quickgo shows annotations to spermatid and oocyte development (these annotations are also in the GO GAF) but these annotations are not shown by uniprot https://www.uniprot.org/uniprot/Q9H6S0
OK, UniProt is behind QuickGO. The data in QuickGO is in the GO database. GO Term Mapper should be searching the GO database.
If you can see a term annotated in the GO database http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9H6S0 and it is not being slimmed, this is a bug and should be reported to GO Term Mapper. There is likely a problem with updating or datasets.
This is quite a new annotation. Can you ask when GO Term Mapper was updated?
GO_REF:0000024 | 20180515
it should have filtered through by now......
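Once we have the tool's last update date, the time-lag question reduces to a date comparison. This is a small sketch that assumes the annotation date is in the YYYYMMDD form shown above; the function name is hypothetical and the update date has to come from whoever maintains GO Term Mapper.

```python
from datetime import date, datetime


def annotation_predates_update(annotation_date, tool_last_updated):
    """True if an annotation dated YYYYMMDD is on or before the tool's update.

    annotation_date: string such as "20180515".
    tool_last_updated: datetime.date of the tool's last data load
        (not known from this thread; supplied by the caller).
    """
    annotated_on = datetime.strptime(annotation_date, "%Y%m%d").date()
    return annotated_on <= tool_last_updated
```

If this returns True and the term still does not slim, a time lag is ruled out and it looks like a bug.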
Closing https://github.com/pombase/curation/issues/2194 and https://github.com/pombase/curation/issues/2193. Anything else here?
I made a new slim compilation file
https://docs.google.com/document/d/1LVu3e8R2GmqS63KhAcjSbkBMngJ-y0Nd8uyBuWw0gFk/edit
I'm not sure that I got everything in it. This is the master list.
These are.
We need to test that this slim covers everything expected for
(it probably won't), then any changes made need to be recorded in this ticket.
I need this record so that we use the same term set for every organism, and so that I can improve it easily over time if I know the reason that each term was added.
For example, we will need to break down "multicellular organism process" into biologically meaningful modules without losing anything.