pombase / curation

Slim for Histograms #2066

Closed ValWood closed 6 years ago

ValWood commented 6 years ago

I made a new slim compilation file

https://docs.google.com/document/d/1LVu3e8R2GmqS63KhAcjSbkBMngJ-y0Nd8uyBuWw0gFk/edit

I'm not sure that I got everything in it. This is the master list.

It contains:

  1. The pombe slim terms (excluding replaced terms in 3.)
  2. Terms added to cover the pombe unslimmed genes (also pulls in S. cerevisiae unslimmed)
  3. The replacement terms made less specific to cover cerevisiae and human
  4. Terms added to cover S. cerevisiae
  5. Terms added to cover human

We need to test that this slim covers everything expected for

(it probably won't), then any changes made need to be recorded in this ticket.

I need this record so that we use the same term set for every organism, and so that I can improve it easily over time, knowing the reason each term was added.
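One way to keep that record is to store the reason next to each term. A minimal sketch in Python, assuming the five categories listed above; the names `Reason`, `SlimEntry` and `group_by_reason`, and the example terms, are hypothetical placeholders and not entries from the actual slim:

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    # The five categories from the list in this ticket.
    POMBE_SLIM = "pombe slim term"
    POMBE_UNSLIMMED = "added to cover pombe unslimmed"
    LESS_SPECIFIC_REPLACEMENT = "replacement made less specific to cover cerevisiae and human"
    CEREVISIAE = "added to cover S. cerevisiae"
    HUMAN = "added to cover human"


@dataclass(frozen=True)
class SlimEntry:
    term: str       # GO term name or ID
    reason: Reason  # why the term is in the slim
    note: str = ""  # free-text comment, e.g. a ticket reference


def group_by_reason(entries):
    """Map each reason to the sorted list of terms recorded under it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.reason].append(entry.term)
    return {reason: sorted(terms) for reason, terms in grouped.items()}


# Placeholder entries only; the real assignments live in the master list.
entries = [
    SlimEntry("example term A", Reason.POMBE_SLIM),
    SlimEntry("example term B", Reason.HUMAN, note="added in this ticket"),
    SlimEntry("example term C", Reason.HUMAN),
]
grouped = group_by_reason(entries)
print(grouped[Reason.HUMAN])
```

Grouping by reason makes it easy to see which terms exist only to cover one organism when the slim is revised later.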

For example, we will need to break down "multicellular organism process" into biologically meaningful modules without losing anything.

Antonialock commented 6 years ago

this is the annotation set I used

[screenshot, 2018-06-29 08:02:08]

Antonialock commented 6 years ago

I'm not sure why proteins in UniProt get such detailed descriptions but no GO terms....

That's what I meant in a previous email where I said that "I'm sure that some of the unknowns are known, just poorly annotated"

ValWood commented 6 years ago

I know this might seem odd, but when I said use HGNC names, I did mean names (symbols), not IDs. The reason for this is that the names are unique, and these are now used as the primary identifiers in GO, e.g.

A3GALT2 http://amigo.geneontology.org/amigo/gene_product/UniProtKB:U3KPV4

If we use these, troubleshooting will be much easier as we can switch between the different ID set options.

Could you repeat using symbols, and searching against the GOA database? Some of the issues reported above might go away......

And we need to find out exactly what those options are searching behind the scenes. My impression was that they search the same dataset, but allow the use of different ID sets. Anyway, it isn't clear.....

ValWood commented 6 years ago

The pull-down isn't specifying the "annotation set". The annotation set is what is in the GO database. The pull-down is specifying the slim + which ID set you are using. We are overriding the slim with our own slim, so all this is doing (or should be doing) is allowing us to input a specific set of IDs.

So, if everything is working properly, I would expect each pull-down to give the same results.
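A quick way to test that expectation is to diff the slim results from two runs that differ only in the ID set. A minimal sketch in Python, assuming each run has been loaded into a dict of gene to set of slim terms; `diff_slim_results`, the identifiers and the term names are hypothetical placeholders, not real GO Term Mapper output:

```python
def diff_slim_results(by_id, by_symbol, id_to_symbol):
    """Return the genes whose slim terms differ between an ID run and a symbol run."""
    diffs = {}
    for gene_id, terms_by_id in by_id.items():
        symbol = id_to_symbol.get(gene_id)
        terms_by_symbol = by_symbol.get(symbol, set())
        if terms_by_id != terms_by_symbol:
            diffs[gene_id] = {
                "only_with_id": terms_by_id - terms_by_symbol,
                "only_with_symbol": terms_by_symbol - terms_by_id,
            }
    return diffs


# Placeholder data: ID:1 agrees across runs, ID:2 does not.
by_id = {"ID:1": {"term X"}, "ID:2": {"term X"}}
by_symbol = {"SYM1": {"term X"}, "SYM2": {"term X", "term Y"}}
id_to_symbol = {"ID:1": "SYM1", "ID:2": "SYM2"}

diffs = diff_slim_results(by_id, by_symbol, id_to_symbol)
print(diffs)
```

An empty result would support the idea that the pull-downs search the same dataset; any entry is a concrete example to report.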

Antonialock commented 6 years ago

HGNC:14352 https://www.uniprot.org/uniprot/Q96GR2
long-chain fatty acid metabolic process (Source: HGNC)
long-chain fatty-acyl-CoA biosynthetic process

should slim?

yes (at least) to

sulfur compound metabolic process
cofactor metabolic process

ValWood commented 6 years ago

Yeah there is something really weird. It's like when things didn't slim over "transmembrane transport" but they did slim over "transport"

were they definitely, definitely definitely annotated to TM transport? Maybe they just had a "transmembrane transporter" MF term and a "transport" BP..... If not this should be reported as a bug.....

Antonialock commented 6 years ago

ok I'll have a stab at gene symbols

ValWood commented 6 years ago

HGNC:14352 https://www.uniprot.org/uniprot/Q96GR2 long-chain fatty acid metabolic process Source: HGNC long-chain fatty-acyl-CoA biosynthetic process should slim? yes (at least) to

that's odd then. That implies that the different ID sets are searching different datasets. We really need to know what they are searching, so we need to ask Mark about that....

(I was searching with the HGNC:ID and the goa_human generic GO slim)

Also I think quickgo use GOA when they mean GO, I will report that.....

Antonialock commented 6 years ago

were they definitely, definitely definitely annotated to TM transport?

see my comment above

for some reason lots of child terms to "transmembrane transport" don't slim...e.g. L-glutamate transmembrane transport shows up...it shouldn't do! HGNC:16703, HGNC:10945 are annotated to this AND the term has parentage to transmembrane transport... lots of other examples too

Antonialock commented 6 years ago

but they seem to slim over "transport"
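For reference, this is the behaviour I would expect from slimming: an annotation maps to every slim term that is the annotated term itself or one of its ancestors. A minimal sketch in Python, assuming a toy ontology fragment keyed by term name; the `PARENTS` table and the function names are illustrative, not a real GO release or the GO Term Mapper implementation:

```python
# Toy fragment: child term -> parent terms (names only, not a real GO release).
PARENTS = {
    "L-glutamate transmembrane transport": {"transmembrane transport"},
    "transmembrane transport": {"transport"},
    "transport": {"biological_process"},
    "biological_process": set(),
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def slim_hits(annotated_terms, slim_terms, parents=PARENTS):
    """Slim terms that equal, or are an ancestor of, any annotated term."""
    hits = set()
    for term in annotated_terms:
        hits |= ancestors(term, parents) & set(slim_terms)
    return hits


slim = {"transmembrane transport", "transport"}
result = slim_hits({"L-glutamate transmembrane transport"}, slim)
print(sorted(result))  # ['transmembrane transport', 'transport']
```

If a gene annotated to the child term hits "transport" but not "transmembrane transport", the tool is probably using an ontology version without that parent link, or a different annotation set.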

Antonialock commented 6 years ago

Also I think quickgo use GOA when they mean GO, I will report that.....

I don't know if this has anything to do with it, but uniprot annotation is often very different from quickgo annotation

Antonialock commented 6 years ago

e.g.

quickgo shows annotations to spermatid and oocyte development (these annotations are also in the GO GAF)

but these annotations are not shown by uniprot https://www.uniprot.org/uniprot/Q9H6S0

ValWood commented 6 years ago

I don't know if this has anything to do with it, but uniprot annotation is often very different from quickgo annotation

can you give me an example where they are different? Do you mean the annotation is different, or the results (this will be because of the time lag)?

The annotation we are searching against using GO term mapper is what is in the GO database, right now, or should be.

ValWood commented 6 years ago

the transmembrane transporter issue could be a time lag as well. But the parentage was fixed in February, so I would like to look into that too.

Antonialock commented 6 years ago

can you give me an example where they are different? Do you mean the annotation is different, or the results (this will be because of the time lag)?

see my comment just above your comment :-)

Antonialock commented 6 years ago

When I use gene symbols I get:

These 15 identifiers were found to be ambiguous: ATP6AP2 CALM1 EIF3F GABARAP HIST1H2AI HOXD4 IDS KLK9 MED17 MUC21 NSG1 PI4K2B SUPT3H TMSB15B TRAPPC2L

These 2416 identifiers were found to be unannotated: UNANNOTATED USING GENE SYMBOLS AND 102 slim terms.txt

These 523 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim: WITH NON ROOT ANNOTATIONS USING GENE SYMBOLS AND 102 slim terms.txt

These 190 identifiers had no non-root annotations: NO NON-ROOT USING GENE SYMBOLS AND 102 slim terms.txt
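To make the categories above concrete, here is how I read them, as a minimal sketch in Python. The rules are my interpretation of the report headings, not GO Term Mapper's documented logic, and every symbol, product ID and term below is a hypothetical placeholder:

```python
ROOT_TERMS = {"biological_process", "molecular_function", "cellular_component"}


def classify(symbol, symbol_to_products, annotations, covered_terms):
    """Assign one report category to a gene symbol (assumed rules, see above).

    covered_terms: the slim terms plus all of their descendants.
    """
    products = symbol_to_products.get(symbol, set())
    if len(products) > 1:
        return "ambiguous"
    if not products:
        return "unannotated"
    (product,) = products
    terms = annotations.get(product, set())
    if not terms:
        return "unannotated"
    if terms & covered_terms:
        return "slimmed"
    if terms - ROOT_TERMS:
        return "non-root annotations outside slim"
    return "no non-root annotations"


# Placeholder inputs only.
symbol_to_products = {"SYM1": {"P1"}, "SYM2": {"P2a", "P2b"}, "SYM3": {"P3"}, "SYM4": {"P4"}}
annotations = {"P1": {"transport"}, "P3": {"process outside the slim"}, "P4": {"biological_process"}}
covered_terms = {"transport", "transmembrane transport"}

report = {s: classify(s, symbol_to_products, annotations, covered_terms)
          for s in ["SYM1", "SYM2", "SYM3", "SYM4", "SYM5"]}
print(report)
```

The "non-root annotations outside slim" group is the interesting one for this ticket, because those genes point at terms the slim still needs to cover.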

ValWood commented 6 years ago

e.g.

quickgo shows annotations to spermatid and oocyte development (these annotations are also in the GO GAF) but these annotations are not shown by uniprot https://www.uniprot.org/uniprot/Q9H6S0

OK, UniProt is behind QuickGO. The data in QuickGO is in the GO database. GO term mapper should be searching the GO database.

If you can see a term annotated in the GO database http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9H6S0 and it is not being slimmed, this is a bug, and should be reported to GO term mapper. There is likely a problem with updating or datasets.

ValWood commented 6 years ago

This is quite a new annotation. Can you ask when GO term mapper was updated?

GO_REF:0000024 | 20180515

it should have filtered through by now......
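The date check can be done straight from the GAF: column 14 holds the annotation date as YYYYMMDD and column 6 the reference. A minimal sketch in Python; the GAF line is an illustrative placeholder that reuses only the accession, reference and date quoted in this thread, and the tool update date is a hypothetical value that would need to be confirmed:

```python
from datetime import date, datetime


def parse_gaf_line(line):
    """Pick out the GAF columns needed for a date check (0-based indexes)."""
    cols = line.rstrip("\n").split("\t")
    return {
        "db": cols[0],
        "object_id": cols[1],
        "symbol": cols[2],
        "go_id": cols[4],
        "reference": cols[5],
        "date": datetime.strptime(cols[13], "%Y%m%d").date(),
    }


def should_be_visible(annotation, tool_last_updated):
    """True if the annotation predates the tool's last data update."""
    return annotation["date"] <= tool_last_updated


# Placeholder line: symbol, GO ID, evidence and the remaining fields are NOT real values.
gaf_line = "\t".join([
    "UniProtKB", "Q9H6S0", "SYMBOL", "", "GO:XXXXXXX", "GO_REF:0000024",
    "EVIDENCE", "", "P", "", "", "protein", "taxon:9606", "20180515",
    "ASSIGNED_BY", "", "",
])

annotation = parse_gaf_line(gaf_line)
tool_last_updated = date(2018, 6, 29)  # hypothetical; ask the GO term mapper maintainers
print(annotation["date"], should_be_visible(annotation, tool_last_updated))
```

If the annotation predates the last update and still does not slim, that points to a dataset or update problem rather than a time lag.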

ValWood commented 6 years ago

Closing https://github.com/pombase/curation/issues/2194 https://github.com/pombase/curation/issues/2193 anything else here?