pombase / curation

final things for histogram figure #2016

Closed ValWood closed 6 years ago

ValWood commented 6 years ago

from https://github.com/pombase/curation/issues/1960

I wanted to re-run the pombe data with the GAF from 28 April, ex-PAINT data. I have now done that.

The results are: 707 unknown, 0 unannotated, 4363 known (this is slightly higher than our slim, only by ~10, because we introduced a few more terms to this slim)

@Antonialock could you do the final figure?

The current one looks like this: https://drive.google.com/drive/folders/0B0YtE_BqXTzQU19vaDBabkZIdVE

We need to change this slightly. We need to distinguish

known, unknown, unannotated (we don't want a category "unknowns not covered by the slim")

I want to call the ones which have a process annotation but do not slim "unknown" for this purpose. We have assessed the "unslimmed", and if we thought the annotation was meaningful we added a term to the slim, so these are really "unknown", even if they have a BP annotation.

There might be a small number that could be classified with a slim extension (7 pombe genes have a BP but do not map to this slim). This is good enough....otherwise it will never end.

I checked all of the SGD ones multiple times and there is nothing informative in there. You did the same for human.

ValWood commented 6 years ago

Also, I think this fig would look better with

  1. no horizontal lines
  2. More contrasting colours (but not too primary, I loathe primary colours :)
  3. Unannotated at the top, then unknown, then known

Antonialock commented 6 years ago

This should be quick to do..

..the 3 categories might be problematic though. I don't think I could make a confident distinction between 'unknown' and 'unannotated' for human? (without putting in a year's worth of work..)

ValWood commented 6 years ago

But that's the point of the slim...

unknown is unannotated plus unslimmed

We aren't responsible for checking that the "unannotated" are "unknown" (that is why we specify "unannotated"), but UniProt and MGI prioritise "unannotated" so it is a good proxy....

ValWood commented 6 years ago

i.e. it isn't our job to distinguish. Unannotated is unannotated?

Antonialock commented 6 years ago

So anything that is either unslimmed but annotated to a non-root term, or annotated to the root node = unknown

Would a gene with an IEA to phosphorylation be considered ‘annotated’?

On Sat, 26 May 2018 at 01:29, Val Wood notifications@github.com wrote:

i.e it isn't our job to distinguish. Unannotated is unannotated?

-- Antonia Lock, PhD PomBase Biocurator, http://www.pombase.org Department of Genetics, Evolution and Environment, The Darwin Building, University College London London WC1E 6BT, UK

ValWood commented 6 years ago

yes, IEA to phosphorylation is "unknown" for this purpose.

So, you went through the human unknowns manually, and enriched them to check that you didn't miss anything major when you made the extensions to the slim?

You should have 3 numbers from the slim analysis (we don't want to rerun it as I have all the datasets on the google drive from 28th April)

You got:

  1. A list of unannotated from the output (unannotated)
  2. A list which did not slim at all (ND = unknown 1)
  3. A list that have BP terms, but do not map to the slim (unknown 2). We will add this to the unknowns because we want to class the "response to", "phosphorylation" etc. as "unknown process". We made sure that there are no genes in this list which should be slimmed?
  4. The slimmed set (known).
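A minimal sketch of how the lists above collapse into the three figure categories. This is not the slimming tool's actual code: the function name `classify` and its two inputs (the set of BP terms on a gene, and the set of slim terms it maps to) are hypothetical, and the only outside fact assumed is that GO:0008150 is the biological_process root.

```python
from enum import Enum

ROOT_BP = "GO:0008150"  # biological_process root term


class Category(Enum):
    UNANNOTATED = "unannotated"
    UNKNOWN = "unknown"
    KNOWN = "known"


def classify(bp_terms, slim_hits):
    """Assign one gene to a figure category.

    bp_terms:  set of BP GO IDs annotated to the gene (hypothetical input)
    slim_hits: set of slim terms the gene maps to (hypothetical input)
    """
    if not bp_terms:
        # No BP annotation at all.
        return Category.UNANNOTATED
    if bp_terms == {ROOT_BP}:
        # "unknown 1": only the root (ND) annotation.
        return Category.UNKNOWN
    if not slim_hits:
        # "unknown 2": has BP terms, but none map to the slim.
        return Category.UNKNOWN
    return Category.KNOWN
```

Under these assumptions, a gene whose only BP term is phosphorylation (GO:0016310), with no slim hit, comes out as `Category.UNKNOWN`, which matches the "IEA to phosphorylation" answer earlier in the thread.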

ValWood commented 6 years ago

there is also another ticket about this: https://github.com/pombase/curation/issues/1986

ValWood commented 6 years ago

@Antonialock could you do this first this week?

Antonialock commented 6 years ago

I had to regenerate the numbers. It got really confusing: we had different numbers of genes, and there was a lot of discussion on how to filter transposons etc. I did a small adjustment to the slim and will update all numbers. The current numbers are documented in the Excel file which lives in the unknowns Dropbox folder.

Antonialock commented 6 years ago

known vs unknown graph

Antonialock commented 6 years ago

slim terms used:

GO:0000278 GO:0000493 GO:0000910 GO:0000920 GO:0002376 GO:0005975 GO:0006069 GO:0006089 GO:0006091 GO:0006260 GO:0006281 GO:0006310 GO:0006325 GO:0006351 GO:0006355 GO:0006399 GO:0006412 GO:0006457 GO:0006486 GO:0006501 GO:0006520 GO:0006629 GO:0006766 GO:0006790 GO:0006797 GO:0006810 GO:0006913 GO:0006914 GO:0006915 GO:0006954 GO:0007005 GO:0007010 GO:0007018 GO:0007029 GO:0007030 GO:0007031 GO:0007032 GO:0007040 GO:0007059 GO:0007155 GO:0007163 GO:0008215 GO:0009305 GO:0009436 GO:0009447 GO:0009636 GO:0016071 GO:0016073 GO:0016074 GO:0018342 GO:0018345 GO:0019285 GO:0019413 GO:0022414 GO:0023052 GO:0030091 GO:0030163 GO:0030198 GO:0030261 GO:0031648 GO:0032200 GO:0032418 GO:0032438 GO:0032501 GO:0032502 GO:0032963 GO:0034067 GO:0034079 GO:0034276 GO:0034329 GO:0034389 GO:0042060 GO:0042254 GO:0042423 GO:0043647 GO:0044782 GO:0045329 GO:0045454 GO:0046210 GO:0050808 GO:0051186 GO:0051604 GO:0055065 GO:0055086 GO:0048870 GO:0061024 GO:0065003 GO:0070647 GO:0071218 GO:0071554 GO:0071941 GO:0072659 GO:0097120 GO:0097176 GO:0098542 GO:0098754 GO:0140053 GO:0140056 GO:1901135 GO:1901990 GO:1902224 GO:2000001 GO:0012501 GO:0070265 GO:0007624

Antonialock commented 6 years ago

protein list used matches:

NOT existence:uncertain NOT keyword:"Transposable element [KW-0814]" AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640

With the addition of: P60508 Q9UQF0 Q86TG7 Q9N2K0 Q9N2J8 M5A8F1

= 19704 genes

ValWood commented 6 years ago

did a small adjustment to the slim.

Nooooo...... what did you change? We were sorted with the terms already. I will need to change all of the supp files and slim outputs for pombe and cerevisiae so everything will need re-checking.

S. cerevisiae doesn't have any unannotated?

ValWood commented 6 years ago

(we don't want to rerun it as I have all the datasets on the google drive from 28th April)

ValWood commented 6 years ago

These 1 identifiers were found to be unannotated: YCL054W-A

ValWood commented 6 years ago

These are the 28 April pombe/cerevisiae numbers from the old ticket: these are all 'unknown' not unannotated.

pombe unknowns: 1+16+662 = 679

cerevisiae unknowns: 1+168+765 = 934

I was overlooking the "1" unknown because it was some GO f***-up (they should not really be classed unannotated, but it was as close as I could get it).

I really, really, really don't want to have to do this again because I need to account for the PAINT data etc.....

I think now the dates will not align anymore, so we will just have to report a different date for human. The other possibility is to get the April GAF....

can discuss after the call this morning.

Antonialock commented 6 years ago

I got really confused by your above comments:

  1. “unknown is unannotated plus unslimmed”
  2. “unannotated is unannotated”

So on the bar graph you want 3 categories: Unannotated, Unknown, Known

But if “unknown” is unannotated+unslimmed, then it doesn’t make sense for unannotated to have its own category?? It should all add up to 100%

Antonialock commented 6 years ago

Also I think different dates is ok as long as we report the dates.

I changed the slim because there were a few proteins with annotation that looked good so I added in some specific terms to catch these.

ValWood commented 6 years ago

The slim output gives 3 numbers

707 unknown (unslimmed + have a BP annotation but not to a BP process term; these are all unknown because we checked them all), 0 unannotated, 4363 known

"unannotated" only exist for human (we think these are most likely 'unknown' or they would be annotated, at least by IEA or ISS from mouse; we can't be 100% sure so we just call them "unannotated").

(they are a subset of unknown)
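As a quick check on the "it should all add up to 100%" concern, here is a small sketch that turns the three pombe counts quoted in this thread into bar-segment percentages. It assumes the three categories are drawn as mutually exclusive segments of one stacked bar (so "unannotated" is counted separately, not inside "unknown"); the names `percentages` and `STACK_ORDER_TOP_DOWN` are made up for illustration.

```python
# Counts quoted in this thread for pombe (28 April GAF, ex-PAINT).
counts = {"unannotated": 0, "unknown": 707, "known": 4363}

# Requested stacking order: unannotated at the top, then unknown, then known.
STACK_ORDER_TOP_DOWN = ["unannotated", "unknown", "known"]


def percentages(category_counts):
    """Return each category's share of the total, in percent."""
    total = sum(category_counts.values())
    return {name: 100.0 * n / total for name, n in category_counts.items()}


shares = percentages(counts)
for name in STACK_ORDER_TOP_DOWN:
    print(f"{name:12s} {counts[name]:5d} {shares[name]:6.2f}%")
```

Because every gene falls in exactly one segment, the shares sum to 100% by construction; the same function could be applied to the cerevisiae and human counts once those are settled.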

It's really simple (and that seems to be what you did in the graph above, except that you didn't use my numbers for cerevisiae, and I don't know where you got the "unknown" from (there is only one, and it's an error))

ValWood commented 6 years ago

I think you are misunderstanding what I am adding together?

ValWood commented 6 years ago

You need to tell me specifically which terms you added, in the ticket. I will need to check if this pulls in anything from the pombe and cerevisiae unknowns, change slim total numbers in manuscript, make sure that the slim is correct in the supp directory etc. etc. etc.

Which terms did you add?

Antonialock commented 6 years ago

Ok. Yes that makes sense. I wasn’t sure I understood your original description.

So here when you say ‘unknown’ you don’t mean the graph category unknown, you mean unknown in general? “unknown is unannotated plus unslimmed”

I’m not sure about unannotated for human. Some have really good text descriptions but the GO annotation doesn’t echo the description. Others have loads of GO annotation that looks all over the place (is it meaningful or just a grab bag of shit?). If you give a gene enough GO terms at random, sooner or later one will slim.

ValWood commented 6 years ago

I’m not sure about unannotated for human. Some have really good text descriptions but the GO annotation doesn’t echo the description. Others have loads of GO annotation that looks all over the place (is it meaningful or just a grab bag of shit?). If you give a gene enough GO terms at random, sooner or later one will slim.

this shouldn't be the case. Unannotated is supposed to mean NO GO annotation AT all to the aspect. I just checked this and there is a problem.

Back to the GO helpdesk I'm afraid. I'll open a new ticket to try to describe the issue. I think it's a recurrence of an older issue. Sigh.

ValWood commented 6 years ago

replaced by https://github.com/pombase/curation/issues/2051