pombase / curation

PomBase curation

Figure 1C? [2C?] #1831

Closed ValWood closed 6 years ago

ValWood commented 6 years ago

Mock up. I will update the pombe and the cerevisiae data. Antonia will prepare human data and the figure

slimming summary

ValWood commented 6 years ago

@Antonialock this is what I did, you will need to follow similar for human.

~I started with the slim set here: https://curation.pombase.org/pombase-trac/wiki/GOslims see the lists 1) standard slim with swaps and 2) added for greater coverage for "unknowns" project~ See the list below

cerevisiae: slimming left a set of unslimmed but annotated genes (242). I then checked whether we had missed anything well characterised in this list.

I figured that, largely, if the SGD curators had annotated the BP root node with ND, any mappings from other sources would be to fairly high-level terms.

I got the list of genes with a manual ND annotation to the BP root node and ran it through the enrichment tool to double check.

I subtracted it from the 'unslimmed' list, which gave me a smaller list to evaluate (119). I ran enrichment on this list (P=1 to see all annotated terms), then scanned the results to identify any terms that were not i) function in process, ii) response to…, iii) high level (cellular process etc.)

~These terms have fairly specific annotation so I will add to the list GO:0072659 protein localization to plasma membrane GO:0019413 acetate biosynthetic process GO:0009436 glyoxylate catabolic process GO:0034079 butanediol biosynthetic process (energy generation) GO:1901426 response to furfural (these are really detoxification) GO:0018890 cyanamide metabolic process (really cellular detoxification) GO:0006276 plasmid maintenance GO:2000001 regulation of DNA damage checkpoint GO:0009636 response to toxic substance YNR064C, YMR074C, YOL052C-A, YHL010C (really detoxification) GO:0071218 cellular response to misfolded protein~ double checked, all these are in
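The subtraction step described above can be sketched as a plain set difference. This is only an illustration: the gene identifiers are hypothetical placeholders, and `remaining_to_check` is a made-up helper name, not part of any slimming tool.

```python
def remaining_to_check(unslimmed, nd_root_only):
    """Return the unslimmed genes that still need a manual look.

    unslimmed:    genes that have annotation but did not map to the slim
    nd_root_only: genes whose BP annotation is the root node with ND evidence
    """
    return sorted(set(unslimmed) - set(nd_root_only))


# Hypothetical identifiers, for illustration only.
unslimmed = ["geneA", "geneB", "geneC", "geneD"]
nd_root_only = ["geneB", "geneD"]

to_check = remaining_to_check(unslimmed, nd_root_only)
print(to_check)
```

The shorter list is what would then go through the enrichment tool at P=1.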

ValWood commented 6 years ago

SGD: total 5915, slimmed 4900 (~83%), unslimmed 794+221 (1015)
PomBase: total 5070, slimmed 4336 (~85.5%), unslimmed 734+10 (744)
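A quick way to recheck these percentages is sketched below. It reads the unslimmed figures as the sums that give 1015 and 744, and it also reports whether slimmed plus unslimmed adds up to the stated total, which is worth confirming before the numbers go into a figure. The dictionary layout is an assumption made for the example.

```python
def coverage(total, slimmed):
    """Percentage of genes that map to at least one slim term."""
    return 100.0 * slimmed / total


# Counts copied from the comment above.
counts = {
    "SGD": {"total": 5915, "slimmed": 4900, "unslimmed": 794 + 221},
    "PomBase": {"total": 5070, "slimmed": 4336, "unslimmed": 734 + 10},
}

report = {}
for name, c in counts.items():
    pct = coverage(c["total"], c["slimmed"])
    # A non-zero gap means slimmed + unslimmed does not equal the total.
    gap = c["total"] - (c["slimmed"] + c["unslimmed"])
    report[name] = (round(pct, 1), gap)
    print(f"{name}: {pct:.1f}% slimmed, total - (slimmed + unslimmed) = {gap}")
```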

Note, it is slightly different from https://www.pombase.org/browse-curation/fission-yeast-go-slim-terms because the terms are slightly more general. That page gives:

Protein coding genes not covered by the slim (750 in total):
Gene products with biological process annotation, but not in any of the categories above: 27
Gene products with no biological process annotation: 723

ValWood commented 6 years ago

I will rerun pombe and cerevisiae tomorrow. Antonia can you

ValWood commented 6 years ago

@Antonialock you mentioned that I hadn't done the instructions, but they are above? Can you do the bit for human (with the additional terms we discussed; let me know if anything isn't clear)? I'm rechecking pombe and cerevisiae now...

ValWood commented 6 years ago

This is the current list from https://curation.pombase.org/pombase-trac/wiki/GOslims after discounting all of the uninformative terms, and checking that nothing known is missed by enrichment.

GO:0140053 GO:0000278 GO:0006810 GO:0007010 GO:0006412 GO:0007031 GO:0030437 GO:0023052 GO:0006520 GO:0032200 GO:0016074 GO:0005975 GO:0070647 GO:0007059 GO:0030163 GO:0055086 GO:0006351 GO:0006260 GO:0071554 GO:1901990 GO:0140013 GO:0006461 GO:0071941 GO:0006355 GO:0006399 GO:0042254 GO:0006457 GO:0006486 GO:0016071 GO:0007005 GO:0006310 GO:1901135 GO:0000747 GO:0006913 GO:0006091 GO:0006914 GO:0098754 GO:0016192 GO:0051186 GO:0007163 GO:0061024 GO:0006629 GO:0006281 GO:0000910 GO:0051604 GO:0007155 GO:0055085 GO:0006766 GO:0006325 GO:0016073 GO:0006915 GO:0006790 GO:0055065 GO:0140056 GO:0000920 GO:0000493 GO:0070941 GO:0007124 GO:0009305 GO:0018342 GO:0000128 GO:0034389 GO:0034276 GO:0007032 GO:0030091 GO:0018345 GO:0006797 GO:0006089 GO:0072659
GO:0019413 GO:0009436 GO:0034079 GO:1901426 GO:0018890 GO:0006276 GO:2000001 GO:0009636 GO:2000001 GO:0071218 GO:0046210
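Since this list is pasted between tools by hand, a small check for repeated or malformed IDs may help before it is reused. The sketch below runs on a short excerpt from the end of the list; `check_slim_ids` is a hypothetical helper name.

```python
import re
from collections import Counter

# GO identifiers are "GO:" followed by seven digits.
GO_ID = re.compile(r"^GO:\d{7}$")


def check_slim_ids(text):
    """Split a whitespace-separated list of GO IDs and report problems."""
    ids = text.split()
    malformed = [i for i in ids if not GO_ID.match(i)]
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    unique = list(dict.fromkeys(ids))  # de-duplicate, keep first-seen order
    return unique, duplicates, malformed


# Excerpt from the end of the list above.
excerpt = "GO:0006276 GO:2000001 GO:0009636 GO:2000001 GO:0071218 GO:0046210"
unique, duplicates, malformed = check_slim_ids(excerpt)
print(duplicates, malformed, len(unique))
```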

Antonialock commented 6 years ago

What slimming tool are you using? I keep getting an error message from http://go.princeton.edu/cgi-bin/GOTermMapper

Maybe I'm doing something wrong? I input the primary gene names for protein coding genes reported by HGNC (downloaded here: https://www.genenames.org/cgi-bin/statistics )

I enter the slim terms (above + multicellular-specific terms)

I use the GOA_human_GAF downloaded from here: http://geneontology.org/page/download-go-annotations

ValWood commented 6 years ago

I don't think this will work because the file has UniProt IDs... it also has 29082 lines, which is quite a lot more than the number of human genes (that's why you are using HGNC IDs; they should be a 1:1 list).

Therefore you will need to select the data option for goa_human_hgnc (this will recognise the HGNC IDs). This will seem like you are using the hgnc slim, but you aren't, because you override that in the advanced options. It's very confusing....

This will then use the current contents of the GO database mapped to HGNC ID set....
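To see the mismatch described above for yourself, one option is to count annotation lines, distinct object IDs and source databases in the annotation file. The sketch assumes the standard tab-separated GAF layout (column 1 is the database, column 2 the object ID, comment lines start with `!`); the sample rows and the `gaf_row` helper are hypothetical.

```python
from collections import Counter


def summarise_gaf(lines):
    """Return (annotation line count, distinct object count, lines per DB)."""
    n_lines = 0
    objects = set()
    dbs = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        n_lines += 1
        dbs[cols[0]] += 1                 # column 1: DB, e.g. UniProtKB
        objects.add((cols[0], cols[1]))   # column 2: DB object ID
    return n_lines, len(objects), dict(dbs)


def gaf_row(db, obj_id, go_id, evidence, aspect="P"):
    """Build a hypothetical 17-column GAF line with only key fields filled."""
    cols = [""] * 17
    cols[0], cols[1], cols[4], cols[6], cols[8] = db, obj_id, go_id, evidence, aspect
    return "\t".join(cols) + "\n"


# Hypothetical sample, not real annotation data.
sample = [
    "!gaf-version: 2.1\n",
    gaf_row("UniProtKB", "HYPO1", "GO:0006412", "IDA"),
    gaf_row("UniProtKB", "HYPO1", "GO:0006810", "IEA"),
    gaf_row("UniProtKB", "HYPO2", "GO:0006810", "IBA"),
]
summary = summarise_gaf(sample)
print(summary)
```

For a real file, pass an open file handle instead of `sample`.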

Antonialock commented 6 years ago

It looks like it ignores IEA and IBA annotations e.g. this gene doesn't slim https://www.ncbi.nlm.nih.gov/gene/127550

is that as expected?

ValWood commented 6 years ago

you can select the evidence codes included, are they all selected? (it includes IEA when I use it?)

ValWood commented 6 years ago

I can't see that human gene in the GO database...that's probably why. I didn't say this would be straightforward... you need to contact GO helpdesk for that one...

ValWood commented 6 years ago

actually you can't select evidence for the slimmer, I'm thinking of the enrichment tool.

It's probably because the slimmer tool isn't aware of IBA? do you have an example of a missing IEA (this gene only seems to have IBA).

If so, you will need to mail gotools and tell them to include IBA and any other codes....
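One way to find examples for such an email is to list which evidence codes each gene product carries for biological process, then pick out those supported only by IBA or only by IEA. The sketch assumes the standard GAF columns (2 is the object ID, 7 the evidence code, 9 the aspect); the rows and helper names are hypothetical.

```python
from collections import defaultdict


def evidence_by_object(lines, aspect="P"):
    """Map each object ID to the set of evidence codes seen for one aspect."""
    codes = defaultdict(set)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if cols[8] == aspect:            # column 9: aspect (P = process)
            codes[cols[1]].add(cols[6])  # column 7: evidence code
    return codes


def only_with(codes, allowed):
    """Objects whose evidence codes all fall inside `allowed`."""
    allowed = set(allowed)
    return sorted(obj for obj, ev in codes.items() if ev <= allowed)


def gaf_row(obj_id, go_id, evidence, aspect="P"):
    """Build a hypothetical 17-column GAF line with only key fields filled."""
    cols = [""] * 17
    cols[0], cols[1], cols[4], cols[6], cols[8] = "UniProtKB", obj_id, go_id, evidence, aspect
    return "\t".join(cols) + "\n"


# Hypothetical sample, not real annotation data.
sample = [
    gaf_row("HYPO1", "GO:0006412", "IDA"),
    gaf_row("HYPO1", "GO:0006810", "IBA"),
    gaf_row("HYPO2", "GO:0005975", "IBA"),
    gaf_row("HYPO3", "GO:0005975", "IEA"),
]
codes = evidence_by_object(sample)
iba_only = only_with(codes, {"IBA"})
iea_only = only_with(codes, {"IEA"})
print(iba_only, iea_only)
```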

Antonialock commented 6 years ago

It has:
glycosphingolipid biosynthetic process | IEA
carbohydrate metabolic process | IEA
?

Antonialock commented 6 years ago

at least that's what's shown on the entrez gene page screen shot 2018-02-16 at 13 25 43

Antonialock commented 6 years ago

oh I see, in amigo it only has IBA. So why does entrez show IEAs? argh, so confusing

Antonialock commented 6 years ago

although... https://www.ebi.ac.uk/QuickGO/annotations?geneProductId=U3KPV4

ValWood commented 6 years ago

Mail gotools and check which evidence codes they include (they will probably get back to you today). Mail GO and ask why human IEAs are not in the GO database. Welcome to my world....

ValWood commented 6 years ago

@Antonialock an alternative is to try the QuickGO slimmer. It will work with the ID set (the reason I never use it for pombe is that we don't use UniProt IDs for GO). It will only be possible if it provides a list of "unslimmed genes".

I'm pretty sure from memory that it does because Jane and I used this when we were building the generic slim.

Antonialock commented 6 years ago

Well unfortunately the QuickGO slimming tool is broken. I sent them a message

"Hi. I'm trying to use the slimming tool but am having multiple problems https://www.ebi.ac.uk/QuickGO/slimming

I uploaded my own set of BP terms to use as the slimming set. I then wanted to slim using my own list of uniprot IDs, but got an error message saying I need to limit my own set of gene IDs to 500 I then tried to filter on the "human reference set" but got this error message: "failed to fetch REST response due to: org.springframework.web.client.HttpClientErrorException: 400 bad request""

Antonialock commented 6 years ago

Note to self

The number of human genes that we want to include is 19674

The list can be retrieved using this search: NOT existence:uncertain AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640

Removing the "existence:uncertain" drops the number of genes down from 20245

Antonialock commented 6 years ago

Ah I tried again and it "worked" using the QuickGO filter for human gene products.

Unfortunately it looks like rubbish.... In the summary of results "Your current result set contains 20,794 annotations to 1,185 distinct gene products."

I think I worked out it is because it doesn't include "regulates"

This is the list I want (includes is_a,part_of,occurs_in,regulates) but I am getting an error message: https://www.ebi.ac.uk/QuickGO/annotations?goUsage=slim&goUsageRelationships=is_a,part_of,occurs_in,regulates&goId=GO:0071941,GO:0019413,GO:0140053,GO:0140056,GO:0140013,GO:0032200,GO:0032501,GO:0032502,GO:0009636,GO:0022414,GO:0009436,GO:0007163,GO:0007155,GO:0000493,GO:0030163,GO:0042254,GO:0070647,GO:0055086,GO:0055085,GO:0055065,GO:0006091,GO:0006089,GO:0030091,GO:0005975,GO:0006276,GO:0006260,GO:0006281,GO:0006355,GO:0006351,GO:0018345,GO:0018342,GO:0006310,GO:0006325,GO:0007010,GO:1901135,GO:2000001,GO:0046210,GO:0034276,GO:0000920,GO:0000910,GO:0000278,GO:0071218,GO:0034079,GO:0034389,GO:0009305,GO:0007031,GO:0007032,GO:0007059,GO:0007005,GO:0051186,GO:0006915,GO:0006914,GO:0006913,GO:0006520,GO:0070941,GO:0098754,GO:0006955,GO:0006810,GO:0016192,GO:0006399,GO:0006486,GO:0006457,GO:0006412,GO:0016074,GO:0016073,GO:0016071,GO:0065003,GO:0006797,GO:0061024,GO:0051604,GO:0018890,GO:1901426,GO:0023052,GO:0072659,GO:1901990,GO:0006790,GO:0006766,GO:0006629&targetSet=ReferenceGenome
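Assembling a URL of this shape by hand is error-prone with this many IDs, so it may be easier to build it from a list. The sketch below only reproduces the parameter names visible in the URL above (`goUsage`, `goUsageRelationships`, `goId`, `targetSet`); it does not call the service, and nothing here shows that QuickGO accepts the request. `slim_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE = "https://www.ebi.ac.uk/QuickGO/annotations"


def slim_url(go_ids,
             relationships=("is_a", "part_of", "occurs_in", "regulates"),
             target_set="ReferenceGenome"):
    """Build a slim query URL with the same parameters as the one above."""
    params = {
        "goUsage": "slim",
        "goUsageRelationships": ",".join(relationships),
        "goId": ",".join(go_ids),
        "targetSet": target_set,
    }
    # Keep ':' and ',' unescaped so the result matches the hand-written form.
    return BASE + "?" + urlencode(params, safe=":,")


# First two IDs from the list above, for illustration.
url = slim_url(["GO:0071941", "GO:0019413"])
print(url)
```

Dropping `"regulates"` from `relationships` gives the narrower query for comparison.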

Antonialock commented 6 years ago

!#%!##

ValWood commented 6 years ago

I know, I know, it's all crazy... hopefully term mapper will be up to date soon.

Antonialock commented 6 years ago

So should "response to furfural", "plasmid maintenance" etc. be discarded? They are included in your list above, where you say you have discarded uninformative terms:

> This is the current list from https://curation.pombase.org/pombase-trac/wiki/GOslims after discounting all of the uninformative terms, and checking that nothing known is missed by enrichment.

GO:0140053 GO:0000278 GO:0006810 GO:0007010 GO:0006412 GO:0007031 GO:0030437 GO:0023052 GO:0006520 GO:0032200 GO:0016074 GO:0005975 GO:0070647 GO:0007059 GO:0030163 GO:0055086 GO:0006351 GO:0006260 GO:0071554 GO:1901990 GO:0140013 GO:0006461 GO:0071941 GO:0006355 GO:0006399 GO:0042254 GO:0006457 GO:0006486 GO:0016071 GO:0007005 GO:0006310 GO:1901135 GO:0000747 GO:0006913 GO:0006091 GO:0006914 GO:0098754 GO:0016192 GO:0051186 GO:0007163 GO:0061024 GO:0006629 GO:0006281 GO:0000910 GO:0051604 GO:0007155 GO:0055085 GO:0006766 GO:0006325 GO:0016073 GO:0006915 GO:0006790 GO:0055065 GO:0140056 GO:0000920 GO:0000493 GO:0070941 GO:0007124 GO:0009305 GO:0018342 GO:0000128 GO:0034389 GO:0034276 GO:0007032 GO:0030091 GO:0018345 GO:0006797 GO:0006089 GO:0072659 GO:0019413 GO:0009436 GO:0034079 GO:1901426 GO:0018890 GO:0006276 GO:2000001 GO:0009636 GO:2000001 GO:0071218 GO:0046210

Antonialock commented 6 years ago

do you want to plug the human numbers into the same graph spreadsheet that you already started? (so you get the same colours etc) here are the numbers:

total unannotated | 3606
unannotated | 2771
annotations to root term | 616
no non-root | 219

number of genes | 19730
% unknown | 18.27673594
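The arithmetic behind these figures can be rechecked with a few lines. The category labels are copied as written above; the variable names are made up for the example.

```python
# Counts copied from the breakdown above.
breakdown = {
    "unannotated": 2771,
    "annotations to root term": 616,
    "no non-root": 219,
}
total_genes = 19730

# The three categories should add up to the "total unannotated" figure.
total_unknown = sum(breakdown.values())
pct_unknown = 100.0 * total_unknown / total_genes

print(total_unknown, f"{pct_unknown:.2f}%")
```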

Antonialock commented 6 years ago

and annotations

GO term | GO term usage in gene list
multicellular organismal process ( GO:0032501 ) 7161
signaling ( GO:0023052 ) 6427
developmental process ( GO:0032502 ) 5983
regulation of transcription, DNA-templated ( GO:0006355 ) 3579
transport ( GO:0006810 ) 3080
immune system process ( GO:0002376 ) 2920
transcription, DNA-templated ( GO:0006351 ) 2579
vesicle-mediated transport ( GO:0016192 ) 1963
apoptotic process ( GO:0006915 ) 1867
protein-containing complex assembly ( GO:0065003 ) 1660
transmembrane transport ( GO:0055085 ) 1505
lipid metabolic process ( GO:0006629 ) 1402
cell adhesion ( GO:0007155 ) 1322
reproductive process ( GO:0022414 ) 1320
cytoskeleton organization ( GO:0007010 ) 1197
carbohydrate derivative metabolic process ( GO:1901135 ) 1076
protein modification by small protein conjugation or removal ( GO:0070647 ) 1075
membrane organization ( GO:0061024 ) 897
protein catabolic process ( GO:0030163 ) 824
mRNA metabolic process ( GO:0016071 ) 812
mitotic cell cycle ( GO:0000278 ) 770
nucleobase-containing small molecule metabolic process ( GO:0055086 ) 755
inflammatory response ( GO:0006954 ) 738
chromatin organization ( GO:0006325 ) 722
translation ( GO:0006412 ) 674
carbohydrate metabolic process ( GO:0005975 ) 592
metal ion homeostasis ( GO:0055065 ) 583
mitochondrion organization ( GO:0007005 ) 564
cofactor metabolic process ( GO:0051186 ) 550
wound healing ( GO:0042060 ) 519
DNA repair ( GO:0006281 ) 505
defense response to other organism ( GO:0098542 ) 498
autophagy ( GO:0006914 ) 475
nucleocytoplasmic transport ( GO:0006913 ) 464
generation of precursor metabolites and energy ( GO:0006091 ) 460
regulation of mitotic cell cycle phase transition ( GO:1901990 ) 386
protein maturation ( GO:0051604 ) 367
cellular amino acid metabolic process ( GO:0006520 ) 359
cilium organization ( GO:0044782 ) 357
sulfur compound metabolic process ( GO:0006790 ) 356
chromosome segregation ( GO:0007059 ) 346
extracellular matrix organization ( GO:0030198 ) 333
ribosome biogenesis ( GO:0042254 ) 322
synapse organization ( GO:0050808 ) 295
protein glycosylation ( GO:0006486 ) 291
DNA replication ( GO:0006260 ) 291
DNA recombination ( GO:0006310 ) 257
microtubule-based movement ( GO:0007018 ) 235
protein localization to plasma membrane ( GO:0072659 ) 228
protein folding ( GO:0006457 ) 226
cell junction assembly ( GO:0034329 ) 203
tRNA metabolic process ( GO:0006399 ) 183
establishment or maintenance of cell polarity ( GO:0007163 ) 182
meiotic nuclear division ( GO:0140013 ) 167
organelle localization by membrane tethering ( GO:0140056 ) 160
mitochondrial gene expression ( GO:0140053 ) 147
telomere organization ( GO:0032200 ) 144
cytokinesis ( GO:0000910 ) 134
vitamin metabolic process ( GO:0006766 ) 133
collagen metabolic process ( GO:0032963 ) 115
Golgi organization ( GO:0007030 ) 108
detoxification ( GO:0098754 ) 102
snRNA metabolic process ( GO:0016073 ) 87
endosome organization ( GO:0007032 ) 75
cell redox homeostasis ( GO:0045454 ) 71
cilium movement ( GO:0003341 ) 67
lysosome localization ( GO:0032418 ) 67
lysosome organization ( GO:0007040 ) 56
protein destabilization ( GO:0031648 ) 43
chromosome condensation ( GO:0030261 ) 33
peroxisome organization ( GO:0007031 ) 31
protein palmitoylation ( GO:0018345 ) 28
cilium-dependent cell motility ( GO:0060285 ) 28
melanosome organization ( GO:0032438 ) 26
receptor localization to synapse ( GO:0097120 ) 25
catecholamine biosynthetic process ( GO:0042423 ) 20
lipid particle organization ( GO:0034389 ) 19
cell separation after cytokinesis ( GO:0000920 ) 17
protein localization to Golgi apparatus ( GO:0034067 ) 16
nitrogen cycle metabolic process ( GO:0071941 ) 15
regulation of DNA damage checkpoint ( GO:2000001 ) 15
snoRNA metabolic process ( GO:0016074 ) 14
lactate metabolic process ( GO:0006089 ) 13
ketone body metabolic process ( GO:1902224 ) 12
ethanol oxidation ( GO:0006069 ) 12
protein prenylation ( GO:0018342 ) 10
protein repair ( GO:0030091 ) 7
spermine metabolic process ( GO:0008215 ) 7
carnitine biosynthetic process ( GO:0045329 ) 5
epoxide metabolic process ( GO:0097176 ) 4
putrescine catabolic process ( GO:0009447 ) 3
glyoxylate catabolic process ( GO:0009436 ) 3
box H/ACA snoRNP assembly ( GO:0000493 ) 2
glycine betaine biosynthetic process from choline ( GO:0019285 ) 2
acetate biosynthetic process ( GO:0019413 ) 2
nitric oxide catabolic process ( GO:0046210 ) 1
protein biotinylation ( GO:0009305 ) 1
kynurenic acid biosynthetic process ( GO:0034276 ) 1
polyphosphate metabolic process ( GO:0006797 ) 1

ValWood commented 6 years ago

Looks really great!

ValWood commented 6 years ago

> do you want to plug the human numbers into the same graph spreadsheet that you already started? (so you get the same colours etc) here are the numbers:

I don't have a spreadsheet for this, it's just a crappy ppt mockup....

How do you think it will be best displayed?

ValWood commented 6 years ago

I think we need a red and a blue datapoint for March 2018 ;)

ValWood commented 6 years ago

human data and slim looks good too.....

ValWood commented 6 years ago

Just to check:

AL editing here to avoid future confusion:

total unannotated: 3606
unannotated: 2771 (no BP at all)
annotations to non-root term: 616 (not annotated in the slim, but they had non-root annotations that were not in the slim)
no non-root: 219 (has root annotation, no non-root annotations)

Antonialock commented 6 years ago

if you get me the cerevisiae number I can plug them in

Antonialock commented 6 years ago

see comment within your comment for clarification of the human annotation numbers

Antonialock commented 6 years ago

known vs unknown

Antonialock commented 6 years ago

What are the 31 missing in cerevisiae, @ValWood? For now I rounded "known" up so the total comes to 100.

ValWood commented 6 years ago

Looks brilliant! I will check the pombe and cerevisiae numbers.

ValWood commented 6 years ago

> what are the 31 missing in cerevisiae

the most recent numbers above were:

SGD: total 5915, slimmed 4900 (~83%), unslimmed 794+221 (1015)
PomBase: total 5070, slimmed 4336 (~85.5%), unslimmed 734+10 (744)

I will check them using your final slim so we use the same slim for everything. Can you send me just the IDs as a list?

Antonialock commented 6 years ago

here slim list used for human.txt

ValWood commented 6 years ago

did you definitely use my slim with terms added? I'm sure I had slimmed things which are now not slimming?

ValWood commented 6 years ago

So I need my list + your additions for human?

Antonialock commented 6 years ago

I used the list on this page as a base https://curation.pombase.org/pombase-trac/wiki/GOslims e.g.

GO:0140053 GO:0000278 GO:0006810 GO:0007010 GO:0006412 GO:0007031 GO:0030437 GO:0023052 GO:0006520 GO:0032200 GO:0016074 GO:0005975 GO:0070647 GO:0007059 GO:0030163 GO:0055086 GO:0006351 GO:0006260 GO:0071554 GO:1901990 GO:0140013 GO:0065003 GO:0071941 GO:0006355 GO:0006399 GO:0042254 GO:0006457 GO:0006486 GO:0016071 GO:0007005 GO:0006310 GO:1901135 GO:0000747 GO:0006913 GO:0006091 GO:0006914 GO:0098754 GO:0016192 GO:0051186 GO:0007163 GO:0061024 GO:0006629 GO:0006281 GO:0000910 GO:0051604 GO:0007155 GO:0055085 GO:0006766 GO:0006325 GO:0016073 GO:0006915 GO:0006790 GO:0055065 GO:0140056

Antonialock commented 6 years ago

my slim list is shown above (posted 9 days ago)

ValWood commented 6 years ago

but it excludes some of the terms in my extended slim.

Can you just send me your "additional" terms (otherwise I need to compare them one by one)?

(I want to report only a single slim in the paper, so I need to add the additional terms you used to my extended slim... just to ensure that nothing looks odd.)

ValWood commented 6 years ago

I used the list above, and some terms I used were missing. Sorry, this is getting confusing... just send me the list you added to my original list....

Antonialock commented 6 years ago

GO:0022414 GO:0032501 GO:0032502 GO:0002376 GO:0140053 GO:0000278 GO:0006810 GO:0007010 GO:0006412 GO:0007031 GO:0023052 GO:0006520 GO:0032200 GO:0016074 GO:0005975 GO:0070647 GO:0007059 GO:0030163 GO:0055086 GO:0006351 GO:0006260 GO:1901990 GO:0140013 GO:0065003 GO:0071941 GO:0006355 GO:0006399 GO:0042254 GO:0006457 GO:0006486 GO:0016071 GO:0007005 GO:0006310 GO:1901135 GO:0006913 GO:0006091 GO:0006914 GO:0098754 GO:0016192 GO:0051186 GO:0007163 GO:0061024 GO:0006629 GO:0006281 GO:0000910 GO:0051604 GO:0007155 GO:0055085 GO:0006766 GO:0006325 GO:0016073 GO:0006915 GO:0006790 GO:0055065 GO:0140056 GO:0000920 GO:0000493 GO:0070941 GO:0009305 GO:0018342 GO:0034389 GO:0034276 GO:0007032 GO:0030091 GO:0018345 GO:0006797 GO:0006089 GO:0072659 GO:0019413 GO:0009436 GO:0034079 GO:2000001 GO:0046210 GO:0008215 GO:0060285 GO:1902224 GO:0009447 GO:0044782 GO:0098542 GO:0034329 GO:0050808 GO:0042060 GO:0045329 GO:0019285 GO:0006069 GO:0032963 GO:0030198 GO:0007030 GO:0007040 GO:0032438 GO:0034067 GO:0045454 GO:0097176 GO:0042423 GO:0031648 GO:0007018 GO:0003341 GO:0032418 GO:0030261 GO:0097120 GO:0006954

Antonialock commented 6 years ago

that is the exact list I was using

Antonialock commented 6 years ago

I took your list, and added to it

Antonialock commented 6 years ago

and removed terms with zero annotations, e.g. flocculation? And some spore term, I guess.

Antonialock commented 6 years ago

but if you take your exact list (which I thought I was using? but maybe not) and subtract mine, you'll see the difference?
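The subtraction suggested here is a two-way set difference, sketched below on a short excerpt of IDs taken from the two lists in this thread rather than the full lists. `compare_slims` is a hypothetical helper name.

```python
def compare_slims(original, modified):
    """Return (removed, added) GO IDs between two whitespace-separated lists."""
    a, b = set(original.split()), set(modified.split())
    removed = sorted(a - b)  # in the original list but not the modified one
    added = sorted(b - a)    # in the modified list but not the original one
    return removed, added


# Short excerpts only; paste the full lists to compare them properly.
original_excerpt = "GO:0006810 GO:0030437 GO:0071554 GO:0000128"
modified_excerpt = "GO:0006810 GO:0032501 GO:0032502"

removed, added = compare_slims(original_excerpt, modified_excerpt)
print("removed:", removed)
print("added:", added)
```

The `added` list is the set of "additional" terms being asked for, and `removed` shows which terms would need to be put back for pombe and cerevisiae.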

ValWood commented 6 years ago

I wanted to use your list, but when I used it some things weren't slimming for cerevisiae and pombe. I know I needed to add some back (cell wall stuff, flocculation etc.), but I wasn't sure exactly which ones you removed.....