Closed ValWood closed 6 years ago
@Antonialock this is what I did; you will need to follow a similar procedure for human.
~I started with the slim set here: https://curation.pombase.org/pombase-trac/wiki/GOslims see the lists 1) standard slim with swaps and 2) added for greater coverage for "unknowns" project~ See the list below
cerevisiae: results from slimming. Unslimmed but annotated genes: 242. I then checked whether we had missed anything well characterised in this list.
I figured that, largely, if the SGD curators had annotated a gene to the BP root node with ND, any mappings from other sources would be to fairly high-level terms.
I got the list of genes with a manual ND annotation to the BP root node, ran it through the enrichment tool to double check, and
subtracted it from the 'unslimmed' list.
This gave me a smaller list to evaluate (119). I ran enrichment on this list (P=1 to see all annotated terms), then scanned it to identify any terms that were not i) function in process, ii) response to…, or iii) high level (cellular process etc.).
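The subtraction step described above can be sketched with Python sets. This is only an illustration: the gene IDs are hypothetical placeholders, not real results, and `subtract_nd_genes` is an invented name rather than part of any tool mentioned in this thread.

```python
def subtract_nd_genes(unslimmed, nd_root):
    """Return the unslimmed genes that are not ND-annotated to the BP root, sorted."""
    return sorted(set(unslimmed) - set(nd_root))


# Hypothetical placeholder IDs, for illustration only
unslimmed = ["geneA", "geneB", "geneC", "geneD"]
nd_root = ["geneB", "geneD", "geneE"]

to_check = subtract_nd_genes(unslimmed, nd_root)
print(len(to_check), to_check)
```

The remaining genes are the ones worth running through enrichment and checking by eye.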
~These terms have fairly specific annotation so I will add to the list GO:0072659 protein localization to plasma membrane GO:0019413 acetate biosynthetic process GO:0009436 glyoxylate catabolic process GO:0034079 butanediol biosynthetic process (energy generation) GO:1901426 response to furfural (these are really detoxification) GO:0018890 cyanamide metabolic process (really cellular detoxification) GO:0006276 plasmid maintenance GO:2000001 regulation of DNA damage checkpoint GO:0009636 response to toxic substance YNR064C, YMR074C, YOL052C-A, YHL010C (really detoxification) GO:0071218 cellular response to misfolded protein~ double checked, all these are in
SGD: total 5915, slimmed 4900 (~83%), unslimmed 794+221 (1015)
PomBase: total 5070, slimmed 4336 (~85.5%), unslimmed 734+10 (744)
Note: this is slightly different from https://www.pombase.org/browse-curation/fission-yeast-go-slim-terms because the terms are slightly more general. That page gives: Protein coding genes not covered by the slim (750 in total): gene products with biological process annotation, but not in any of the categories above: 27; gene products with no biological process annotation: 723.
I will rerun pombe and cerevisiae tomorrow. Antonia can you
@Antonialock you mentioned that I hadn't done the instructions, but they are above? Can you do the bit for human (with the additional terms we discussed, let me know if anything isn't clear)? I'm rechecking pombe and cerevisiae now...
This is the current list from https://curation.pombase.org/pombase-trac/wiki/GOslims after discounting all of the uninformative terms, and checking that nothing known is missed by enrichment.
GO:0140053
GO:0000278
GO:0006810
GO:0007010
GO:0006412
GO:0007031
GO:0030437
GO:0023052
GO:0006520
GO:0032200
GO:0016074
GO:0005975
GO:0070647
GO:0007059
GO:0030163
GO:0055086
GO:0006351
GO:0006260
GO:0071554
GO:1901990
GO:0140013
GO:0006461
GO:0071941
GO:0006355
GO:0006399
GO:0042254
GO:0006457
GO:0006486
GO:0016071
GO:0007005
GO:0006310
GO:1901135
GO:0000747
GO:0006913
GO:0006091
GO:0006914
GO:0098754
GO:0016192
GO:0051186
GO:0007163
GO:0061024
GO:0006629
GO:0006281
GO:0000910
GO:0051604
GO:0007155
GO:0055085
GO:0006766
GO:0006325
GO:0016073
GO:0006915
GO:0006790
GO:0055065
GO:0140056
GO:0000920
GO:0000493
GO:0070941
GO:0007124
GO:0009305
GO:0018342
GO:0000128
GO:0034389
GO:0034276
GO:0007032
GO:0030091
GO:0018345
GO:0006797
GO:0006089
GO:0072659
GO:0019413
GO:0009436
GO:0034079
GO:1901426
GO:0018890
GO:0006276
GO:2000001
GO:0009636
GO:2000001
GO:0071218
GO:0046210
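Before pasting a list like this into a slimming tool, it may be worth checking it for malformed or repeated IDs. A minimal sketch, assuming IDs of the form `GO:` plus seven digits; `check_slim` is an illustrative name, and only the last six IDs of the list above are used as input.

```python
import re
from collections import Counter

GO_ID = re.compile(r"^GO:\d{7}$")


def check_slim(ids):
    """Return (malformed IDs, IDs listed more than once), both in input order."""
    malformed = [i for i in ids if not GO_ID.match(i)]
    counts = Counter(ids)
    duplicated = [i for i in counts if counts[i] > 1]
    return malformed, duplicated


# The last six IDs of the list above
tail = [
    "GO:0006276", "GO:2000001", "GO:0009636",
    "GO:2000001", "GO:0071218", "GO:0046210",
]
malformed, duplicated = check_slim(tail)
print(malformed, duplicated)
```

A repeated ID is harmless to most tools, but it makes the term count look one higher than it is.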
What slimming tools are you using? I keep getting an error message from http://go.princeton.edu/cgi-bin/GOTermMapper
maybe I'm doing something wrong? I input the primary gene names for protein coding genes reported by HGNC (downloaded here: https://www.genenames.org/cgi-bin/statistics )
I enter the slim terms (above + multicellular-specific terms)
I use the GOA_human_GAF downloaded from here: http://geneontology.org/page/download-go-annotations
I don't think this will work because the file has UniProt IDs... it also has 29082 lines, which is quite a lot more than the number of human genes (that's why you are using HGNC IDs; they should be a 1:1 list).
Therefore you will need to select a data option for goa_human_hgnc (this will recognise the HGNC IDs). This will seem like you are using the hgnc slim, but you aren't, because you override that in the advanced options. It's very confusing....
This will then use the current contents of the GO database mapped to HGNC ID set....
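To see how far the number of annotation rows is from the number of distinct identifiers, one could count both directly. A minimal sketch, assuming the standard GAF layout (tab-separated, comment lines starting with `!`, DB Object ID in the second column); the sample rows are hypothetical placeholders, not real annotations.

```python
def gaf_summary(lines):
    """Count annotation rows and distinct DB Object IDs (GAF column 2)."""
    rows = 0
    ids = set()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        rows += 1
        ids.add(cols[1])
    return rows, len(ids)


# Hypothetical placeholder rows (IDs, symbols and GO terms are not real)
sample = [
    "!gaf-version: 2.1\n",
    "UniProtKB\tX00001\tGENE1\t\tGO:0000000\tREF:0\tIEA\t\tP\n",
    "UniProtKB\tX00001\tGENE1\t\tGO:0000000\tREF:0\tIBA\t\tP\n",
    "UniProtKB\tX00002\tGENE2\t\tGO:0000000\tREF:0\tIEA\t\tP\n",
]
rows, n_ids = gaf_summary(sample)
print(rows, n_ids)
```

For a downloaded file, the same function can be passed an open file handle instead of the sample list.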
It looks like it ignores IEA and IBA annotations e.g. this gene doesn't slim https://www.ncbi.nlm.nih.gov/gene/127550
is that as expected?
you can select the evidence codes included, are they all selected? (it includes IEA when I use it?)
I can't see that human gene in the GO database...that's probably why. I didn't say this would be straightforward... you need to contact GO helpdesk for that one...
actually you can't select evidence for the slimmer, I'm thinking of the enrichment tool.
It's probably because the slimmer tool isn't aware of IBA? do you have an example of a missing IEA (this gene only seems to have IBA).
If so, you will need to mail gotools and tell them to include IBA and any other codes....
It has glycosphingolipid biosynthetic process | IEA carbohydrate metabolic process | IEA ?
at least that's what's shown on the entrez gene page
oh I see, in amigo it only has IBA. So why does entrez show IEAs? argh, so confusing
Mail gotools and check which evidence codes are included (they will probably get back to you today). Mail GO and ask why human IEAs are not in the GO database. Welcome to my world....
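While waiting for a reply, one way to find genes affected by the evidence-code question is to tally the codes per gene product straight from a GAF file. This is a rough sketch, assuming the standard GAF columns (evidence code in column 7, aspect in column 9); `evidence_by_gene` and `only_with` are illustrative names and the sample rows are hypothetical placeholders.

```python
from collections import defaultdict


def evidence_by_gene(lines, aspect="P"):
    """Map each DB Object ID to the set of evidence codes used for one aspect."""
    codes = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue  # header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 8 and cols[8] == aspect:
            codes[cols[1]].add(cols[6])
    return dict(codes)


def only_with(codes, allowed):
    """IDs whose evidence codes all fall within `allowed`, sorted."""
    return sorted(g for g, ev in codes.items() if ev <= set(allowed))


# Hypothetical placeholder rows (IDs, symbols and GO terms are not real)
sample = [
    "UniProtKB\tX00001\tGENE1\t\tGO:0000000\tREF:0\tIBA\t\tP\n",
    "UniProtKB\tX00002\tGENE2\t\tGO:0000000\tREF:0\tIEA\t\tP\n",
    "UniProtKB\tX00002\tGENE2\t\tGO:0000000\tREF:0\tIDA\t\tP\n",
    "UniProtKB\tX00003\tGENE3\t\tGO:0000000\tREF:0\tIBA\t\tF\n",
]
codes = evidence_by_gene(sample)
print(only_with(codes, {"IBA"}))
```

Genes returned for `{"IBA"}` or `{"IEA", "IBA"}` are the ones that would fail to slim if a tool ignores those codes.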
@Antonialock an alternative is to try the QuickGO slimmer. It will work with the ID set (the reason I never use it for pombe is that we don't use UniProt IDs for GO). It will only be possible if it provides a list of "unslimmed genes".
I'm pretty sure from memory that it does because Jane and I used this when we were building the generic slim.
Well unfortunately the QuickGO slimming tool is broken. I sent them a message
"Hi. I'm trying to use the slimming tool but am having multiple problems https://www.ebi.ac.uk/QuickGO/slimming
I uploaded my own set of BP terms to use as the slimming set. I then wanted to slim using my own list of UniProt IDs, but got an error message saying I need to limit my own set of gene IDs to 500. I then tried to filter on the "human reference set" but got this error message: "failed to fetch REST response due to: org.springframework.web.client.HttpClientErrorException: 400 bad request""
Note to self
The number of human genes that we want to include is 19674
The list can be retrieved using this search: NOT existence:uncertain AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640
Removing the "existence:uncertain" entries drops the number of genes down from 20245.
Ah I tried again and it "worked" using the QuickGO filter for human gene products.
Unfortunately it looks like rubbish.... In the summary of results "Your current result set contains 20,794 annotations to 1,185 distinct gene products."
I think I worked out it is because it doesn't include "regulates"
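If the low count really is down to "regulates" links not being followed (that is the guess above, not something confirmed), the effect can be shown on a toy example. The ontology, term names and genes below are entirely hypothetical, and this is a simplified sketch of slimming, not how QuickGO is implemented.

```python
def ancestors(term, edges, relations):
    """Return `term` plus every ancestor reachable over the allowed relations."""
    found = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in edges.get(current, []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def slimmed_genes(annotations, slim, edges, relations):
    """Genes with at least one annotation that maps up to a slim term."""
    return {
        gene
        for gene, terms in annotations.items()
        if any(ancestors(t, edges, relations) & slim for t in terms)
    }


# Hypothetical toy ontology: child -> [(relation, parent), ...]
edges = {
    "TOY:regulation_of_X": [("regulates", "TOY:X")],
    "TOY:X_subtype": [("is_a", "TOY:X")],
    "TOY:X": [("is_a", "TOY:root")],
}
annotations = {"gene1": ["TOY:X_subtype"], "gene2": ["TOY:regulation_of_X"]}
slim = {"TOY:X"}

without_regulates = slimmed_genes(annotations, slim, edges, {"is_a", "part_of"})
with_regulates = slimmed_genes(annotations, slim, edges, {"is_a", "part_of", "regulates"})
print(sorted(without_regulates), sorted(with_regulates))
```

In this toy case the gene annotated only to a "regulation of" term drops out of the slim as soon as "regulates" is excluded.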
!#%!##
I know, I know, it's all crazy... hopefully term mapper will be up to date soon.
So should "response to furfural", plasmid maintenance etc. be discarded? They are included in your list above, where you say you have discarded uninformative terms: "This is the current list from https://curation.pombase.org/pombase-trac/wiki/GOslims after discounting all of the uninformative terms, and checking that nothing known is missed by enrichment."
Do you want to plug the human numbers into the same graph spreadsheet that you already started (so you get the same colours etc.)? Here are the numbers:
total unannotated | 3606 |
---|---|
unannotated | 2771 |
annotations to root term | 616 |
no non-root | 219 |
number of genes | 19730 |
% unknown | 18.27673594 |
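The percentage in the table follows directly from the three counts. A quick check, using the row labels as posted above:

```python
# Counts copied from the table above
breakdown = {
    "unannotated": 2771,
    "annotations to root term": 616,
    "no non-root": 219,
}
total_genes = 19730

total_unknown = sum(breakdown.values())
percent_unknown = 100 * total_unknown / total_genes
print(total_unknown, round(percent_unknown, 2))
```

The three rows sum to the "total unannotated" figure, and dividing by the number of genes reproduces the "% unknown" value.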
and annotations
GO term | GO term usage in gene list |
---|---|
multicellular organismal process ( GO:0032501 ) | 7161 |
signaling ( GO:0023052 ) | 6427 |
developmental process ( GO:0032502 ) | 5983 |
regulation of transcription, DNA-templated ( GO:0006355 ) | 3579 |
transport ( GO:0006810 ) | 3080 |
immune system process ( GO:0002376 ) | 2920 |
transcription, DNA-templated ( GO:0006351 ) | 2579 |
vesicle-mediated transport ( GO:0016192 ) | 1963 |
apoptotic process ( GO:0006915 ) | 1867 |
protein-containing complex assembly ( GO:0065003 ) | 1660 |
transmembrane transport ( GO:0055085 ) | 1505 |
lipid metabolic process ( GO:0006629 ) | 1402 |
cell adhesion ( GO:0007155 ) | 1322 |
reproductive process ( GO:0022414 ) | 1320 |
cytoskeleton organization ( GO:0007010 ) | 1197 |
carbohydrate derivative metabolic process ( GO:1901135 ) | 1076 |
protein modification by small protein conjugation or removal ( GO:0070647 ) | 1075 |
membrane organization ( GO:0061024 ) | 897 |
protein catabolic process ( GO:0030163 ) | 824 |
mRNA metabolic process ( GO:0016071 ) | 812 |
mitotic cell cycle ( GO:0000278 ) | 770 |
nucleobase-containing small molecule metabolic process ( GO:0055086 ) | 755 |
inflammatory response ( GO:0006954 ) | 738 |
chromatin organization ( GO:0006325 ) | 722 |
translation ( GO:0006412 ) | 674 |
carbohydrate metabolic process ( GO:0005975 ) | 592 |
metal ion homeostasis ( GO:0055065 ) | 583 |
mitochondrion organization ( GO:0007005 ) | 564 |
cofactor metabolic process ( GO:0051186 ) | 550 |
wound healing ( GO:0042060 ) | 519 |
DNA repair ( GO:0006281 ) | 505 |
defense response to other organism ( GO:0098542 ) | 498 |
autophagy ( GO:0006914 ) | 475 |
nucleocytoplasmic transport ( GO:0006913 ) | 464 |
generation of precursor metabolites and energy ( GO:0006091 ) | 460 |
regulation of mitotic cell cycle phase transition ( GO:1901990 ) | 386 |
protein maturation ( GO:0051604 ) | 367 |
cellular amino acid metabolic process ( GO:0006520 ) | 359 |
cilium organization ( GO:0044782 ) | 357 |
sulfur compound metabolic process ( GO:0006790 ) | 356 |
chromosome segregation ( GO:0007059 ) | 346 |
extracellular matrix organization ( GO:0030198 ) | 333 |
ribosome biogenesis ( GO:0042254 ) | 322 |
synapse organization ( GO:0050808 ) | 295 |
protein glycosylation ( GO:0006486 ) | 291 |
DNA replication ( GO:0006260 ) | 291 |
DNA recombination ( GO:0006310 ) | 257 |
microtubule-based movement ( GO:0007018 ) | 235 |
protein localization to plasma membrane ( GO:0072659 ) | 228 |
protein folding ( GO:0006457 ) | 226 |
cell junction assembly ( GO:0034329 ) | 203 |
tRNA metabolic process ( GO:0006399 ) | 183 |
establishment or maintenance of cell polarity ( GO:0007163 ) | 182 |
meiotic nuclear division ( GO:0140013 ) | 167 |
organelle localization by membrane tethering ( GO:0140056 ) | 160 |
mitochondrial gene expression ( GO:0140053 ) | 147 |
telomere organization ( GO:0032200 ) | 144 |
cytokinesis ( GO:0000910 ) | 134 |
vitamin metabolic process ( GO:0006766 ) | 133 |
collagen metabolic process ( GO:0032963 ) | 115 |
Golgi organization ( GO:0007030 ) | 108 |
detoxification ( GO:0098754 ) | 102 |
snRNA metabolic process ( GO:0016073 ) | 87 |
endosome organization ( GO:0007032 ) | 75 |
cell redox homeostasis ( GO:0045454 ) | 71 |
cilium movement ( GO:0003341 ) | 67 |
lysosome localization ( GO:0032418 ) | 67 |
lysosome organization ( GO:0007040 ) | 56 |
protein destabilization ( GO:0031648 ) | 43 |
chromosome condensation ( GO:0030261 ) | 33 |
peroxisome organization ( GO:0007031 ) | 31 |
protein palmitoylation ( GO:0018345 ) | 28 |
cilium-dependent cell motility ( GO:0060285 ) | 28 |
melanosome organization ( GO:0032438 ) | 26 |
receptor localization to synapse ( GO:0097120 ) | 25 |
catecholamine biosynthetic process ( GO:0042423 ) | 20 |
lipid particle organization ( GO:0034389 ) | 19 |
cell separation after cytokinesis ( GO:0000920 ) | 17 |
protein localization to Golgi apparatus ( GO:0034067 ) | 16 |
nitrogen cycle metabolic process ( GO:0071941 ) | 15 |
regulation of DNA damage checkpoint ( GO:2000001 ) | 15 |
snoRNA metabolic process ( GO:0016074 ) | 14 |
lactate metabolic process ( GO:0006089 ) | 13 |
ketone body metabolic process ( GO:1902224 ) | 12 |
ethanol oxidation ( GO:0006069 ) | 12 |
protein prenylation ( GO:0018342 ) | 10 |
protein repair ( GO:0030091 ) | 7 |
spermine metabolic process ( GO:0008215 ) | 7 |
carnitine biosynthetic process ( GO:0045329 ) | 5 |
epoxide metabolic process ( GO:0097176 ) | 4 |
putrescine catabolic process ( GO:0009447 ) | 3 |
glyoxylate catabolic process ( GO:0009436 ) | 3 |
box H/ACA snoRNP assembly ( GO:0000493 ) | 2 |
glycine betaine biosynthetic process from choline ( GO:0019285 ) | 2 |
acetate biosynthetic process ( GO:0019413 ) | 2 |
nitric oxide catabolic process ( GO:0046210 ) | 1 |
protein biotinylation ( GO:0009305 ) | 1 |
kynurenic acid biosynthetic process ( GO:0034276 ) | 1 |
polyphosphate metabolic process ( GO:0006797 ) | 1 |
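To get these counts into a spreadsheet without retyping, the table rows can be parsed into CSV. A minimal sketch, assuming each row has the form `term name ( GO:nnnnnnn ) | count |` as above; `parse_rows` and `to_csv` are illustrative names, and only two rows are used as sample input.

```python
import csv
import io
import re

ROW = re.compile(
    r"^(?P<name>.+?)\s*\(\s*(?P<go_id>GO:\d{7})\s*\)\s*\|\s*(?P<count>\d+)\s*\|?\s*$"
)


def parse_rows(lines):
    """Extract (GO ID, term name, gene count) from table rows; skip other lines."""
    rows = []
    for line in lines:
        m = ROW.match(line.strip())
        if m:
            rows.append((m["go_id"], m["name"], int(m["count"])))
    return rows


def to_csv(rows):
    """Render the parsed rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["go_id", "term", "genes"])
    writer.writerows(rows)
    return buf.getvalue()


table = [
    "GO term | GO term usage in gene list |",
    "---|---|",
    "multicellular organismal process ( GO:0032501 ) | 7161 |",
    "regulation of transcription, DNA-templated ( GO:0006355 ) | 3579 |",
]
rows = parse_rows(table)
print(to_csv(rows))
```

The `csv` module quotes term names that contain commas, so they stay in one column when the file is opened in a spreadsheet.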
Looks really great!
"do you want to plug in the human numbers into the same graph spreadsheet that you already started?"
I don't have a spreadsheet for this, it's just a crappy ppt mockup....
How do you think it will be best displayed?
I think we need a red and a blue datapoint for March 2018 ;)
human data and slim looks good too.....
Just to check:
AL editing here to avoid future confusion:
total unannotated | 3606 | |
---|---|---|
unannotated | 2771 | no BP at all (unannotated) |
annotations to non-root term | 616 | has annotation to non-root (not annotated in the slim, but they had non-root annotations that were not in the slim) |
no non-root | 219 | has root annotation (no non-root annotations) |
If you get me the cerevisiae numbers I can plug them in.
see comment within your comment for clarification of the human annotation numbers
What are the 31 missing in cerevisiae, @ValWood? For now I rounded "known" to make it up to 100.
Looks brilliant! I will check the pombe and cerevisiae numbers.
what are the 31 missing in cerevisiae
the most recent numbers above were:
SGD: total 5915, slimmed 4900 (~83%), unslimmed 794+221 (1015)
PomBase: total 5070, slimmed 4336 (~85.5%), unslimmed 734+10 (744)
I will check them using your final slim so we use the same slim for everything. Can you send me just the IDs as a list?
did you definitely use my slim with terms added? I'm sure I had slimmed things which are now not slimming?
So I need my list + your additions for human?
I used the list on this page as a base https://curation.pombase.org/pombase-trac/wiki/GOslims e.g.
GO:0140053 GO:0000278 GO:0006810 GO:0007010 GO:0006412 GO:0007031 GO:0030437 GO:0023052 GO:0006520 GO:0032200 GO:0016074 GO:0005975 GO:0070647 GO:0007059 GO:0030163 GO:0055086 GO:0006351 GO:0006260 GO:0071554 GO:1901990 GO:0140013 GO:0065003 GO:0071941 GO:0006355 GO:0006399 GO:0042254 GO:0006457 GO:0006486 GO:0016071 GO:0007005 GO:0006310 GO:1901135 GO:0000747 GO:0006913 GO:0006091 GO:0006914 GO:0098754 GO:0016192 GO:0051186 GO:0007163 GO:0061024 GO:0006629 GO:0006281 GO:0000910 GO:0051604 GO:0007155 GO:0055085 GO:0006766 GO:0006325 GO:0016073 GO:0006915 GO:0006790 GO:0055065 GO:0140056
my slim list is shown above (posted 9 days ago)
but it excludes some of the terms in my extended slim.
Can you just send me your "additional" terms (otherwise I need to compare them one by one).
(I want to only report a single slim in the paper, so I need to just add the additional terms you used to my extended slim... just to ensure that nothing looks odd).
I used the list above, and some terms I used were missing. Sorry this is getting confusing... just send me the list you added to my original list....
GO:0022414 GO:0032501 GO:0032502 GO:0002376 GO:0140053 GO:0000278 GO:0006810 GO:0007010 GO:0006412 GO:0007031 GO:0023052 GO:0006520 GO:0032200 GO:0016074 GO:0005975 GO:0070647 GO:0007059 GO:0030163 GO:0055086 GO:0006351 GO:0006260 GO:1901990 GO:0140013 GO:0065003 GO:0071941 GO:0006355 GO:0006399 GO:0042254 GO:0006457 GO:0006486 GO:0016071 GO:0007005 GO:0006310 GO:1901135 GO:0006913 GO:0006091 GO:0006914 GO:0098754 GO:0016192 GO:0051186 GO:0007163 GO:0061024 GO:0006629 GO:0006281 GO:0000910 GO:0051604 GO:0007155 GO:0055085 GO:0006766 GO:0006325 GO:0016073 GO:0006915 GO:0006790 GO:0055065 GO:0140056 GO:0000920 GO:0000493 GO:0070941 GO:0009305 GO:0018342 GO:0034389 GO:0034276 GO:0007032 GO:0030091 GO:0018345 GO:0006797 GO:0006089 GO:0072659 GO:0019413 GO:0009436 GO:0034079 GO:2000001 GO:0046210 GO:0008215 GO:0060285 GO:1902224 GO:0009447 GO:0044782 GO:0098542 GO:0034329 GO:0050808 GO:0042060 GO:0045329 GO:0019285 GO:0006069 GO:0032963 GO:0030198 GO:0007030 GO:0007040 GO:0032438 GO:0034067 GO:0045454 GO:0097176 GO:0042423 GO:0031648 GO:0007018 GO:0003341 GO:0032418 GO:0030261 GO:0097120 GO:0006954
that is the exact list I was using
I took your list, and added to it
and removed terms with zero annotations, e.g. flocculation? and, I guess, some spore term,
but if you take your exact list (which I thought I was using? but maybe not) and subtract mine, you'll see the difference?
I wanted to use your list, but when I used it some things weren't slimming for cerevisiae and pombe. I know I needed to add some back (cell wall stuff, flocculation etc.), but I wasn't sure exactly which ones you removed.....
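Rather than comparing the two slims one term at a time, the difference can be computed in both directions. A minimal sketch: `compare_slims` is an illustrative name, and the two strings are short excerpts of the lists posted in this thread, not the full lists.

```python
def parse_go_ids(text):
    """Split whitespace-separated GO IDs, dropping repeats but keeping order."""
    return list(dict.fromkeys(text.split()))


def compare_slims(base_text, extended_text):
    """Return (IDs only in the base slim, IDs only in the extended slim)."""
    base = parse_go_ids(base_text)
    extended = parse_go_ids(extended_text)
    base_set, extended_set = set(base), set(extended)
    removed = [t for t in base if t not in extended_set]
    added = [t for t in extended if t not in base_set]
    return removed, added


# Short excerpts only; paste the full lists in place of these strings
base_excerpt = "GO:0007031 GO:0030437 GO:0023052 GO:0071554 GO:0006461 GO:0000128"
extended_excerpt = "GO:0022414 GO:0032501 GO:0007031 GO:0023052 GO:0065003"

removed, added = compare_slims(base_excerpt, extended_excerpt)
print("in base only:", removed)
print("in extended only:", added)
```

Run on the full lists, the first result gives the terms that would need adding back for cerevisiae and pombe, and the second gives the human-specific additions.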
Mock up. I will update the pombe and the cerevisiae data. Antonia will prepare human data and the figure