pombase / website

PomBase website v2
MIT License

slimming tips updates #1520

Closed ValWood closed 4 years ago

ValWood commented 4 years ago

Some of this is out of date

https://www.pombase.org/browse-curation/fission-yeast-go-slimming-tips

I will suggest revisions shortly

ValWood commented 4 years ago

google doc with edits sent to Midori

ValWood commented 4 years ago

replying to @mah11 I do not have access to this document (or to google drive generally) via this email address. I would greatly prefer if you either put the text in the website ticket, or email it to me; when there are no other collaborators it doesn't make any sense to use google docs.


Hopefully you can see the edits here easily.

I use it because it is easier to see the changes.

These points will help you understand GO slims, and highlight some features of the fission yeast slim terms and annotations.

1. Note that the counts in the “Genes” column of the GO slim table are not additive, because many gene products are annotated to multiple terms.
2. It is not possible to create a slim with no overlaps between terms. Although the S. pombe slim has been defined to include biologically informative terms, and minimise overlaps between terms, large overlaps cannot be entirely avoided. For example, most of the gene products annotated to signal transduction are also annotated to other terms in the slim.

As a consequence of points 1 and 2, GO slim annotation summaries should not be presented using pie charts. Although a pie chart could show the fraction of total annotations for any slim term, it can too easily be mistaken for the fraction of total annotated gene products, which is not the same.

3. It is difficult to define a slim which includes all annotated gene products without including terms with very small numbers of annotations (for example, the peroxisome organization branch has very few annotations), or very high-level terms which are not particularly biologically informative (e.g. cellular process). Because we have opted not to include such terms, some gene products are annotated to process terms but do not appear in the slim annotation set.
4. Bear in mind that both proteins and RNAs can be annotated to GO terms. If you are working only with proteins you will need to make adjustments for this. For example, many tRNAs and rRNAs are annotated to cytoplasmic translation, and are therefore included in the slim set.
5. There is a difference between “unknown” and “unannotated”. All fission yeast and budding yeast gene products have been assessed and are classed as “unknown” for biological process if no biological process information is found (experimental or inferred). If you are making comparisons with other organisms, remember that it is possible that not all gene products have been assessed and that the “unknown” set is underestimated.
6. The default S. pombe slim includes all evidence codes for fission yeast. The evidence code IEA (inferred from electronic annotation) is often considered to be less accurate than other evidence codes, but it is very useful for increasing the coverage of some of the high-level GO terms. Accurate annotation counts for some terms currently depend on including this evidence code (for example, there are 16 gene products annotated to transmembrane transport with IEA evidence, which are not yet covered by a manual annotation). For fission yeast, the IEA annotations improve slim coverage, but only represent a small number of annotations (519 biological process annotations as of May 2020), and have a low rate of false positives. We therefore recommend that you include them.
7. If you are making comparisons with budding yeast (or other organisms), you should consider excluding the evidence code RCA. This evidence code is used for functional predictions, and has a very high rate of false positives (for example, including RCA for budding yeast will greatly and artificially inflate the number of annotations to translation).

Creating a user-defined slim

You can create your own slim, or retrieve slim annotations for a gene set, using online slimming tools such as the GOTermMapper at Princeton, or the QuickGO tool at the EBI (note that you need to use UniProt IDs, not PomBase IDs, in QuickGO).

When creating a slim for the entire genome, you should try to ensure that it covers as many annotated genes in your set as possible (see #3 in list above). You should be aware of how many genes are annotated but not in your slim, and how many are “unknown” (i.e., annotated only to the root node; see #5 in list above).

For display purposes, you usually want to keep the number of terms as small as possible to convey your results. However, you should ensure that the terms you include are specific enough to capture biologically relevant information. Many terms (e.g. metabolic process (3229 annotations), cellular process (4581 annotations)) are too general for the purpose of most slim-based analyses.

On a related note, if you are using your slim for data analysis (e.g. to summarize an enrichment), you should ensure that the terms are specific enough to demonstrate their relevance to the biological topic of interest. For example, lumping all genes involved in transport may mask overrepresentation of transmembrane transport vs. underrepresentation of vesicle-mediated transport in your results set, so you need to ensure that the slim has categories to represent your results effectively.

Most current implementations of software to create “GO slims” include the regulates relationship by default, so that (for example) genes involved in regulation of cytokinesis will be included with the set of genes annotated to cytokinesis. See the GO Ontology Relations documentation for further information about relationships in GO. The annotation totals presented for the PomBase default S. pombe slim, by contrast, are calculated both including and excluding the genes which are involved in a process via regulation only. We expect this distinction to be available in future versions of slimming software.
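
A small sketch of why the per-term gene counts are not additive, and why a pie chart of annotations can be misread as a pie chart of genes. The gene names and the gene-to-term assignments are made up for illustration; they are not PomBase data.

```python
# Hypothetical toy data: gene -> slim terms it is annotated to (not real PomBase data).
annotations = {
    "geneA": {"signal transduction", "transmembrane transport"},
    "geneB": {"signal transduction"},
    "geneC": {"transmembrane transport", "cytokinesis"},
    "geneD": {"cytokinesis"},
}


def genes_per_term(gene_to_terms):
    """Count the distinct genes annotated to each slim term."""
    counts = {}
    for terms in gene_to_terms.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return counts


counts = genes_per_term(annotations)

# Summing the "Genes" column counts a gene once per term it is annotated to.
sum_of_counts = sum(counts.values())
distinct_genes = len(annotations)

# The same term has a different share depending on the denominator:
# a slice of all annotations is not a slice of all annotated genes.
share_of_annotations = counts["signal transduction"] / sum_of_counts
share_of_genes = counts["signal transduction"] / distinct_genes

print(counts)
print(sum_of_counts, distinct_genes)
print(share_of_annotations, share_of_genes)
```

In this toy set the column total is larger than the number of genes, because geneA and geneC are each counted under two terms.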

ValWood commented 4 years ago

No the edits didn't show

mah11 commented 4 years ago

Just plug in what you want as the new text; I can diff it.

ValWood commented 4 years ago

These points will help you understand GO slims, and highlight some features of the fission yeast slim terms and annotations.

1. Note that the counts in the “Genes” column of the GO slim table are not additive, because many gene products are annotated to multiple terms.
2. It is not possible to create a slim with no overlaps between terms. Although the S. pombe slim has been defined to include biologically informative terms, and minimise overlaps between terms, large overlaps cannot be entirely avoided. For example, most of the gene products annotated to signal transduction are also annotated to other terms in the slim.

As a consequence of points 1 and 2, GO slim annotation summaries should not be presented using pie charts. Although a pie chart could show the fraction of total annotations for any slim term, it can too easily be mistaken for the fraction of total annotated gene products, which is not the same.

3. It is difficult to define a slim which includes all annotated gene products without including terms with very small numbers of annotations (for example, the peroxisome organization branch has very few annotations), or very high-level terms which are not particularly biologically informative (e.g. cellular process). Because we have opted not to include such terms, some gene products are annotated to process terms but do not appear in the slim annotation set.
4. Bear in mind that both proteins and RNAs can be annotated to GO terms. If you are working only with proteins you will need to make adjustments for this. For example, many tRNAs and rRNAs are annotated to cytoplasmic translation, and are therefore included in the slim set.
5. There is a difference between “unknown” and “unannotated”. All fission yeast and budding yeast gene products have been assessed and are classed as “unknown” for biological process if no biological process information is found (experimental or inferred). If you are making comparisons with other organisms, remember that it is possible that not all gene products have been assessed and that the “unknown” set is underestimated.
6. The default S. pombe slim includes all evidence codes for fission yeast. The evidence code IEA (inferred from electronic annotation) is often considered to be less accurate than other evidence codes, but it is very useful for increasing the coverage of some of the high-level GO terms. Accurate annotation counts for some terms currently depend on including this evidence code (for example, there are 16 gene products annotated to transmembrane transport with IEA evidence, which are not yet covered by a manual annotation). For fission yeast, the IEA annotations improve slim coverage, but only represent a small number of annotations (519 biological process annotations as of May 2020), and have a low rate of false positives. We therefore recommend that you include them.

Creating a user-defined slim

You can create your own slim, or retrieve slim annotations for a gene set, using online slimming tools such as the GOTermMapper at Princeton, or the QuickGO tool at the EBI (note that you need to use UniProt IDs, not PomBase IDs, in QuickGO).

When creating a slim for the entire genome, you should try to ensure that it covers as many annotated genes in your set as possible (see #3 in list above). You should be aware of how many genes are annotated but not in your slim, and how many are “unknown” (i.e., annotated only to the root node; see #5 in list above).

For display purposes, you usually want to keep the number of terms as small as possible to convey your results. However, you should ensure that the terms you include are specific enough to capture biologically relevant information. Many terms (e.g. metabolic process (3229 annotations), cellular process (4581 annotations)) are too general for the purpose of most slim-based analyses.

On a related note, if you are using your slim for data analysis (e.g. to summarize an enrichment), you should ensure that the terms are specific enough to demonstrate their relevance to the biological topic of interest. For example, lumping all genes involved in transport may mask overrepresentation of transmembrane transport vs. underrepresentation of vesicle-mediated transport in your results set, so you need to ensure that the slim has categories to represent your results effectively.

Most current implementations of software to create “GO slims” include the regulates relationship by default, so that (for example) genes involved in regulation of cytokinesis will be included with the set of genes annotated to cytokinesis. See the GO Ontology Relations documentation for further information about relationships in GO. The annotation totals presented for the PomBase default S. pombe slim are calculated both including and excluding the genes which are involved in a process via regulation only. We expect this distinction to be available in future versions of slimming software.
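
For anyone checking a user-defined slim by hand, here is a rough sketch of the bookkeeping described above: how many genes fall in the slim, how many are annotated but outside it, how many are “unknown” (root only), and how the picture shifts when an evidence code is excluded. Everything here is an assumption made for illustration: the `Annotation` record, the `slim_summary` helper, the gene names and the tiny slim are hypothetical, and the sketch assumes annotations have already been mapped up to slim terms (it does not walk the ontology or handle the regulates relationship).

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    gene: str
    term: str
    evidence: str


ROOT = "biological_process"  # root node of the process ontology

# Hypothetical toy annotations (not real PomBase or SGD data).
TOY_ANNOTATIONS = [
    Annotation("gene1", "transmembrane transport", "IEA"),
    Annotation("gene2", "transmembrane transport", "IDA"),
    Annotation("gene3", "translation", "RCA"),
    Annotation("gene4", ROOT, "ND"),           # assessed, nothing known
    Annotation("gene5", "cell adhesion", "IMP"),  # term left out of the toy slim
]
SLIM_TERMS = {"transmembrane transport", "translation"}


def slim_summary(annotations, slim_terms, excluded_evidence=frozenset()):
    """Sort genes into in-slim, unknown (root only) and annotated-but-not-in-slim."""
    terms_by_gene = defaultdict(set)
    for a in annotations:
        if a.evidence not in excluded_evidence:
            terms_by_gene[a.gene].add(a.term)
    in_slim = {g for g, terms in terms_by_gene.items() if terms & slim_terms}
    unknown = {g for g, terms in terms_by_gene.items() if terms == {ROOT}}
    not_in_slim = set(terms_by_gene) - in_slim - unknown
    return {"in_slim": in_slim, "unknown": unknown, "not_in_slim": not_in_slim}


all_evidence = slim_summary(TOY_ANNOTATIONS, SLIM_TERMS)
without_iea = slim_summary(TOY_ANNOTATIONS, SLIM_TERMS, {"IEA"})
without_rca = slim_summary(TOY_ANNOTATIONS, SLIM_TERMS, {"RCA"})

for label, summary in [("all", all_evidence), ("no IEA", without_iea), ("no RCA", without_rca)]:
    print(label, {k: sorted(v) for k, v in summary.items()})
```

In the toy data, dropping IEA removes gene1 from the slim even though its term is covered, which is the coverage loss the text warns about; dropping RCA removes gene3, the kind of predicted annotation that may be worth excluding in cross-species comparisons.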