jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

VISUALISATION #666

EorgeKit closed this issue 1 year ago

EorgeKit commented 1 year ago

Hello @fpusan, is there a way to combine all the samples when visualising a coassembly run? I have a coassembly run whose sample names are very long. I want to: 1) combine all the results into a single average barplot of abundance at different levels, and 2) if possible, when plotting all the samples, either remove the sample names or modify them before visualising.

fpusan commented 1 year ago

Hi! This is possible, but a bit indirectly. You will need to use basic R to produce a matrix with the desired data, e.g. taxa in rows and the different samples (or averages of groups of samples) in columns. You can modify the column names of that matrix as you wish to control what is shown in the plot. The raw data are stored in SQM$taxa and SQM$functions (see the wiki or the manual for more details about the internal structure of the SQM object). Then you can call plotBars(my_new_matrix) to generate a custom barplot (or call ggplot2 or another plotting library directly).
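The reshaping step itself is simple, and a language-neutral sketch may help. The example below is plain Python rather than R, and it is not SQMtools code: the table contents, the long sample names and the `average_by_group` helper are all hypothetical. It only illustrates the idea of averaging groups of sample columns and giving the result short column names, which is what the new matrix passed to plotBars would contain.

```python
# Hypothetical taxa-by-sample abundances (rows = taxa, columns = samples).
# Every name and number here is made up for illustration.
table = {
    "Proteobacteria": {"very_long_name_site_A_rep1": 40.0,
                       "very_long_name_site_A_rep2": 60.0,
                       "very_long_name_site_B_rep1": 10.0},
    "Firmicutes":     {"very_long_name_site_A_rep1": 20.0,
                       "very_long_name_site_A_rep2": 30.0,
                       "very_long_name_site_B_rep1": 70.0},
}

# Short label -> original sample columns to average together.
# Use a single group holding every sample to get one overall average bar.
groups = {
    "A": ["very_long_name_site_A_rep1", "very_long_name_site_A_rep2"],
    "B": ["very_long_name_site_B_rep1"],
}


def average_by_group(table, groups):
    """Return a taxa-by-group table with the mean of each group's samples."""
    averaged = {}
    for taxon, row in table.items():
        averaged[taxon] = {
            label: sum(row[sample] for sample in samples) / len(samples)
            for label, samples in groups.items()
        }
    return averaged


averaged = average_by_group(table, groups)
for taxon, row in averaged.items():
    print(taxon, row)
```

The keys of `groups` become the new column names, so renaming samples and averaging them are handled by the same mapping.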

EorgeKit commented 1 year ago

Well noted. Additionally, how would you suggest I perform alpha and beta diversity analyses using the output of sqmreads2tables.py? I am confused about which file to use, since there are about three for each taxonomic level: allfilter, nofilter and prokfilter. Please advise.

EorgeKit commented 1 year ago

@fpusan @jtamames @ggnatalia

fpusan commented 1 year ago

Regarding the prokfilter/allfilter difference, see this excerpt from the PDF manual:

  • By default, SqueezeMeta applies Luo et al. (2014) identity cutoffs in order to assign an ORF to a given taxonomic rank (see explanation of the LCA algorithm). In our tests, these cutoffs resulted in a very low percentage of annotation for eukaryotic ORFs. To circumvent this issue, the .prokfilter. files generated by this script contain the aggregated taxonomic abundances obtained by applying Luo’s filter only to Bacteria and Archaea, but not to Eukaryotes.
  • SqueezeMeta uses NCBI’s nr database for taxonomic annotation, and reports the superkingdom, phylum, class, order, family, genus and species ranks. In some cases, the NCBI taxonomy is missing some intermediate ranks. For example, the NCBI taxonomy for the order Trichomonadida is:
      superkingdom: Eukaryota
      no rank: Parabasalia
      order: Trichomonadida
    NCBI does not assign Trichomonadida to any taxa in the class and phylum ranks. For clarity, the sqm2tables.py will indicate this by recycling the highest available taxonomy and adding the “(no <rank> in NCBI)” string after it. For example, ORFs that can be classified down to the Trichomonadida order (but are unclassified at the family level) will be reported as:
      superkingdom: Eukaryota
      phylum: Trichomonadida (no phylum in NCBI)
      class: Trichomonadida (no class in NCBI)
      order: Trichomonadida
      family: Unclassified Trichomonadida
      genus: Unclassified Trichomonadida
      species: Unclassified Trichomonadida

Normally the prokfilter tables are a safe bet. For diversity analyses I would use the abundance tables, probably excluding the Unclassified taxa. For alpha diversity I tend to use the Shannon index; for beta diversity I use a CLR transformation followed by Euclidean distances. However, note that the best statistical method will depend on what your data look like and what you want to achieve.
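As a rough illustration of those two statistics, here is a minimal plain-Python sketch. It is not part of SqueezeMeta or SQMtools. The counts are invented, the function names are hypothetical, and the pseudocount of 1 added before the CLR step (to avoid taking the log of zero) is an assumption; other ways of handling zeros exist and may suit your data better.

```python
import math

# Hypothetical read counts per taxon for three samples (made-up numbers).
counts = {
    "S1": {"Bacteroides": 120, "Prevotella": 30, "Escherichia": 0, "Unclassified": 50},
    "S2": {"Bacteroides": 10, "Prevotella": 90, "Escherichia": 40, "Unclassified": 60},
    "S3": {"Bacteroides": 100, "Prevotella": 35, "Escherichia": 5, "Unclassified": 20},
}


def drop_unclassified(sample):
    """Remove taxa whose name starts with 'Unclassified'."""
    return {t: c for t, c in sample.items() if not t.startswith("Unclassified")}


def shannon(values):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(values)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in values if v > 0)


def clr(values, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean of the logs."""
    logs = [math.log(v + pseudocount) for v in values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


filtered = {name: drop_unclassified(s) for name, s in counts.items()}
taxa = sorted({t for s in filtered.values() for t in s})

# Alpha diversity: one Shannon value per sample.
alpha = {name: shannon([s.get(t, 0) for t in taxa]) for name, s in filtered.items()}

# Beta diversity: Euclidean distance between CLR-transformed samples.
transformed = {name: clr([s.get(t, 0) for t in taxa]) for name, s in filtered.items()}
beta = {(a, b): euclidean(transformed[a], transformed[b])
        for a in transformed for b in transformed}

print(alpha)
print(beta[("S1", "S2")])
```

Both statistics are computed on the same taxon order (`taxa`), so a taxon missing from one sample is treated as a zero count there.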

EorgeKit commented 1 year ago

Thanks a lot @fpusan, I have managed the R manipulation and plotted the samples in combined form. However, I have searched for ways to calculate alpha and beta diversity from abundances and cannot find any; most methods use counts, which I do not think we have in our results. If you have any scripts that you use to analyze SqueezeMeta results with respect to alpha and beta diversity, please feel free to share. Thank you in advance.

fpusan commented 1 year ago

We do have the counts in our results. They are the tables named abund.

Kauthar-Omar commented 1 year ago

Hi @fpusan, I also have the same issue with performing alpha and beta diversity analyses on my sqmreads2tables.py output. It sounds easy to do until you realize that many tools out there expect output formats from qiime2 or dada2. If you have any insights on how to go about the two with SqueezeMeta results, it would really be helpful for us, the SqueezeMeta community, haha. I am on a time-sensitive project and have already spent some time trying to figure out how to do it with little to show for it, so pleaaaaase help. We appreciate all your efforts so far, thanks.

fpusan commented 1 year ago

Hi. In R, a good package for these analyses is vegan. The tables we have in SQMtools (e.g. SQM$taxa$genus$abund or SQM$functions$KEGG$abund) are compatible with it; you only need to transpose them. phyloseq is another option, since our tables are also good inputs to that package. QIIME and DADA2 deal with 16S data while SQMtools deals with metagenomes, so we can't easily go from one to the other.
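The transposition matters because the tables have taxa (or functions) in rows and samples in columns, while diversity functions typically expect one row per sample. A minimal plain-Python sketch of that reshaping follows. The tab-delimited layout, the header label and all values are assumptions made up for illustration, not the exact format written by SqueezeMeta.

```python
import csv
import io

# Hypothetical tab-delimited table: first column = taxon, then one column per sample.
raw = (
    "taxon\tsample1\tsample2\n"
    "Bacteroides\t120\t10\n"
    "Prevotella\t30\t90\n"
    "Escherichia\t0\t40\n"
)


def read_table(text):
    """Parse the text into (taxa, samples, matrix) with taxa in rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    samples = rows[0][1:]
    taxa = [row[0] for row in rows[1:]]
    matrix = [[int(value) for value in row[1:]] for row in rows[1:]]
    return taxa, samples, matrix


def transpose(matrix):
    """Swap rows and columns, so each row becomes one sample."""
    return [list(column) for column in zip(*matrix)]


taxa, samples, by_taxon = read_table(raw)
by_sample = transpose(by_taxon)

for name, row in zip(samples, by_sample):
    print(name, dict(zip(taxa, row)))
```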