jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

combining tables of bins from unique folders #646

Closed croth204 closed 1 year ago

croth204 commented 1 year ago

Hi, I was running the SqueezeMeta pipeline to create bins for each sample. Each sample was run separately, and when I tried to combine the tables from the separate project folders into a single file, the program complained:

"COG0468 (RecA/RadA) was not present in your data. This is weird, as RecA should be universal, so you probably just skipped COG annotation. Skipping copy number calculation..."

This happened when running the combine-sqm-tables.py command.
Is there a way to combine the resulting bin tables?

Thank you :)

fpusan commented 1 year ago

Hi,

No, there is no way to combine the bin tables at the moment. Combining different projects works only at the taxonomy/function levels.

The reason is the following. If the same species is present in more than one sample and we run SqueezeMeta separately for each sample, we will get one copy of each ORF/contig/bin per sample. Simply merging these results would leave the same feature present multiple times.

To avoid this, you need to collapse/dereplicate your data. This is not currently implemented in SqueezeMeta, so you will need to do it yourself. If you are interested in combining bins from different projects and analyzing them together, you will need to:

1) Collect all the fasta files for the bins from the results/bins directory of every project. Be careful, since bins from different projects may share the same file name.
2) Cluster all your bins into species (the standard ANI threshold would be 95%), and dereplicate each species. You can use something like dRep, which will produce a single representative genome for each species. Or you can use mOTUlizer + SuperPang to produce a non-redundant pangenome assembly for each species. In either case you will end up with one fasta file per species that has no duplicated regions.
3) Rename the contigs in each file so that contig names are unique and have no weird characters (ideally use only alphanumeric characters and underscores). Create a new fasta file concatenating the sequences from all the species.
4) Re-run SqueezeMeta using this concatenated fasta file as an external assembly, adding all your samples to the samples file. This will re-annotate all the contigs and estimate their abundances in the different samples. Add the flags --norename (to keep the contig names as they are) and -test 13 (to make SqueezeMeta stop before binning).
5) Manually create the directory results/bins in your new project, and add the individual files containing the contigs for the different species. This way you are tricking SqueezeMeta into thinking that each of your species is a bin.
6) Restart SqueezeMeta from step 16 and let it finish.
7) Load the project into SQMtools. The "bins" section of the SQM object will contain the information (abundance, coverage, etc.) for each of your species in the different samples. You can subset individual species using subsetBins. If you have further information about the different contigs (e.g. which ones are core or accessory, in case you ran SuperPang), you can further subset your data based on this.
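For the file handling in steps 1 and 3 above, a minimal Python sketch could look like the following. It is not part of SqueezeMeta. The function names, the `*.fa` glob pattern, and the `project__file` naming scheme are assumptions made for illustration, so adjust them to match your own directories. The clustering/dereplication in step 2 happens between the two functions and is not shown.

```python
import re
import shutil
from pathlib import Path


def collect_bins(project_dirs, out_dir, pattern="*.fa"):
    """Step 1: copy bin fasta files from each project's results/bins into out_dir.

    Each file is prefixed with the project directory name so that bins with
    the same file name in different projects do not overwrite each other.
    The "*.fa" pattern is an assumption; check your own results/bins folders.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for project in map(Path, project_dirs):
        for fasta in sorted((project / "results" / "bins").glob(pattern)):
            target = out_dir / f"{project.name}__{fasta.name}"
            shutil.copy(fasta, target)
            copied.append(target)
    return copied


def sanitize(name):
    """Replace anything other than letters, digits and underscores with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def concatenate_species(species_fastas, combined_fasta, renamed_dir):
    """Step 3: rename contigs and concatenate all species into one fasta file.

    New contig names are "<sanitized file stem>_<sanitized contig id>".
    A renamed copy of each species file is also written to renamed_dir, so
    the same contig names can later be reused for the files in results/bins.
    Returns the number of contigs written.
    """
    renamed_dir = Path(renamed_dir)
    renamed_dir.mkdir(parents=True, exist_ok=True)
    seen = set()
    with open(combined_fasta, "w") as combined:
        for fasta in map(Path, species_fastas):
            prefix = sanitize(fasta.stem)
            with open(fasta) as src, open(renamed_dir / fasta.name, "w") as single:
                for line in src:
                    if not line.strip():
                        continue  # skip blank lines
                    if line.startswith(">"):
                        fields = line[1:].split()
                        contig = sanitize(fields[0]) if fields else "contig"
                        new_name = f"{prefix}_{contig}"
                        if new_name in seen:
                            raise ValueError(f"duplicate contig name: {new_name}")
                        seen.add(new_name)
                        line = f">{new_name}"
                    # Normalize line endings so files without a trailing
                    # newline do not run into the next header.
                    out_line = line.rstrip("\n") + "\n"
                    combined.write(out_line)
                    single.write(out_line)
    return len(seen)
```

A typical use would be to call `collect_bins(["projA", "projB"], "all_bins")`, run the dereplication tool of your choice on `all_bins`, and then pass the resulting per-species fasta files to `concatenate_species`. The combined file is what goes into SqueezeMeta as the external assembly, and the files in `renamed_dir` are candidates for the results/bins directory in step 5.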

I know this is a handful. I actually plan on implementing this feature in a future version of SqueezeMeta and writing a decent tutorial on how to use the software for species-level and intra-species-level analysis, but I haven't found the time (and I may not in the near future).

However, I have done this myself a couple of times and the procedure works. See e.g. https://www.biorxiv.org/content/10.1101/2022.03.25.485477v1.full

timyerg commented 1 year ago

Hi! I am trying to implement the steps above (thank you for such detailed explanations!) and have a question regarding step 2. I would like to use the mOTUlizer + SuperPang option instead of dRep. If I understood correctly, the input should be the bins collected from all the projects as they are (except for renaming the bins so that the project name is added). Or should I run mOTUlizer before SuperPang to create mOTUs? Thank you in advance.

fpusan commented 1 year ago

You need to run mOTUlizer first, then run SuperPang independently for every mOTU/species. This will give you one pangenome assembly per species, which you can then combine as described above.
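To run SuperPang once per mOTU, the bins first have to be grouped by the mOTU they were assigned to. The sketch below shows one way to do that bookkeeping in Python. It assumes a hypothetical two-column tab-separated table (bin file name, mOTU identifier) that you would prepare yourself from the clustering results. This is not mOTUlizer's native output format, and the function names are illustrative. The SuperPang invocation itself is deliberately left out, so check the SuperPang documentation for it.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_bins_by_motu(table_path, bins_dir):
    """Read a (hypothetical) TSV of "bin_file<TAB>motu_id" and group bins by mOTU.

    Returns a dict mapping each mOTU id to the list of bin fasta paths
    assigned to it. Lines starting with '#' and incomplete rows are skipped.
    """
    groups = defaultdict(list)
    with open(table_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2 or row[0].startswith("#"):
                continue
            bin_file, motu = row[0], row[1]
            groups[motu].append(Path(bins_dir) / bin_file)
    return dict(groups)


def write_file_lists(groups, out_dir):
    """Write one text file per mOTU listing its bin fasta paths, one per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for motu, paths in sorted(groups.items()):
        list_file = out_dir / f"{motu}.txt"
        list_file.write_text("".join(f"{p}\n" for p in paths))
        written[motu] = list_file
    return written
```

Each resulting list then corresponds to one independent pangenome assembly run, giving one assembly per species as described in the reply above.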

timyerg commented 1 year ago

Dear @fpusan, thank you for the instructions. Finally it is done and bins/mOTUs now are pooled among the samples with taxonomy annotations and TPM/Coverage data. Issue can be closed (it is the same dataset as @croth204).

fpusan commented 1 year ago

Glad to hear! I will try to find the time to write some scripts to automate this.