soedinglab / plass

sensitive and precise assembly of short sequencing reads
https://plass.mmseqs.com
GNU General Public License v3.0

High level of duplicated protein sequences #41

Status: Closed (hegardon closed this issue 7 months ago)

hegardon commented 1 year ago

Hi, I am using PLASS (v4.687d7) on a set of metagenomes from ~100 cheese samples. It works very well, but I have some questions. In each dataset, a large share of the protein sequences (30% on average) are duplicated, with 100% identity and coverage. I understand that some sequences could be duplicated (for example, when they originate from closely related species), but 30% seems quite high. Another issue is the total amount of assembled amino acids. As an example, from an initial dataset of 18 million reads (2x150 bp paired-end reads, 2.7 Gbp in total), 7 million proteins are assembled (2e+9 aa in total). That is almost as many amino acids as input nucleotides, which seems to me like more than expected. Is there an explanation for these results?
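For anyone wanting to reproduce the duplication figure on their own output, here is a minimal Python sketch that counts exact duplicates and total residues in a protein FASTA file. The function names (`read_fasta`, `duplicate_stats`) are made up for illustration and are not part of Plass. It also assumes that "duplicated" means the fraction of records that repeat an already-seen sequence exactly, which may differ from how the 30% above was computed.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def duplicate_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Summarize exact-duplicate sequences (hypothetical helper, not a Plass API)."""
    counts = Counter(seq for _, seq in read_fasta(lines))
    total = sum(counts.values())   # number of FASTA records
    unique = len(counts)           # number of distinct sequences
    total_aa = sum(len(seq) * n for seq, n in counts.items())
    # Assumption: a record is "redundant" if an identical sequence was already seen.
    redundant_fraction = (total - unique) / total if total else 0.0
    return {
        "total": total,
        "unique": unique,
        "redundant_fraction": redundant_fraction,
        "total_aa": total_aa,
    }
```

It could be called on the assembly output from the command in this issue, for example `with open("METAG_out.fasta") as fh: print(duplicate_stats(fh))`. Note that it keeps every distinct sequence in memory, which may be heavy for millions of proteins.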

I am using PLASS with the following command (other parameters left at their defaults): plass assemble METAG_R1.fastq.gz METAG_R2.fastq.gz METAG_out.fasta -e 0.001 --num-iterations 12 --filter-proteins 1 --remove-tmp-files 1

Thanks, Helene

milot-mirdita commented 1 year ago

Since Plass can reuse each read in every iteration, it tends to create a lot of variants that are not necessarily useful. We generally use mmseqs linclust to remove fragments afterwards.
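As a rough illustration of that post-processing step, the Python sketch below builds and runs an `mmseqs easy-linclust` call on the assembly. The helper names, the output prefix, and the thresholds (`--min-seq-id 0.9`, `-c 0.9`) are assumptions chosen for the example, not values recommended in this thread. `--cov-mode 1` requests coverage of the target sequence, which, as far as I know, is the mode usually suggested in the MMseqs2 documentation when clustering fragments.

```python
import shutil
import subprocess
from typing import List


def linclust_command(
    fasta: str,
    out_prefix: str,
    tmp_dir: str,
    min_seq_id: float = 0.9,   # assumed threshold, tune for your data
    coverage: float = 0.9,     # assumed threshold, tune for your data
) -> List[str]:
    """Build an `mmseqs easy-linclust` command line (hypothetical helper)."""
    return [
        "mmseqs", "easy-linclust", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", "1",     # coverage is measured on the target sequence
    ]


def run_linclust(fasta: str, out_prefix: str, tmp_dir: str) -> None:
    """Run the clustering if the mmseqs binary is available on PATH."""
    if shutil.which("mmseqs") is None:
        raise RuntimeError("mmseqs not found on PATH")
    subprocess.run(linclust_command(fasta, out_prefix, tmp_dir), check=True)
```

For example, `run_linclust("METAG_out.fasta", "METAG_nr", "tmp")` should leave the cluster representatives in a file named after the prefix (to my knowledge `METAG_nr_rep_seq.fasta`); check the MMseqs2 output listing for your version.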