High level of duplicated protein sequences

Hi, I am using PLASS (v4.687d7) on a set of metagenomes from ~100 cheese samples. It works very well, but I have two questions. First, in each dataset a large fraction of the protein sequences (30% on average) are exact duplicates (100% identity and 100% coverage). I understand that some sequences can be duplicated, for example when they originate from closely related species, but 30% seems quite high.

The second issue is the total amount of assembled amino acids. As an example, from an initial dataset of 18 million reads (2x150 bp paired-end, 2.7 Gbp in total), 7 million proteins are assembled, for 2e+9 aa in total. That is almost as many amino acids as there are input nucleotides, which seems to me far more than expected. Is there an explanation for these results?
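To make the second point concrete, here is a quick back-of-the-envelope check using only the figures quoted above. The "single-frame bound" is my own simplifying assumption for illustration (every input nucleotide translated exactly once, in one frame, with no overlap between proteins); it is not something PLASS itself reports.

```python
# Figures quoted in the post.
n_reads = 18_000_000      # total reads (2x150 bp paired-end)
read_len_bp = 150
n_proteins = 7_000_000    # assembled proteins
total_aa = 2e9            # assembled amino acids

total_bp = n_reads * read_len_bp        # input nucleotides
mean_protein_len = total_aa / n_proteins
aa_per_bp = total_aa / total_bp

# Illustrative assumption: each nucleotide contributes to exactly one codon.
single_frame_bound = total_bp / 3
excess = total_aa / single_frame_bound

print(f"input nucleotides:        {total_bp:.2e} bp")
print(f"mean protein length:      {mean_protein_len:.0f} aa")
print(f"assembled aa per input bp: {aa_per_bp:.2f}")
print(f"ratio to single-frame bound: {excess:.2f}x")
```

Under that assumption the assembly contains roughly twice as many amino acids as a single non-redundant translation of the input could yield, which is why the total looks high to me.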

I am running PLASS with the following command (all other parameters left at their defaults): plass assemble METAG_R1.fastq.gz METAG_R2.fastq.gz METAG_out.fasta -e 0.001 --num-iterations 12 --filter-proteins 1 --remove-tmp-files 1
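For reference, this is roughly how the duplicate fraction can be measured on the output file. It is a minimal sketch, not part of PLASS: the function names `read_fasta` and `duplicate_stats` are my own, and "duplicate" here means an exact string match over the full sequence. It keeps every distinct sequence in memory, so for an output of this size hashing each sequence first would be lighter.

```python
import sys
from collections import Counter


def read_fasta(handle):
    """Yield sequences (headers discarded) from an iterable of FASTA lines."""
    chunks = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                yield "".join(chunks)
            chunks = []
            seen_header = True
        elif line and seen_header:
            chunks.append(line)
    if seen_header:
        yield "".join(chunks)


def duplicate_stats(handle):
    """Count exact-duplicate sequences in a FASTA stream."""
    counts = Counter(read_fasta(handle))
    records = sum(counts.values())
    distinct = len(counts)
    return {
        "records": records,
        "distinct": distinct,
        # Copies beyond the first occurrence of each sequence.
        "redundant": records - distinct,
        # Records whose sequence occurs more than once.
        "in_duplicated_groups": sum(c for c in counts.values() if c > 1),
        "total_aa": sum(len(seq) * c for seq, c in counts.items()),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python duplicate_stats.py METAG_out.fasta
    with open(sys.argv[1]) as fh:
        stats = duplicate_stats(fh)
    for key, value in stats.items():
        print(f"{key}\t{value}")
    if stats["records"]:
        print(f"redundant_fraction\t{stats['redundant'] / stats['records']:.3f}")
```

The two counts answer slightly different questions: `redundant` is how many records would disappear after exact deduplication, while `in_duplicated_groups` is how many records have at least one identical twin.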

Thanks,
Helene
