soedinglab / plass

sensitive and precise assembly of short sequencing reads
https://plass.mmseqs.com
GNU General Public License v3.0

Some Issues about the length of protein sequences #31

Open susutBu opened 3 years ago

susutBu commented 3 years ago

Hi there,

First, thank you for this excellent tool for assembling short-read sequencing data at the protein level; it has greatly improved how many of our reads we can use. While running plass assemble, two things puzzled me. First, I used --min-length to control the minimum length (in residues) of the output sequences. Unfortunately, the output was empty, even though the value was only 100. Second, when I checked the lengths of the output sequences, I found that many are longer than 5000 residues, which seems abnormal. How can we prevent this from happening?

The command I used for the assembly is as follows:

plass assemble --threads 32 --min-seq-id 0.99 clean_reads/ERR_YZYC_1.fastq clean_reads/ERR_YZYC_2.fastq ERR_YZYC_assembly.fas ERR_YZYCt

Plass Version: c4aaa9803a5073b256decd60e15f4d64774e16fc

Insitu_prot_4747305 len:8946
Insitu_prot_4748790 len:5383
Insitu_prot_4882950 len:3398
......

milot-mirdita commented 3 years ago

The --min-length parameter is inherited from MMseqs2, and it is confusingly named even there. It controls the minimum length of the ORFs that are extracted for assembly. You should only change it if you have very short reads (e.g., with 75 bp reads, reduce it to maybe 25 or even less).
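A back-of-the-envelope sketch of why a threshold of 100 can leave nothing to assemble. It assumes the ORFs are extracted from individual, unmerged reads, which is an assumption rather than a confirmed detail of Plass; `max_orf_residues` is a hypothetical helper, not part of Plass or MMseqs2.

```python
def max_orf_residues(read_length_bp: int) -> int:
    """Upper bound on residues a single read can encode (3 nucleotides per codon)."""
    return read_length_bp // 3


# Assumption: each ORF comes from one read, so a read can never yield
# an ORF longer than this bound.
for read_length in (75, 150):
    print(f"{read_length} bp read -> at most {max_orf_residues(read_length)} residues")
```

Under that assumption, a 150 bp read encodes at most 50 residues, so an ORF length threshold of 100 would reject every read before assembly starts.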

The parameter you want is probably --min-contig-len. After assembly, this parameter rejects all contigs that are too short.

No idea about the super long proteins though. Could you post the sequences?

susutBu commented 3 years ago

Thank you for your reply. First, I don't think the parameter I need is --min-contig-len, because the command I used is plass assemble, and --min-contig-len belongs to plass nuclassemble. I want a parameter that controls the length of the assembled proteins, as described in your Nature Methods paper: "We ignored all proteins shorter than 100 residues". Is this not controlled by --min-contig-len? Also, the reads used for the protein assembly are 2 × 150 bp paired-end sequences, which I think is normal. The attached file contains the sequences of the super-long proteins: Insitu_plass99_5k.txt
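Independent of which Plass option applies, a length cutoff such as "ignore all proteins shorter than 100 residues" can also be applied to the assembled FASTA afterwards. The following is a minimal sketch, not a Plass feature: `read_fasta` and `filter_by_length` are hypothetical helpers, the optional upper bound of 5000 is only an illustration, and a trailing `*` is not counted in case the output marks stop codons that way (an assumption).

```python
import io
from typing import Iterable, Iterator, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]],
    min_len: int = 100,
    max_len: Optional[int] = None,
) -> Iterator[Tuple[str, str]]:
    """Keep records whose residue count lies within [min_len, max_len]."""
    for header, seq in records:
        n = len(seq.rstrip("*"))  # assumption: a trailing '*' is a stop marker
        if n < min_len:
            continue
        if max_len is not None and n > max_len:
            continue
        yield header, seq


# Small in-memory demonstration with made-up sequences.
demo = io.StringIO(">short\nMKV\n>ok\n" + "A" * 120 + "\n>long\n" + "A" * 6000 + "\n")
kept = [header for header, _ in filter_by_length(read_fasta(demo), 100, 5000)]
print(kept)  # ['ok']
```

For a real assembly, the same functions could be applied to an opened file such as ERR_YZYC_assembly.fas instead of the in-memory demo. Leaving `max_len` unset keeps the long sequences, so they can still be inspected separately.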