soedinglab / plass

sensitive and precise assembly of short sequencing reads
https://plass.mmseqs.com
GNU General Public License v3.0
149 stars 14 forks source link

Empty (len:0) sequences in plass output #33

Open aleixop opened 3 years ago

aleixop commented 3 years ago

Expected Behavior

Do not write sequences in the output shorter than --min-length, which is 45 aa by default.

Current Behavior

Sequences shorter than --min-length are being written in the output, even empty ones (len:0).

Steps to Reproduce (for bugs)

plass \
assemble \
 ${FW} \
 ${RV} \
 ${FASTA_OUT} \
 ${TMP_DIR}

Plass Output (for bugs)

General output: https://gist.github.com/aleixop/76bd8e2fc4e9a88ba7072f470abbc600

Context

Co-assembly of ~300M PE reads with default parameters that runs smoothly without errors. 250 Gb of RAM and 48 cpus.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

martin-steinegger commented 3 years ago

--min-length regulates the min frame size, which will be considered for assembly. It does not affect the reported assembly length.

aleixop commented 3 years ago

Thanks for the quick response! Then, which is the minimum reported length by default and why are there empty sequences? Shouldn't these be filtered out prior to writing the final assembly?

milot-mirdita commented 3 years ago

Does it report a FASTA header and then no residues after the header? We shouldn't be able to produce 0-length sequences, I am not sure why that would happen. Do you have a small(er) set to reproduce this behavior?

We should probably also add a parameter to filter out sequences under a given length.

AnnSeidel commented 3 years ago

I already introduced a parameter for that in the nucleassemble command: --min-contig-len

We could simply adopt it, or?

I also ran a small example where I could reproduce the error of getting sequences of length 0, header: len:0 sequence:\0\n

I only had a short look at the code so far, but it seems something with the db output goes wrong:

filtercoding can produce sequences of length 0 when it finds non coding sequences, but this is not the actual problem, because the ‘only assembled’ filter should throw such sequences away afterwards anyway. BUT there seems to be a problem with newlines in the sequence db file outputed in the current version, confusing the awk command for checking start and stop codons in the following, if there exits more than one sequences per line. In this way sometimes the wrong sequences are sorted out and therefore we sometimes keep the sequences of length 0.

I did not figure out the reason for the missing newlines in the db output, @martin-steinegger @milot-mirdita can one of you check that? It is not the case for all but only for some sequences. Or is that even desired, but then it still does not fit with the awk command for finding start and stop codons.