Open aleixop opened 3 years ago
--min-length
regulates the min frame size, which will be considered for assembly. It does not affect the reported assembly length.
Thanks for the quick response! Then, which is the minimum reported length by default and why are there empty sequences? Shouldn't these be filtered out prior to writing the final assembly?
Does it report a FASTA header and then no residues after the header? We shouldn't be able to produce 0-length sequences, I am not sure why that would happen. Do you have a small(er) set to reproduce this behavior?
We should probably also add a parameter to filter out sequences under a given length.
I already introduced a parameter for that in the nucleassemble command: --min-contig-len
We could simply adopt it, or?
I also ran a small example where I could reproduce the error of getting sequences of length 0, header:
I only had a short look at the code so far, but it seems something with the db output goes wrong:
filtercoding can produce sequences of length 0 when it finds non coding sequences, but this is not the actual problem, because the ‘only assembled’ filter should throw such sequences away afterwards anyway. BUT there seems to be a problem with newlines in the sequence db file outputed in the current version, confusing the awk command for checking start and stop codons in the following, if there exits more than one sequences per line. In this way sometimes the wrong sequences are sorted out and therefore we sometimes keep the sequences of length 0.
I did not figure out the reason for the missing newlines in the db output, @martin-steinegger @milot-mirdita can one of you check that? It is not the case for all but only for some sequences. Or is that even desired, but then it still does not fit with the awk command for finding start and stop codons.
Expected Behavior
Do not write sequences in the output shorter than
--min-length
, which is 45 aa by default.Current Behavior
Sequences shorter than
--min-length
are being written in the output, even empty ones (len:0).Steps to Reproduce (for bugs)
Plass Output (for bugs)
General output: https://gist.github.com/aleixop/76bd8e2fc4e9a88ba7072f470abbc600
Context
Co-assembly of ~300M PE reads with default parameters that runs smoothly without errors. 250 Gb of RAM and 48 cpus.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
5d03cce371dc51c23652a251550c33fd0358690d