rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

GFA reports contig with length 0 #315

Open SMorrison42 opened 1 year ago

SMorrison42 commented 1 year ago

Hi, I'm using Unicycler in my pipeline, and I parse the .gfa file it produces with the gfa Python package to help assess the final assembly. In a few of my .gfa files, a contig is reported with length 0 and no sequence. Is there a parameter I can use to exclude zero-length contigs from the GFA output?

mikeyweigand commented 1 year ago

Here's a specific example: output from an assembly of Illumina NovaSeq reads from monkeypox virus. Note that segment 12 has no sequence (length = 0 bp).

cat L*A13.assembly.gfa | grep ^S | cut -f1,2,4,5 | column -t
S  1   LN:i:151789  dp:f:1.0
S  2   LN:i:14767   dp:f:0.9675450839707439
S  3   LN:i:11567   dp:f:0.959489622964646
S  4   LN:i:5777    dp:f:0.9215721186123447
S  5   LN:i:4676    dp:f:1.9657676335501226
S  6   LN:i:1624    dp:f:2.023590772062553
S  7   LN:i:268     dp:f:2.1558008320217863
S  8   LN:i:24      dp:f:1.6570576416128175
S  9   LN:i:16      dp:f:20.642039801793622
S  10  LN:i:9       dp:f:27.790311595329598
S  11  LN:i:2       dp:f:0.6247140887830516
S  12  LN:i:0       dp:f:0.7382984685617883

Yet the graph includes linkages between 12 and other segments:

cat L*A13.assembly.gfa | grep ^L | grep 12
L       12      +       8       +       0M
L       12      +       11      +       0M
L       11      +       12      +       0M
L       12      -       8       +       0M
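
The two shell checks above can also be done in one pass without a GFA library. The following is a minimal sketch in plain Python, not a Unicycler option. It assumes tab-separated GFA1 records with the sequence in the third column and an optional `LN:i:` tag. The function names are hypothetical, and the sample lines use placeholder bases because the real sequences are not shown in this thread.

```python
def zero_length_segments(gfa_lines):
    """Return the names of S records whose length is 0.

    Length comes from the LN:i: tag when present, otherwise from the
    sequence column. A '*' sequence without an LN tag has unknown length
    and is not reported.
    """
    empty = set()
    for line in gfa_lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue
        name, seq = fields[1], fields[2]
        ln_tags = [f for f in fields[3:] if f.startswith("LN:i:")]
        if ln_tags:
            length = int(ln_tags[0][len("LN:i:"):])
        elif seq != "*":
            length = len(seq)
        else:
            length = None  # length not recorded
        if length == 0:
            empty.add(name)
    return empty


def links_touching(gfa_lines, names):
    """Return L records (as field lists) whose 'from' or 'to' segment is in names."""
    hits = []
    for line in gfa_lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] == "L" and len(fields) >= 6:
            if fields[1] in names or fields[3] in names:
                hits.append(fields)
    return hits


# Hypothetical sample modelled on the records above; bases are placeholders.
example_lines = [
    "S\t8\t" + "A" * 24 + "\tLN:i:24\tdp:f:1.66",
    "S\t11\tAC\tLN:i:2\tdp:f:0.62",
    "S\t12\t\tLN:i:0\tdp:f:0.74",
    "L\t12\t+\t8\t+\t0M",
    "L\t12\t+\t11\t+\t0M",
    "L\t11\t+\t12\t+\t0M",
    "L\t12\t-\t8\t+\t0M",
]

empty_names = zero_length_segments(example_lines)
for link in links_touching(example_lines, empty_names):
    print("\t".join(link))
```

To run it on a real file, pass an open file handle twice (or read the lines into a list first). Segment names are compared exactly, so a lookup for segment 12 does not also pick up names that merely contain "12".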