Hi Ole,
Thanks for reporting this. I've seen similar issues before, and the problem is that the path is not terminated at the circular contig. I'll try to push a fix soon.
In the meantime you can try to tweak the parameters to avoid the circular contig, e.g. by increasing/reducing -s and -q. If that doesn't solve it you can also try different -b, -F, and -f.
Thank you Markus.
Both -s and -q are at their defaults (20000 and 60). I could try increasing -s to see what effect that has. However, scaffolding now takes about 14 hours, so testing will take time.
Might changing -b and -F also reduce the time spent scaffolding? I'll try changing them a bit and report back.
Ole
Choosing -s depends on the fragment sizes of your input DNA. If you have very long DNA fragments, you can increase -s. That should boost the number of barcodes differentiating regions and may give you better resolution in the link graph, with fewer spurious edges. On the other hand, a shorter -s may give linkage information about contigs that are physically distant from each other, so it's worth experimenting.
As for computing time, there is unfortunately not much you can do parameter-wise at the moment. The time depends primarily on the number of contigs in your input assembly and secondarily on your read coverage. You could try downsampling your read mappings, at the risk of losing some information in exchange for reduced computing time. Another option, if you're not interested in anchoring short contigs, would be to filter your input assembly to remove contigs shorter than a given length. That way you will get more gaps in your output scaffolds, but the time spent in the scaffolding step should be reduced. Keep in mind that in that case you would also have to remove mappings to those contigs from your BAM file.
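For the length-filtering option, here is a minimal Python sketch. It is not part of ARBitR; the function names `read_fasta` and `split_by_length` and the `min_len` cutoff are made up for illustration, and it assumes a plain, uncompressed FASTA file.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    The name is the first whitespace-delimited token of the header.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_length(records, min_len):
    """Split records into (kept, dropped) using a hypothetical length cutoff."""
    kept, dropped = [], []
    for name, seq in records:
        if len(seq) >= min_len:
            kept.append((name, seq))
        else:
            dropped.append((name, seq))
    return kept, dropped
```

The names in `kept` are the contigs to retain in the filtered assembly; the same list can be used to restrict the BAM file to mappings on those contigs with whatever tool you normally use for that.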
Great, thank you for your answer.
I certainly want to place as many small contigs as possible (as long as they are placed accurately), so I'll just wait and see what works.
The v0.2 update includes a bug fix that I believe should remedy the problem of duplicated contigs. Also, multiprocessing is now supported for the scaffolding step, which should hopefully bring computing time down to more manageable levels.
I'll close this issue now; please reopen it if the problem is still there.
Hi Markus, I'm testing ARBitR against different versions of my assembly. However, the final assembly is often larger than the input. For instance, the input assembly was 688 Mbp, and the output was 711 Mbp. I thought that was strange. I mapped the input and the output against each other using minimap2 and got this:
The ARBitR assembly is the subject, with scaffold_8, scaffold_49 and scaffold_169. The input assembly has scaffold_8010. Here, you can see that scaffold_8010 has been included twice in the ARBitR assembly, each time at its full length of 578235 bp.
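A check like this can be scripted from minimap2's PAF output. The following is only a rough sketch: it assumes the input assembly was the query and the ARBitR assembly the target, and the function name `full_length_hits` and the `min_frac` threshold are made up for illustration. It only counts single alignment records that span nearly the whole query contig, so a duplicate whose alignment is broken into several pieces would be missed.

```python
from collections import defaultdict


def full_length_hits(paf_lines, min_frac=0.95):
    """Return query contigs with more than one near-full-length alignment.

    Uses the standard PAF columns: 1 query name, 2 query length,
    3 query start, 4 query end, 6 target name, 8 target start, 9 target end.
    """
    hits = defaultdict(list)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname, qlen = fields[0], int(fields[1])
        qstart, qend = int(fields[2]), int(fields[3])
        tname, tstart, tend = fields[5], int(fields[7]), int(fields[8])
        if qend - qstart >= min_frac * qlen:
            hits[qname].append((tname, tstart, tend))
    # Keep only contigs placed (nearly) whole in more than one location.
    return {q: h for q, h in hits.items() if len(h) > 1}
```

Any contig returned here is a candidate for having been placed twice in the output, as scaffold_8010 was.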
This is the corresponding backbone.gfa:
L scaffold_8010 - contig_3241 +
L contig_3211 + scaffold_8010 -
L scaffold_8010 + contig_3211 +
L contig_5534 + scaffold_8010 -
L contig_3241 - scaffold_8010 +
L contig_3211 - scaffold_8010 -
L scaffold_8010 + contig_5534 -
L scaffold_8010 + contig_3211 -
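One way to spot contigs like this in a backbone.gfa is to count how many distinct links attach to each contig end. The sketch below is not ARBitR code. It assumes the L records follow the usual GFA orientation convention (a link leaves the first segment through its right end if its orientation is +, through its left end if -, and enters the second segment through its left end if +, through its right end if -) and that fields are whitespace-separated as shown above. The function names are made up for illustration.

```python
from collections import defaultdict


def link_ends(from_orient, to_orient):
    """Map link orientations to the contig ends they join (assumed GFA semantics)."""
    from_end = "R" if from_orient == "+" else "L"
    to_end = "L" if to_orient == "+" else "R"
    return from_end, to_end


def branching_ends(gfa_lines):
    """Return {(contig, end): n_links} for contig ends with more than one link."""
    edges_at = defaultdict(set)
    for line in gfa_lines:
        fields = line.split()
        if len(fields) < 5 or fields[0] != "L":
            continue  # skip anything that is not a link record
        src, src_o, dst, dst_o = fields[1:5]
        src_end, dst_end = link_ends(src_o, dst_o)
        # A link and its reverse-complement twin join the same two ends,
        # so an unordered pair of ends collapses them into one edge.
        edge = frozenset([(src, src_end), (dst, dst_end)])
        edges_at[(src, src_end)].add(edge)
        edges_at[(dst, dst_end)].add(edge)
    return {end: len(edges) for end, edges in edges_at.items() if len(edges) > 1}
```

Under that convention, the eight records above collapse to four distinct links, three of which attach to the right end of scaffold_8010 (to both ends of contig_3211 and to contig_5534). A contig end in an unambiguous linear path would have at most one.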
If I look at the pre-merge.paths.txt:
These two are quite similar. Indeed, if I map them against each other, only 65 kbp of scaffold_169 is not in scaffold_8 (both are longer than 7 Mbp).
How do I avoid cases like this?
Thank you.
Ole