markhilt / ARBitR

ARBitR: Assembly Refinement with Barcode-identity-tagged Reads
Other
10 stars 1 forks source link

Sequences occur more than once in the output #2

Closed olekto closed 4 years ago

olekto commented 4 years ago

Hi Markus, I'm testing ARBitR against different versions of my assembly. However, the final assembly is often larger than the input. For instance, the input assembly was 688 Mbp, and the output was 711 Mbp. That was strange I thought. I mapped the input and the output against each other using minimap2 and got this:

scaffold_8  7150795 2288900 2867235 -   scaffold_8010   578385  0   578335  578235  578235  60  NM:i:100    ms:i:578135 AS:i:578135 nn:i:100    tp:A:P  cm:i:50580  s1:i:518937 s2:i:23132  de:f:0  rl:i:740701 cs:Z::578335
scaffold_49 11269021    10570709    10571427    -   scaffold_8010   578385  530867  531484  610 618 28  NM:i:108    ms:i:298    AS:i:270    nn:i:10tp:A:P   cm:i:43 s1:i:450    s2:i:443    de:f:-0.1753    rl:i:1163772    cs:Z::80+aa:88*ct:8*gc:45*tc:34+nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn*tn:184*gt:2*gc:104*ta:65
scaffold_169    7216630 2354735 2933070 -   scaffold_8010   578385  0   578335  578235  578235  60  NM:i:100    ms:i:578135 AS:i:578135 nn:i:100    tp:A:P  cm:i:50580  s1:i:518937 s2:i:23132  de:f:0  rl:i:744527 cs:Z::578335

The ARBitR assembly is the subject, with scaffold_8, scaffold_49 and scaffold_169. The input assembly has scaffold_8010. Here, you can see that scaffold_8010 has been included twice in the ARBitR assembly, in its whole 578235 bp length.

This is the corresponding backbone.gfa: L scaffold_8010 - contig_3241 + L contig_3211 + scaffold_8010 - L scaffold_8010 + contig_3211 + L contig_5534 + scaffold_8010 - L contig_3241 - scaffold_8010 + L contig_3211 - scaffold_8010 - L scaffold_8010 + contig_5534 - L scaffold_8010 + contig_3211 -

If I look at the pre-merge.paths.txt:

8   [start: contig_8098e, target: contig_4707s, connections: [], start: contig_4707e, target: contig_7789s, connections: ['contig_2054'], start: contig_7789e, target: contig_7773s, connections: [], start: contig_7773e, target: contig_5534s, connections: [], start: contig_5534e, target: scaffold_8010e, connections: ['contig_3211'], start: scaffold_8010s, target: contig_3241s, connections: [], start: contig_3241e, target: contig_3582s, connections: [], start: contig_3582e, target: contig_2101s, connections: ['contig_2100'], start: contig_2101e, target: contig_7801s, connections: ['contig_2100'], start: contig_7801e, target: contig_8203e, connections: ['contig_3922', 'contig_3920'], start: contig_8203s, target: contig_7560e, connections: [], start: contig_7560s, target: contig_7827e, connections: ['contig_3638', 'contig_3639', 'contig_7562'], start: contig_7827s, target: contig_1322e, connections: [], start: contig_1322s, target: contig_2866s, connections: [], start: contig_2866e, target: contig_1751s, connections: ['contig_2864']]
169 [start: contig_8100s, target: contig_8098s, connections: [], start: contig_8098e, target: contig_4707s, connections: [], start: contig_4707e, target: contig_7789s, connections: ['contig_2054'], start: contig_7789e, target: contig_7773s, connections: [], start: contig_7773e, target: contig_5534s, connections: [], start: contig_5534e, target: scaffold_8010e, connections: ['contig_3211'], start: scaffold_8010s, target: contig_3241s, connections: [], start: contig_3241e, target: contig_3582s, connections: [], start: contig_3582e, target: contig_2101s, connections: ['contig_2100'], start: contig_2101e, target: contig_7801s, connections: ['contig_2100'], start: contig_7801e, target: contig_8203e, connections: ['contig_3922', 'contig_3920'], start: contig_8203s, target: contig_7560e, connections: [], start: contig_7560s, target: contig_7827e, connections: ['contig_3638', 'contig_3639', 'contig_7562'], start: contig_7827s, target: contig_1322e, connections: [], start: contig_1322s, target: contig_2866s, connections: [], start: contig_2866e, target: contig_1751s, connections: ['contig_2864']]

These two are quite similar. Indeed, if I map them against each other, scaffold_169 has only 65 kbp not in scaffold_8 (both are longer than 7 Mbp).

How do I avoid cases like this?

Thank you.

Ole

markhilt commented 4 years ago

Hi Ole,

Thanks for reporting this. I've seen similar issues before, and the problem is that the path is not terminated at the circular contig. I'll try to push a fix soon.

In the meantime you can try to tweak the parameters to avoid the circular contig, e.g. by increasing/reducing -s and -q. If that doesn't solve it you can also try different -b, -F, and -f.

olekto commented 4 years ago

Thank you Markus.

Both -s and -q are default (20000 and 60). I could try increasing -s to see what effects that have. However, scaffolding now takes about 14 hours, so it will take time testing.

Changing -b and -F might also reduce the time spent scaffolding? I'll try changing a bit and I'll report back.

Ole

markhilt commented 4 years ago

Choosing -s depends on the fragment sizes of your input DNA. If you had very long DNA you can increase -s - that should boost the number of barcodes differentiating regions and may give you better resolution in the link graph with fewer spurious edges. On the other hand, shorter -s may give linkage information about contigs that are physically distant to each other, so it's worth experimenting.

About the computing time there is unfortunately not much to do at the moment parameter-wise. The time will depend primarily on the number of contigs in your input assembly and secondarily on your read coverage. You could try downsampling your read mappings and risk losing some information for the benefit of reduced computing time. Another option, if you're not interested in anchoring short contigs, would be to filter your input assembly to remove contigs shorter than a given value. That way you will get more gaps in your output scaffolds but the time spent in the scaffolding step should be reduced. Keep in mind that you would also have to remove mappings to those contigs from your bam file in that case.

olekto commented 4 years ago

Great, thank you for your answer.

I certainly want to place as many small contigs as possible (as long as they are placed accurately), so I'll just wait and see what works.

markhilt commented 4 years ago

The v0.2 update includes a bug fix that I believe should remedy the problem of duplicated contigs. Also, multiprocessing is now supported for the scaffolding step which should bring down computing time hopefully to more managable levels.

I'll now close this issue, please reopen if the problem is still there.