lh3 / miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)
MIT License

Explicitly add circularising links to unitig GFA #74

Closed rrwick closed 4 years ago

rrwick commented 4 years ago

This is a small change to the unitig GFA output. Instead of relying only on the last character of the unitig name to indicate circularity (l for linear and c for circular), this change gives each circular unitig an explicit zero-overlap (0M) link to itself in the GFA file.

Before:

S       utg000030c      GGAACAAACTGCATTAATTC...TATCACCAGAAAAGACAGAT LN:i:69013
a       utg000030c      0       0ab4d3ed-f533-4bfd-880c-fa1a0568443b:17-68951    +       3431
a       utg000030c      3431    1e81464e-9194-40f4-ba59-d29f9b1a7d65:33-68891    +       16745
a       utg000030c      20176   fb2dd196-c8bb-4770-b693-d02329c1dab2:1-67957     +       24

After:

S       utg000030c      GGAACAAACTGCATTAATTC...TATCACCAGAAAAGACAGAT LN:i:69013
L       utg000030c      +       utg000030c      +       0M
L       utg000030c      -       utg000030c      -       0M
a       utg000030c      0       0ab4d3ed-f533-4bfd-880c-fa1a0568443b:17-68951    +       3431
a       utg000030c      3431    1e81464e-9194-40f4-ba59-d29f9b1a7d65:33-68891    +       16745
a       utg000030c      20176   fb2dd196-c8bb-4770-b693-d02329c1dab2:1-67957     +       24

This change is mainly to make miniasm assemblies for bacterial genomes (in which circular unitigs are common) display as circular when loaded in Bandage.
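For GFA files produced without this change, roughly the same result can be approximated by post-processing. The following is a minimal sketch, not the code in this pull request (which modifies miniasm's C source). It assumes tab-separated GFA lines and the naming convention described above, where a unitig name ending in c marks a circular unitig. The function name `add_circular_links` is hypothetical, and the sketch does not check whether the links are already present.

```python
def add_circular_links(gfa_lines):
    """Yield GFA lines, adding 0M self-links after each circular unitig.

    Assumption: a segment (S) line whose name ends in 'c' is circular,
    following miniasm's unitig naming convention.
    """
    for line in gfa_lines:
        line = line.rstrip("\n")
        yield line
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0] == "S" and fields[1].endswith("c"):
            name = fields[1]
            # One link per strand, matching the "After" output above.
            yield f"L\t{name}\t+\t{name}\t+\t0M"
            yield f"L\t{name}\t-\t{name}\t-\t0M"


if __name__ == "__main__":
    example = [
        "S\tutg000030c\tGGAACAAACTGC\tLN:i:12",
        "a\tutg000030c\t0\tread1:1-12\t+\t12",
        "S\tutg000001l\tACGT\tLN:i:4",
    ]
    for out_line in add_circular_links(example):
        print(out_line)
```

Only the circular segment gains links; the linear unitig and the a lines pass through unchanged.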

Thanks so much for this great tool (and all your others), and let me know if you have any questions!

Ryan

lh3 commented 4 years ago

Thanks. As I remember, in the miniasm GFA, we only see

L       utg000030c      +       utg000030c      +       0M

but not

L       utg000030c      -       utg000030c      -       0M

I could be wrong. Haven't looked at the miniasm source code for a long time...

rrwick commented 4 years ago

For edges between linear contigs in the miniasm graph, both directions are shown, e.g.:

L   utg000013l  +   utg000036l  +   60794M  SD:i:46302
L   utg000036l  -   utg000013l  -   60576M  SD:i:62899

Which is why I included both directions for the circularising links as well. But yes, it is redundant, so only including one direction would be fine too!
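To illustrate why the second direction is redundant, here is a small sketch that computes the complementary link: swap the two segments and flip both orientations. It compares only segment names and orientations, since the overlap lengths and SD tags differ between the two directions in the example above. The helper names `link_key` and `complement` are hypothetical, not part of miniasm.

```python
FLIP = {"+": "-", "-": "+"}


def link_key(line):
    """Return (from, from_orient, to, to_orient) for a GFA L line."""
    _, a, a_orient, b, b_orient = line.split()[:5]
    return (a, a_orient, b, b_orient)


def complement(key):
    """Return the same edge as seen from the opposite strand."""
    a, a_orient, b, b_orient = key
    return (b, FLIP[b_orient], a, FLIP[a_orient])


if __name__ == "__main__":
    forward = link_key("L utg000013l + utg000036l + 60794M SD:i:46302")
    reverse = link_key("L utg000036l - utg000013l - 60576M SD:i:62899")
    print(complement(forward) == reverse)

    circ_plus = link_key("L utg000030c + utg000030c + 0M")
    circ_minus = link_key("L utg000030c - utg000030c - 0M")
    print(complement(circ_plus) == circ_minus)
```

Both comparisons hold: the - / - self-link is the complement of the + / + one, in the same way the two links between the linear unitigs are complements of each other.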

lh3 commented 4 years ago

I see. Then your change is consistent. Thanks.