rvaser / rala

Layout module for raw de novo genome assembly of long uncorrected reads.
MIT License

Supporting GFAv1 #2

Closed: ekg closed this issue 3 years ago

ekg commented 6 years ago

Could you easily produce GFAv1?

rvaser commented 6 years ago

I pushed a simple GFAv1 graph output to master. It is not enabled by default, but you can put the call anywhere in the code. For now I only print reads and unitigs with their lengths and read counts (without their sequences), and the links between them (without CIGAR strings). I drew two graphs in Bandage and it seems to work (both are from the same E. coli dataset, at different steps of the assembly). Do you need more information, or will this suffice? Do you want a command-line argument that enables GFA output after each step?

[Two images: Bandage renderings of the two graphs]

ekg commented 6 years ago

Thanks! That looks cool.

My objective is to obtain a single file that captures the full information and sequences from the assembly. For my use, I need a blunt-ended bidirectional string graph with sequences in the nodes. If the graph is formatted as an overlap graph, then the CIGAR strings on the links should describe the approximate length of the overlap.

I want to use the graph in vg, which has a more restricted interpretation of sequence graphs: they are not approximate, and are meant to precisely encode regular languages that describe the information in the input to the assembly.

rvaser commented 6 years ago

I added sequences and cigar strings so now the output looks like:

S [name] [sequence] LN:i:[length] RC:i:[one or number of reads in unitig]
L [source] [source orientation] [destination] [destination orientation] [overlap length]M
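
A minimal sketch of how records in this layout could be written and read back, in Python. The helper names (`segment_line`, `link_line`, `parse_link`) are hypothetical and not part of rala. It assumes that fields are tab-separated, as GFAv1 requires, and that the overlap is a single `[overlap length]M` CIGAR operation as described above.

```python
def segment_line(name, sequence, read_count=1):
    """Format an S record with LN (length) and RC (read count) tags.

    read_count is 1 for a single read, or the number of reads in a unitig.
    """
    return "\t".join(
        ["S", name, sequence, f"LN:i:{len(sequence)}", f"RC:i:{read_count}"]
    )


def link_line(src, src_orient, dst, dst_orient, overlap_len):
    """Format an L record whose CIGAR is a single match run, e.g. 1200M."""
    for orient in (src_orient, dst_orient):
        if orient not in ("+", "-"):
            raise ValueError(f"orientation must be '+' or '-', got {orient!r}")
    return "\t".join(["L", src, src_orient, dst, dst_orient, f"{overlap_len}M"])


def parse_link(line):
    """Parse an L record of the form above back into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6 or fields[0] != "L":
        raise ValueError("not an L record with an overlap field")
    cigar = fields[5]
    # Only the single-operation '<n>M' form described in this thread is handled.
    if not cigar.endswith("M") or not cigar[:-1].isdigit():
        raise ValueError(f"unsupported overlap CIGAR: {cigar!r}")
    return {
        "source": fields[1],
        "source_orientation": fields[2],
        "destination": fields[3],
        "destination_orientation": fields[4],
        "overlap_length": int(cigar[:-1]),
    }
```

For example, `link_line("1", "+", "2", "-", 1200)` produces the fields `L`, `1`, `+`, `2`, `-`, `1200M` joined by tabs, and `parse_link` recovers an overlap length of 1200 from that line.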

Each link has a complementary pair, e.g. for the link (1+) > (2-) the pair is (2+) > (1-). I hope I understood your requirements.
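
The pairing rule in the example can be sketched as: swap source and destination, and flip both orientations. The Python below is an illustration only; `complement_link` and `links_missing_pair` are hypothetical helpers, not rala functions, and links are represented as plain tuples for simplicity.

```python
def flip(orientation):
    """Return the opposite strand orientation."""
    return {"+": "-", "-": "+"}[orientation]


def complement_link(link):
    """Return the paired link for (src, src_orient, dst, dst_orient).

    The pair runs the other way along the opposite strand:
    (1, +) > (2, -) becomes (2, +) > (1, -).
    """
    src, src_orient, dst, dst_orient = link
    return (dst, flip(dst_orient), src, flip(src_orient))


def links_missing_pair(links):
    """List the links whose complementary pair is absent from the input."""
    present = set(links)
    return [link for link in links if complement_link(link) not in present]
```

A check like `links_missing_pair` could be run on a parsed GFA file to confirm that every link appears together with its pair before the graph is handed to a downstream tool.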

rvaser commented 6 years ago

If you need any assistance with enabling the GFA output or disabling some features (like heuristic graph cuts or preprocessing), let me know!