isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

Add .gfa output to racon in addition to .fasta #49

Open jelber2 opened 6 years ago

jelber2 commented 6 years ago

Hi, I was wondering whether it would be possible for racon to output a GFA file in addition to FASTA?

rvaser commented 6 years ago

Hello, what would be the use case for the GFA output (i.e. what information are you seeking from racon)?

Best regards, Robert

jelber2 commented 6 years ago

Well, I would like to feed a GFA file from racon into the Hi-C scaffolding program SALSA (https://github.com/machinegun/SALSA). Granted, I don't know whether a GFA representation of the assembly would improve the output from SALSA.

My workflow is to overlap PacBio reads with minimap2, assemble them de novo with miniasm, call consensus with racon, feed a GFA file from racon into SALSA, and then polish with pilon: minimap2->miniasm->racon->SALSA->pilon

Alternatively, I could do the following: minimap2->miniasm->SALSA->racon->pilon

rvaser commented 6 years ago

I have tagged this as enhancement and will deal with it soon.

Best regards, Robert

sjackman commented 6 years ago

I'd find this feature useful as well. I'm polishing a Miniasm assembly using Racon. It'd be useful to preserve the graph after polishing with Racon. Consider supporting both GFA 1 and GFA 2.

rvaser commented 6 years ago

How should I preserve the GFA file? Sequences change and alignments might be invalidated.

sjackman commented 6 years ago

The GFA 1 output by Miniasm includes estimates of the amount of overlap, but doesn't include an actual alignment. So I think you could get away with not modifying the edges at all. The edges output by Miniasm look like this:

L   utg000001l  +   utg001226l  +   19386M  SD:i:5467

After the sequences are corrected by Racon, you could realign the two sequences incident to each edge, and it's possible that some of the ambiguities in the graph could be resolved post-Racon.

rvaser commented 6 years ago

It is a bit tedious to add the format into Racon as we only need the S rows. Wouldn't a simple post-processing script be an easier solution? A script that updates the GFA file with polished sequences and maybe realigns edges?

sjackman commented 6 years ago

A post-processing script may be easiest. That script would take in the GFA file produced by Miniasm, the FASTA file produced by Racon, and produce an updated GFA file. Is that script something that you're interested in creating? Or perhaps a task for Gfakluge or GfaPy.

rvaser commented 6 years ago

Well I might add such a script but I am not sure when I will get the time for it :/

sjackman commented 6 years ago

No worries. I'll let you know if I get around to it myself.

rvaser commented 6 years ago

Great, thanks!

mictadlo commented 6 years ago

Any updates?

sjackman commented 6 years ago

Not from me

rvaser commented 6 years ago

Neither from me :/

SamStudio8 commented 6 years ago

I don't suppose anyone had a chance to look at this?

rvaser commented 6 years ago

Unfortunately not :/ I'll try and deal with it later this year.

sjackman commented 6 years ago

I used this AWK script to take the sequences from polished.fasta and the graph from draft.gfa and produce a polished.gfa file.

seqtk seq polished.fasta | gawk -vOFS='\t' 'ARGIND == 1 { id = substr($1, 2); getline; x[id] = $1; next } $1 == "S" && x[$2] { $3 = x[$2] } 1' - draft.gfa >polished.gfa

See also https://github.com/edawson/gfakluge and https://github.com/ggonnella/gfapy/ for manipulating GFA files. I'd still love to see this feature in Racon.
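For anyone without seqtk or gawk at hand, here is a rough Python sketch of the same idea as the one-liner above: read the polished FASTA, then swap the sequence column of each matching S line in the draft GFA. The function names `read_fasta` and `polish_gfa` are made up for illustration and are not part of Racon, Gfakluge, or GfaPy. Two assumptions: the polished FASTA records use the same names as the GFA segments (taken as the header text up to the first whitespace, as in the AWK version), and, unlike the AWK version, an existing `LN:i:` tag is refreshed so it does not go stale when the sequence length changes. Edges are left untouched.

```python
import io


def read_fasta(handle):
    """Return {name: sequence}; name is the header up to the first whitespace."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # multi-line FASTA records are joined
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def polish_gfa(gfa_lines, polished):
    """Yield GFA lines with S-line sequences replaced by polished ones."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3 and polished.get(fields[1]):
            fields[2] = polished[fields[1]]
            # Assumption: keep an existing LN:i: tag consistent with the new length.
            fields[3:] = [
                f"LN:i:{len(fields[2])}" if f.startswith("LN:i:") else f
                for f in fields[3:]
            ]
        yield "\t".join(fields)


# Tiny in-memory demo; real use would open polished.fasta and draft.gfa instead.
demo_fasta = io.StringIO(">utg000001l\nACGTAC\n")
demo_gfa = ["S\tutg000001l\tACGAC\tLN:i:5\n", "L\tutg000001l\t+\tutg000001l\t+\t0M\n"]
for out_line in polish_gfa(demo_gfa, read_fasta(demo_fasta)):
    print(out_line)
```

Segments with no polished counterpart are passed through unchanged, so a partially polished assembly still yields a complete graph.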

MChiaraC commented 5 years ago

So you are basically taking the unpolished assembly graph and the new polished sequences and creating a polished graph? Am I correct?

jelber2 commented 5 years ago

That is what I understand @sjackman's code is doing.

sjackman commented 5 years ago

Yes. I'm working with an assembly graph from Flye or Unicycler whose edges are blunt (no overlap, 0M), so nothing needs realigning. This simple script does not recompute the edge alignments, which graphs from other assemblers would need.

MChiaraC commented 5 years ago

Mmmh, I see, then I cannot use it ...

sjackman commented 5 years ago

You could replace all the CIGAR strings with * (meaning unknown).
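A minimal sketch of that suggestion, in Python: set the overlap field (the sixth tab-separated column) of every L line to `*` and pass everything else through. The function name `blank_overlaps` is hypothetical, and the sketch assumes a tab-separated GFA 1 file in which only L lines carry overlaps that need blanking; containment (C) lines, if present, are not touched.

```python
def blank_overlaps(gfa_lines):
    """Yield GFA 1 lines with the CIGAR of every L line replaced by '*'."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "L" and len(fields) >= 6:
            fields[5] = "*"  # '*' marks the overlap as unknown
        yield "\t".join(fields)


# Demo on an edge shaped like the Miniasm example earlier in the thread.
edge = "L\tutg000001l\t+\tutg001226l\t+\t19386M\tSD:i:5467\n"
for out_line in blank_overlaps([edge]):
    print(out_line)
```

Optional tags such as `SD:i:` are kept as they are; whether they are still meaningful after polishing is a separate question.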

ardy20 commented 3 years ago

Hello Robert, is there any progress or update on .gfa output from Racon?

jelber2 commented 3 years ago

See https://github.com/rrwick/Minipolish

rvaser commented 3 years ago

@ardy20, unfortunately no. Minipolish seems like a decent solution for this issue :)