lh3 / minigraph

Sequence-to-graph mapper and graph generator
https://lh3.github.io/minigraph
MIT License

How to get the GFA format not default rGFA? #26

Closed Xuelei-Dai closed 3 years ago

Xuelei-Dai commented 3 years ago

Hello Li, I have used minigraph to build pangenome graphs, but the output is in rGFA format rather than standard GFA, so it cannot be used as input to vg. How can I get GFA output when building a pangenome graph with minigraph?

Best wishes~

lh3 commented 3 years ago

rGFA is the standard GFA. It is really vg that imposes various vg-specific constraints.

Xuelei-Dai commented 3 years ago

Thank you for your quick reply!

lh3 commented 3 years ago

The vg team knows how to run vg on the minigraph graphs. You may ask them.

Xuelei-Dai commented 3 years ago

Yes! But I have a question: the GFA output of minigraph contains only S and L lines. The GFA format described at https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md looks like the example below, and vg accepts it. How can we get this format?

H   VN:Z:1.0
S   11  ACCTT
S   12  TCAAGG
S   13  CTTGATT
L   11  +   12  -   4M
L   12  -   13  +   5M
L   11  +   13  +   3M
P   14  11+,12-,13+ 4M,5M

Best wishes~

ekg commented 3 years ago

The only restriction on vg GFAs is that the node names are numerical ids > 0. You can use gfautil id-convert (from https://github.com/chfi/rs-gfa-utils) to map between graphs with nodes that have string names and nodes with integer names.
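To illustrate what such an id conversion involves, here is a minimal hand-rolled Python sketch. It is not how gfautil is implemented; the function name `renumber_gfa` is hypothetical, and it assumes tab-separated GFA1 with only S, L, and P lines referring to segment names.

```python
def renumber_gfa(lines):
    """Rename segment names to 1-based integer ids in S, L, and P lines.

    Returns the rewritten lines and the old-name -> new-id mapping.
    Assumption: every segment used by an L or P line has its own S line.
    """
    ids = {}
    # First pass: assign ids in order of S-line appearance.
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if f[0] == "S":
            ids.setdefault(f[1], str(len(ids) + 1))

    out = []
    # Second pass: rewrite every field that holds a segment name.
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if f[0] == "S":
            f[1] = ids[f[1]]
        elif f[0] == "L":
            f[1] = ids[f[1]]
            f[3] = ids[f[3]]
        elif f[0] == "P":
            # Each step is a segment name followed by + or -.
            steps = f[2].split(",")
            f[2] = ",".join(ids[s[:-1]] + s[-1] for s in steps)
        out.append("\t".join(f))
    return out, ids
```

Optional tags on S lines (such as the rGFA SN/SO/SR tags) are passed through unchanged. Keeping the returned mapping lets you translate ids back to the original names later.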

The P lines are typically used to represent the mapping of a sequence into the graph. Perhaps you could derive these by mapping your original sequences back to the minigraph? You'd need to convert GAF format to P lines. I know of a few people doing that with GraphAligner output, but working on De Bruijn graphs. It should work here too. The representation will be approximate, but will show how the sequences map through the graph.
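A rough Python sketch of the GAF-to-P-line conversion might look like the following. The function name `gaf_to_p_line` is hypothetical. It assumes the GAF path column (column 6) uses the oriented form `>s1<s2`, and it ignores the start and end offsets within the first and last segments, which is one reason the result is only approximate.

```python
import re

# One oriented step in a GAF path: '>' (forward) or '<' (reverse), then the segment name.
STEP = re.compile(r"([<>])([^<>]+)")


def gaf_to_p_line(gaf_line):
    """Turn one GAF record into a GFA1 P line, or None if the path is not oriented."""
    f = gaf_line.rstrip("\n").split("\t")
    name, strand, path = f[0], f[4], f[5]
    steps = STEP.findall(path)
    if not steps:
        # Path given as a stable sequence name; not handled in this sketch.
        return None
    walk = [(seg, "+" if o == ">" else "-") for o, seg in steps]
    if strand == "-":
        # Query is reverse-complemented relative to the path:
        # reverse the walk and flip each orientation.
        walk = [(seg, "-" if o == "+" else "+") for seg, o in reversed(walk)]
    # '*' leaves the overlap field unspecified.
    return "\t".join(["P", name, ",".join(seg + o for seg, o in walk), "*"])
```

If a query has several alignment records, this produces several P lines with the same name, so you would need to pick one record or give each a unique name.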

If you only want one P line per input chromosome/contig, and that's the first reference FASTA that you put into the minigraph, then you can represent it losslessly in the final graph with a set of P lines. You could probably derive this on top of the rGFA output, using the reference coordinate information, or patch minigraph to produce it directly.
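As a sketch of deriving reference paths from the rGFA tags, the Python below groups segments by their SN tag, keeps only those with SR:i:0, and orders them by SO. The function name `reference_p_lines` is hypothetical, and the approach assumes that the rank-0 segments of each contig tile the reference in forward orientation and are linked consecutively in the graph.

```python
from collections import defaultdict


def reference_p_lines(gfa_lines):
    """Build one P line per reference contig from rGFA S-line tags."""
    by_contig = defaultdict(list)
    for line in gfa_lines:
        f = line.rstrip("\n").split("\t")
        if f[0] != "S":
            continue
        # Tags look like SN:Z:chr1, SO:i:0, SR:i:0; key on the two-letter name.
        tags = {t[:2]: t[5:] for t in f[3:] if len(t) >= 5}
        if tags.get("SR") != "0" or "SN" not in tags or "SO" not in tags:
            continue  # skip non-reference segments and segments without rGFA tags
        by_contig[tags["SN"]].append((int(tags["SO"]), f[1]))

    out = []
    for contig, segs in by_contig.items():
        # Sort by offset on the contig; reference segments are taken as forward.
        walk = ",".join(name + "+" for _, name in sorted(segs))
        out.append("\t".join(["P", contig, walk, "*"]))
    return out
```

It is worth checking that consecutive segments in each walk are actually joined by an L line before passing the result to downstream tools.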

If you want an exact or lossless version of the graph including all input contigs, you'll need to resolve the base-level relationships using alignments with CIGARs (minimap2 -c or edyeet) and seqwish. If you want a model with a local MSA for every part of the graph, you'd apply the pangenome graph builder, which extends the seqwish induction with partial order alignment, or alternatively a version of cactus that derives the MSA for each part of the minigraph (this is in development, I'm not sure where the code lives...).

Xuelei-Dai commented 3 years ago

Great! Thanks for your advice!