ratschlab / metagraph

Scalable annotated de Bruijn graphs for DNA indexing, alignment, and assembly
http://metagraph.ethz.ch
GNU General Public License v3.0

Export graphs to .gfa format #47

Closed: hmusta closed this issue 4 years ago

trgibbons commented 4 years ago

We've been doing some tests creating graphs from yeast and maize genomes using metagraph, minigraph, and seqwish. So far, metagraph is the only tool to successfully create a graph from maize sequences, but we aren't sure how to inspect or use the output files.

We would greatly appreciate support for one or more standard formats 🙏

chrisbarber commented 4 years ago

Example usage:

./metagraph assemble -o test_graph --unitigs --to-gfa test_graph.dbg

Outputs test_graph.gfa looking like:

S       351710  TAACCCGGCCGCTAGCGCGGGGGCGCTGGGCCCCGCTGGGATCGATGCGGGCGGCCGCGCCGGCTGGGCTCTGCGGGCTGGCACCCGGCCCGGGGCGGGACCCACCTCCGCTTTCGGATCTCCTGAGGGTCCGG
S       333879  GATCTCCTGAGGGTCCGGCAGGAGGTGGCGGCTGCAGCTCTGAGG
L       351708  +       333879  +       0M
L       351709  +       333879  +       0M
L       351710  +       333879  +       0M
S       315818  GCGGCTGCAGCTCTGAGGGGCCCCAGTGGCCTGGAAGCCCACCTGCCCTCCTCCACGGCAGG
L       333878  +       315818  +       0M
L       333879  +       315818  +       0M
S       508660  GCCCTCCTCCACGGCAGGCTACGGCTTCCTG
L       315818  +       508660  +       0M
S       435067  GCAGGCTACGGCTTCCTGC
L       508660  +       435067  +       0M
L       508661  +       435067  +       0M
...

I've only tested the GFA output by manual inspection and by loading it into Bandage.
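For anyone wanting to inspect these files programmatically rather than in Bandage, here is a minimal sketch of reading the S (segment) and L (link) records shown above (tab-separated fields per the GFA v1 spec; `parse_gfa` is an illustrative helper, not part of metagraph):

```python
# Minimal reader for the S (segment) and L (link) records in a GFA v1
# file like the one above. Illustrative sketch only, not a full parser.

def parse_gfa(lines):
    segments = {}  # segment id -> sequence
    links = []     # (from_id, from_orient, to_id, to_orient, overlap)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            links.append(tuple(fields[1:6]))
    return segments, links

segs, lnks = parse_gfa([
    "S\t351710\tTAACCCGG",   # sequences truncated for the example
    "S\t333879\tGATCTCCT",
    "L\t351710\t+\t333879\t+\t0M",
])
print(len(segs), len(lnks))  # 2 1
```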

trgibbons commented 4 years ago

Thanks for implementing the GFA support! I'm sorry to trouble you again so soon, but could you please reopen this? We have two further requests/sub-issues:

  1. We noticed that the GFA files did not contain any path/walk information, only segments (nodes) and links (edges). This is crucial for using metagraph as a replacement for whole-genome alignment en route to variant graphs.
  2. We discovered that using vg to convert the metagraph GFA files into vg format used hundreds of gigabytes of RAM for relatively small files. We happen to have very large memory machines, but metagraph is generally so much more memory efficient than vg that we were really hoping you might still consider outputting vg files directly 🙏
karasikov commented 4 years ago

I've opened a new issue (#69) for conversion to .vg.

We discovered that using vg to convert the metagraph GFA files into vg format used hundreds of gigabytes of RAM for relatively small files.

Currently, the segments in the gfa are indexed according to their last k-mers and their indexes in the dbg. Thus, in general, the segments get very large indexes (~10^9 instead of ~10^6). Could this be the cause of the issue? Could you check it for a gfa file that can be converted to vg (just add a large number to all indexes in the gfa and try to convert to vg again)?
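The experiment suggested here (shift every index by a large constant and retry the vg conversion) could be scripted along these lines; a sketch operating on GFA lines in memory, with `shift_gfa_ids` a hypothetical helper name:

```python
# Shift every segment id in GFA S and L records by a constant offset,
# to test whether large ids affect vg's memory usage. Sketch only.

def shift_gfa_ids(gfa_lines, offset):
    shifted = []
    for line in gfa_lines:
        f = line.rstrip("\n").split("\t")
        if f[0] == "S":
            f[1] = str(int(f[1]) + offset)          # segment id
        elif f[0] == "L":
            f[1] = str(int(f[1]) + offset)          # from-segment id
            f[3] = str(int(f[3]) + offset)          # to-segment id
        shifted.append("\t".join(f))
    return shifted

out = shift_gfa_ids(["S\t5\tACGT", "L\t5\t+\t6\t+\t0M"], 10**9)
```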

We noticed that the GFA files did not contain any path/walk information, only segments (nodes) and links (edges). This is crucial for using metagraph as a replacement for whole-genome alignment en route to variant graphs.

There is no metadata in pure de Bruijn graphs which could be used for deriving their paths. We could define paths as unitigs, but those would be equivalent to the current segments in the gfa. Does that make sense?

Besides the simple dbg -> gfa transform, we could map other sequences to the graph and define paths in the gfa as their segments aligned to the graph (in the simplest case, with k-mer matching). Would this help? The algorithm would be:

  1. transform dbg -> gfa as previously, without paths;
  2. map/align sequences to dbg and extend gfa with paths.
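If the mapping in step 2 yields an ordered walk of segment ids with orientations, its output is just one GFA v1 P (path) record per input sequence; a sketch (the walk itself is assumed to come from the mapping step, and `gfa_path_line` is a hypothetical helper):

```python
# Emit a GFA v1 P (path) record for a sequence whose walk through the
# graph is known as an ordered list of (segment_id, orientation) pairs.
# The mapping/alignment step that produces the walk is assumed.

def gfa_path_line(path_name, walk, overlaps="*"):
    segment_names = ",".join(f"{sid}{orient}" for sid, orient in walk)
    return f"P\t{path_name}\t{segment_names}\t{overlaps}"

line = gfa_path_line("chrI", [("351710", "+"), ("333879", "+"), ("315818", "+")])
print(line)
```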
trgibbons commented 4 years ago

Thanks for opening the new issue!

When we spoke with André about exporting the dbg files to gfa, I had hoped that it would be the annotated dbg files he mentioned, and that the resulting gfa files would contain the genome paths.

Any solution that allows us to obtain vg files (including genome paths/walks for large and complex genomes) is acceptable for us. It would be great if the solution were also fast, memory efficient, and didn't require the generation of an intermediate (gfa) file ~3x larger than either the input (dbg) or output (vg) files, but we can classify those as "nice to haves" for the moment 😄

As for the "indexes" in the gfa file, I'm afraid that I'm not sure what you mean. Do you mean the segment/node identification number? (numerical middle column in the lines beginning with S)

S   33758384    TAGCATTACATATTATACTGAAATGTCAAACGTAAAACAATACATCATTCAATTTATCCTTTAATAATAAGAAATAAACACATTGACTTGGGGTAGGAACACTATCATGATATTATTAACGTAAAAGTTCCTTAATATTGTCATGTACAGTACCATATAATCTTGCCATTTGCTTAAACAGATGC
S   51412402    TACAGTACCATATAATCTTGCCATTTGCTTAAACAGATGCTATTTT
L   33758383    +   51412402    +   0M
L   33758384    +   51412402    +   0M
S   10151683    ACCATATAATCTTGCCATTTGCTTAAACAGATGCTATTTTAAAATATTCAATACTTTCACAGATTCCACATTAGAA
L   51412402    +   10151683    +   0M
S    5976133    TTTTAAAATATTCAATACTTTCACAGATTCCACATTAGAACAT
L   10151683    +   5976133 +   0M
L   10151684    +   5976133 +   0M
S   21381844    TAAAATATTCAATACTTTCACAGATTCCACATTAGAACATATGTCATACCACTATGTTGAATCTGATAACCCAGAAGCA
chrisbarber commented 4 years ago

The gfa will not be so large if the sequence data is kept in a separate compressed fasta, which is easy to implement but I'm not sure vg or other tools actually support this. From a quick look at vg's gfa loading code it doesn't look to be supported. Without something like this there is no way the gfa file size is going to be competitive with dbg for example.

What was the size of the gfa you were trying to load into vg? If vg was using orders of magnitude more memory than the file itself, that would sound like an issue with their gfa loading. I will try to reproduce on my end, but it will take some time, since I want to build vg and troubleshoot in more detail rather than just running their precompiled binaries.

trgibbons commented 4 years ago

We observed this with the YPRP yeast strains, as well as some public maize genomes. The dbg, gfa, and vg files from the 12 yeast genomes are respectively 55, 178, and 46 MB.

metagraph used 129 MB of memory to convert the dbg to gfa, while vg view used 4.45 GB to convert that gfa to vg.

This caused us to wonder if it might be possible to simultaneously skip both the large intermediate gfa file and the expensive vg view conversion step by converting directly from dbg to vg using metagraph.

I'm happy to share the data files, if that would be helpful.

akahles commented 4 years ago

@trgibbons, could you please quickly clarify your exact use case? What is the input to building the graphs? Raw sequencing data, draft assemblies, contigs? Depending on that context, what information should be encoded in the paths?

Thanks!

trgibbons commented 4 years ago

As a starting point, we are focused on comparing pseudochromosome-quality reference genomes. The paths should be the pseudochromosome sequences.

From what I can tell, we are doing exactly what was demonstrated in the Example from the README:

DATA="../tests/data/transcripts_1000.fa"
./metagraph build -k 12 -o transcripts_1000 $DATA
./metagraph annotate -i transcripts_1000 --anno-filename -o transcripts_1000 $DATA
./metagraph query -i transcripts_1000 -a transcripts_1000.column.annodbg $DATA
./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000

except that we are using the 12 YPRP yeast genomes instead of the transcripts_1000.fa sequences.

I have so far been unable to inspect the dbg or column.annodbg files, but I assume that the column.annodbg files contain the path information for the sequences used to create the dbg, and am hoping that these files can be combined to generate gfa and/or vg files that contain path information for the input sequences.

chrisbarber commented 4 years ago

@trgibbons Conceptually the annotation is a 2-D binary matrix, kmer x annotation_label, where kmer is over the space of all kmers stored in the de Bruijn graph, and annotation labels are chosen by the user. So each annotation label (column) of the matrix implies a subset of kmers (rows)-- those rows marked with 1 as opposed to 0 in that column-- within the overall de Bruijn graph. Annotating a sequence in the graph just finds the matching kmers and marks those rows 1.

In the general case, one such set of rows can match an entirely arbitrary subset of the kmers/rows in the overall de Bruijn graph.

If the annotated sequence was also used when building the graph, as in your example, then all the kmers will match, and the annotated rows (marked as 1 in a column) will correspond to a contiguous part of the overall de Bruijn graph. But even then it could contain loops for example, meaning that even in this case we cannot recover the original sequence or path information from graph+annotation alone. Additional path information would need to be stored, which we are not doing.

Note also that one can annotate a graph with sequences whose kmers are not all present in the graph (unlike your example). As in, metagraph build with one file or set of files, and metagraph annotate with another file or file(s). In this case the annotated rows could be arbitrarily dispersed in the overall de Bruijn graph, depending only on which kmers matched. Same as before there is no path information, but now also the annotated portion of the graph could be disconnected.

On to the selection of annotation labels-- I said these are chosen by the user. Some options can be seen in metagraph annotate -h. But --anno-filename (as in the example you have referenced) will create one annotation column per file passed on the command line (you can pass multiple), with the label for that column being the filename. If $DATA is multiple files then you will get multiple columns. The other major option is --anno-header, where you will get a new label and column for each sequence based on its name in the FASTA.

You can inspect the columns/labels using something like:

./metagraph stats --print-col-names -a test_graph.column.annodbg test_graph.dbg

Or you can omit --print-col-names and just see a summary of the annotation matrix.
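The mental model above can be sketched as a toy (nothing like metagraph's actual succinct representation; all names here are made up for illustration):

```python
# Toy model of the annotation described above: a binary matrix of
# kmer rows x label columns, where each label marks the subset of
# kmers it annotates. Sketch only, not metagraph's data structure.

class ToyAnnotation:
    def __init__(self, num_kmers, labels):
        self.num_kmers = num_kmers
        self.columns = {label: set() for label in labels}  # label -> rows set to 1

    def annotate(self, label, rows):
        self.columns[label].update(rows)

    def density(self):
        ones = sum(len(rows) for rows in self.columns.values())
        return ones / (self.num_kmers * len(self.columns))

# One column per input file, as with --anno-filename:
anno = ToyAnnotation(num_kmers=10, labels=["genome1.fa", "genome2.fa"])
anno.annotate("genome1.fa", range(10))  # every kmer matched (sequence was in the build)
anno.annotate("genome2.fa", [0, 1, 2])  # partial match; rows may be dispersed
print(anno.density())  # (10 + 3) / (10 * 2) = 0.65
```

This also explains the `density` figure reported by `metagraph stats -a`: it is the fraction of ones in the matrix.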

Anyways, back to the question about path information and generating gfa/vg: in the current metagraph, I think the best that can be generated from an existing graph+annotation is to dump contigs for each column in the annotation and call those paths. While this could retain a little more path information than unitigs from the full graph alone, it is still very lossy and also depends on how many sequences were fed to each annotation column.

Mentioned at the end of https://github.com/ratschlab/projects2014-metagenome/issues/47#issuecomment-541785010 is some functionality we could add pretty easily for generating path information in gfa/vg/other. But it would be another distinct processing step, just for generating files with path information. In this case the annotation matrix would not be used/needed. It would look like:

  1. Build de Bruijn graph from a bunch of sequences
  2. Reprocess those same sequences in order to walk their paths within the DBG, outputting those paths/walks to gfa/vg as we go

Still, the path information would not be encoded in our dbg or annotation files, it would just dump to gfa/vg. And in the naive case we would just include one path for every input sequence, rather than somehow deduplicating overlapping/consistent paths-- point being that even for this proposed functionality it still might not be quite what you expect/want; I'm not sure.

akahles commented 4 years ago

@karasikov We have briefly discussed this in our last meeting. From the description above, I still think that the paths could be encoded using the machinery that Jan developed during his masters. What do you think?

karasikov commented 4 years ago

@karasikov We have briefly discussed this in our last meeting. From the description above, I still think that the paths could be encoded using the machinery that Jan developed during his masters. What do you think?

Well, if GFA with paths is all we need, I don't see a point in obfuscating the problem.

hmusta commented 4 years ago

The point is that for metagraph to be more useful to those studying collections of reference genomes, there needs to be a way to store paths corresponding to reference chromosomes. Storing all of these paths as a large GFA is inefficient.

I think a better method would be to have all paths encoded succinctly using the method Jan developed, then provide the option to export the graph with a user-chosen subset of these paths as a GFA file.

hmusta commented 4 years ago

But since his work isn't "production ready" yet, the quick solution for now would be to have a mode which maps a FASTA file to the graph and outputs the resulting paths in the GFA format.

karasikov commented 4 years ago

The point is that for metagraph to be more useful to those studying collections of reference genomes, there needs to be a way to store paths corresponding to reference chromosomes. Storing all of these paths as a large GFA is inefficient.

Please tell me why it is inefficient.

karasikov commented 4 years ago

To clarify, that stuff was designed for other purposes.

As far as I understood, this thread is about using Metagraph for building GFA files. This problem can be easily and more efficiently solved by simple mapping. Am I missing something?

hmusta commented 4 years ago

It would depend on what's being stored. If a large collection of chromosomes that share similar blocks is being stored, then there would be lots of redundancy in the GFA if they're stored as paths.

The GFA standard does support storing these as variant calls, which would be even more efficient, but this opens its own can of worms if the inputs are reference chromosomes because one would have to compute an MSA to get this.

hmusta commented 4 years ago

To clarify, that stuff was designed for other purposes.

As far as I understood, this thread is about using Metagraph for building GFA files. This problem can be easily and more efficiently solved by simple mapping. Am I missing something?

The points I'm bringing up are ideas for future work that would supersede this particular issue. We could solve this one with the simple mapping and open a new feature request.

akahles commented 4 years ago

Ok, I see the point that aligning back a given fasta to the graph to extract the path for GFA is the most immediate solution. Let's keep the construction, permanent representation and export of paths as an important feature to be developed.

trgibbons commented 4 years ago

Thanks to @chrisbarber for the detailed explanation and to everyone for taking the time to discuss this. I think all of the important points have already been made. I just want to add that the performance of metagraph query might not scale to hundreds or thousands of large genomes, but it's certainly good enough for us for now. The GFA intermediate is a much more restrictive bottleneck.

Here are more detailed performance statistics from our tests creating a single graph from five complete maize genomes. We ran the jobs on an older, smaller node. We gave them 40 threads, but most of the time they only used a handful of these. The memory usage is based on the maximum values I happened to observe in HTOP over the course of several hours. I did not closely watch HTOP the entire time, so it's possible that some steps used more memory.

Timing

| k  | Build | Annotate | Query |
|----|-------|----------|-------|
| 21 | 12m   | 82m      | 37m   |
| 31 | 25m   | 80m      | 35m   |
| 41 | 26m   | 79m      | 32m   |
| 51 | 56m   | 77m      | 33m   |

Memory

| k   | Build | Annotate | Query |
|-----|-------|----------|-------|
| 21  | 80G   | 90G      | 105G  |
| 31  | 150G  | 90G      | 200G  |
| 41† | 80G   | 40G      | 200G  |
| 51  | 300G  | 95G      | 220G  |

† These are the values I observed, but I suspect they grew higher when I wasn't watching

Disk space

| k  | .dbg | .edgemask |
|----|------|-----------|
| 21 | 1.3G | 188M      |
| 31 | 1.8G | 272M      |
| 41 | 2.3G | 347M      |
| 51 | 2.8G | 413M      |
chrisbarber commented 4 years ago

Hi @trgibbons, I was just thinking I should give you an update (or "non-update") on the matter of metagraph GFA -> vg conversion:

I am still thinking that the gfa output by metagraph is okay. The segment ids that @karasikov mentioned (https://github.com/ratschlab/projects2014-metagenome/issues/47#issuecomment-541785010) are not as short as they could be, but the memory usage by vg that you were seeing (https://github.com/ratschlab/projects2014-metagenome/issues/47#issuecomment-542199393) is about 40x the size of the gfa file.

My inclination is to report this issue to vg, but I thought of one thing to try first. From taking a look at what I believe to be vg's loading code (https://github.com/vgteam/vg/blob/65cef16db927008bfbee50c072907c25a3853bf1/src/gfa.cpp#L143-L1075), it looks like they use https://github.com/edawson/gfakluge for loading. Something relatively easy to do would be to try the gfakluge command line utilities with GFA files output by metagraph and see how the memory usage looks. This information might help vg to troubleshoot, or perhaps it could reveal whether gfakluge itself is already using excess memory for a given gfa, in which case the issue could be reported directly to them.

I have gone ahead and done this with a small (~1MB) gfa file. I installed gfakluge via linuxbrew (brew install brewsci/bio/gfakluge), generated this example gfa using metagraph via

./metagraph build -k 19 --graph succinct -o test_graph ../tests/data/transcripts_100*
./metagraph assemble -o test_graph --unitigs --to-gfa test_graph.dbg

Then I ran gfak merge, which among the various gfakluge commands appears to be one that actually loads the whole graph into memory (and it makes sense that it should need to, unlike some of the other commands), e.g. by "merging" just one gfa, or by merging the same gfa with itself:

\time -v gfak merge test_graph.gfa
\time -v gfak merge test_graph.gfa test_graph.gfa

In both cases with this ~1MB gfa, I get around 16MB peak memory. With such a small file though it is not entirely clear, so I would like to repeat the experiment with your gfa file. Do you have a way to share one of your large (e.g. 178MB) metagraph gfa files with me, and I can try this again? Or if it's easier you could repeat the above steps (install gfakluge, run the merge command).

This information I think would be a good start in knowing if/where to report the issue upstream, and nice information for whoever ends up taking it further.

Also, just to reiterate regarding reducing the large gfa file size on disk: it is easy for me to implement a gfa output mode where the sequence data is stored in external compressed fasta, which should help a lot. From a quick look, it appears that gfakluge supports this (e.g. https://github.com/edawson/gfakluge/blob/master/examples.md#replacing-sequence-placeholders-with-fasta-records), which may indicate that vg likewise can load these, but I haven't confirmed this.
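Such a mode would essentially replace each S record's sequence with the `*` placeholder the GFA spec allows and write the sequences to a companion FASTA keyed by segment id; a sketch of the idea (`externalize_sequences` is a hypothetical helper, not an existing metagraph option):

```python
# Split a GFA into (1) a GFA whose S records carry the '*' sequence
# placeholder and (2) FASTA records keyed by segment id, which can then
# be compressed separately. Sketch of the proposed output mode only.

def externalize_sequences(gfa_lines):
    gfa_out, fasta_out = [], []
    for line in gfa_lines:
        f = line.rstrip("\n").split("\t")
        if f[0] == "S":
            fasta_out.append(f">{f[1]}\n{f[2]}")
            f[2] = "*"
        gfa_out.append("\t".join(f))
    return gfa_out, fasta_out

gfa, fasta = externalize_sequences(["S\t351710\tTAACCCGG", "L\t351710\t+\t333879\t+\t0M"])
```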

karasikov commented 4 years ago

Thanks to @chrisbarber for the detailed explanation and to everyone for taking the time to discuss this. I think all of the important points have already been made. I just want to add that the performance of metagraph query might not scale to hundreds or thousands of large genomes, but it's certainly good enough for us for now. The GFA intermediate is a much more restrictive bottleneck.

Here are more detailed performance statistics from our tests creating a single graph from five complete maize genomes. We ran the jobs on an older, smaller node. We gave them 40 threads, but most of the time they only used a handful of these. The memory usage is based on the maximum values I happened to observe in HTOP over the course of several hours. I did not closely watch HTOP the entire time, so it's possible that some steps used more memory.

Timing

| k  | Build | Annotate | Query |
|----|-------|----------|-------|
| 21 | 12m   | 82m      | 37m   |
| 31 | 25m   | 80m      | 35m   |
| 41 | 26m   | 79m      | 32m   |
| 51 | 56m   | 77m      | 33m   |

Memory

| k   | Build | Annotate | Query |
|-----|-------|----------|-------|
| 21  | 80G   | 90G      | 105G  |
| 31  | 150G  | 90G      | 200G  |
| 41† | 80G   | 40G      | 200G  |
| 51  | 300G  | 95G      | 220G  |

† These are the values I observed, but I suspect they grew higher when I wasn't watching

Disk space

| k  | .dbg | .edgemask |
|----|------|-----------|
| 21 | 1.3G | 188M      |
| 31 | 1.8G | 272M      |
| 41 | 2.3G | 347M      |
| 51 | 2.8G | 413M      |

Hi Ted,

It would be interesting to see the stats of the databases:

./metagraph stats graph.dbg
./metagraph stats -a annotation.-.annodbg

Could you please show these?

What data do you use for querying? Is it the same 5 genomes that you indexed in graphs?

trgibbons commented 4 years ago

@chrisbarber Here's a temporary Dropbox link to the 178 MB GFA file for 12 YPRP yeast genomes: https://www.dropbox.com/s/h0yccs7j37its8d/yprp_metagraph.gfa.gz?dl=0

@karasikov I started by copying the Example commands directly out of the README.md: https://github.com/ratschlab/projects2014-metagenome#example

I then increased -k and -p, and redirected DATA="../tests/data/transcripts_1000.fa" to one of two files I'm using for testing. The smaller file contains the 12 YPRP yeast genomes. The larger file contains 5 maize genomes (B73 + 4x Flint lines).

Looking at the statistics for the annotations (see below), I'm wondering if it would be better in this case to annotate the .dbg files using separate FASTA files for each genome, rather than the combined FASTA file used to build the graph. I also just noticed that the .column.annodbg files are all 352 bytes for maize, regardless of k-mer size, and only slightly larger than the 306 byte .column.annodbg file created for yeast.

yprp_metagraph.dbg

Statistics for graph 10_yeast/10_YPRP/yprp_metagraph.dbg
====================== GRAPH STATS =====================
k: 41
nodes (k): 53059818
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 41
nodes (k-1): 52549133
edges ( k ): 53062693
state: fast
W stats: {'$': 102, 'A': 16425047, 'C': 10081878, 'G': 10114934, 'T': 16440732}
F stats: {'$': 5, 'A': 16420969, 'C': 10106727, 'G': 10096602, 'T': 16438390}
========================================================
Statistics for annotation 10_yeast/10_YPRP/yprp_metagraph.column.annodbg
=================== ANNOTATION STATS ===================
labels:  1
objects: 53059818
density: 1.000000e+00
representation: column
========================================================

Zm_B73+Flint_pseudochromosomes_metagraph_k21.dbg

Statistics for graph 20_maize/10_maize_five/k21/Zm_B73+Flint_pseudochromosomes_metagraph_k21.dbg
====================== GRAPH STATS =====================
k: 21
nodes (k): 1219948871
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 21
nodes (k-1): 1159980281
edges ( k ): 1220038973
state: fast
W stats: {'$': 7090, 'A': 329729111, 'C': 276264090, 'G': 281211323, 'T': 332827359}
F stats: {'$': 5, 'A': 329564341, 'C': 277138405, 'G': 281026823, 'T': 332309399}
========================================================
Statistics for annotation 20_maize/10_maize_five/k21/Zm_B73+Flint_pseudochromosomes_metagraph_k21.column.annodbg
=================== ANNOTATION STATS ===================
labels:  1
objects: 1219948871
density: 1.000000e+00
representation: column
========================================================

Zm_B73+Flint_pseudochromosomes_metagraph_k31.dbg

Statistics for graph 20_maize/10_maize_five/k31/Zm_B73+Flint_pseudochromosomes_metagraph_k31.dbg
====================== GRAPH STATS =====================
k: 31
nodes (k): 1761834791
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 31
nodes (k-1): 1710657378
edges ( k ): 1762139673
state: fast
W stats: {'$': 15870, 'A': 473500941, 'C': 403077400, 'G': 408838160, 'T': 476707302}
F stats: {'$': 5, 'A': 473376189, 'C': 404093110, 'G': 408271155, 'T': 476399214}
========================================================
Statistics for annotation 20_maize/10_maize_five/k31/Zm_B73+Flint_pseudochromosomes_metagraph_k31.column.annodbg
=================== ANNOTATION STATS ===================
labels:  1
objects: 1761834791
density: 1.000000e+00
representation: column
========================================================

Zm_B73+Flint_pseudochromosomes_metagraph_k41.dbg

Statistics for graph 20_maize/10_maize_five/k41/Zm_B73+Flint_pseudochromosomes_metagraph_k41.dbg
====================== GRAPH STATS =====================
k: 41
nodes (k): 2247469043
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 41
nodes (k-1): 2201842265
edges ( k ): 2248059202
state: fast
W stats: {'$': 20968, 'A': 601320298, 'C': 518268216, 'G': 524158892, 'T': 604290828}
F stats: {'$': 5, 'A': 600903365, 'C': 519621122, 'G': 523656742, 'T': 603877968}
========================================================
Statistics for annotation 20_maize/10_maize_five/k41/Zm_B73+Flint_pseudochromosomes_metagraph_k41.column.annodbg
=================== ANNOTATION STATS ===================
labels:  1
objects: 2247469043
density: 1.000000e+00
representation: column
========================================================

Zm_B73+Flint_pseudochromosomes_metagraph_k51.dbg

Statistics for graph 20_maize/10_maize_five/k51/Zm_B73+Flint_pseudochromosomes_metagraph_k51.dbg
====================== GRAPH STATS =====================
k: 51
nodes (k): 2680522906
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 51
nodes (k-1): 2640373893
edges ( k ): 2681443628
state: fast
W stats: {'$': 24296, 'A': 714622497, 'C': 621945712, 'G': 627287376, 'T': 717563747}
F stats: {'$': 5, 'A': 714269757, 'C': 623066425, 'G': 627044322, 'T': 717063119}
========================================================
Statistics for annotation 20_maize/10_maize_five/k51/Zm_B73+Flint_pseudochromosomes_metagraph_k51.column.annodbg
=================== ANNOTATION STATS ===================
labels:  1
objects: 2680522906
density: 1.000000e+00
representation: column
========================================================
karasikov commented 4 years ago

@karasikov I started by copying the Example commands directly out of the README.md: https://github.com/ratschlab/projects2014-metagenome#example

I then increased -k and -p, and redirected DATA="../tests/data/transcripts_1000.fa" to one of two files I'm using for testing. The smaller file contains the 12 YPRP yeast genomes. The larger file contains 5 maize genomes (B73 + 4x Flint lines).

Looking at the statistics for the annotations (see below), I'm wondering if it would be better in this case to annotate the .dbg files using separate FASTA files for each genome, rather than the combined FASTA file used to build the graph. I also just noticed that the .column.annodbg files are all 352 bytes for maize, regardless of k-mer size, and only slightly larger than the 306 byte .column.annodbg file created for yeast.

...

Zm_B73+Flint_pseudochromosomes_metagraph_k51.dbg

Statistics for graph 20_maize/10_maize_five/k51/Zm_B73+Flint_pseudochromosomes_metagraph_k51.dbg
====================== GRAPH STATS =====================
k: 51
nodes (k): 2680522906
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 51
nodes (k-1): 2640373893
edges ( k ): 2681443628
state: fast
W stats: {'$': 24296, 'A': 714622497, 'C': 621945712, 'G': 627287376, 'T': 717563747}
F stats: {'$': 5, 'A': 714269757, 'C': 623066425, 'G': 627044322, 'T': 717063119}
========================================================
Statistics for annotation 20_maize/10_maize_five/k51/Zm_B73+Flint_pseudochromosomes_metagraph_k51.column.annodbg
=================== ANNOTATION STATS ===================
labels:  1
objects: 2680522906
density: 1.000000e+00
representation: column
========================================================

It looks like you annotated all k-mers in graph with the same label. This is why it's easy to compress annotation to less than 1 Kb in size.

You should pass flag --anno-header when annotating the graph with multiple genomes stored in a single fasta file like:

>genome1
ACGACACTAGTC...
>genome2
CATGACGATGCA...

Alternatively, you can split the genomes into multiple fasta files and build annotations from these with flag --anno-filename.
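One quick way to do that split is a one-liner that writes each FASTA record to a file named after its header (a sketch; `genomes.fa` and the output names are placeholders):

```shell
# Toy stand-in for the combined multi-FASTA (file names are placeholders)
printf '>genome1\nACGT\n>genome2\nTTTT\n' > genomes.fa

# Write each record to its own <header>.fa file
awk '/^>/ { if (out) close(out); out = substr($1, 2) ".fa" } { print > out }' genomes.fa

# Each resulting file can then get its own annotation column, e.g.:
# ./metagraph annotate -i graph --anno-filename -o graph genome1.fa genome2.fa
```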

We have faster graph representations (e.g., --graph hash), but they use way more memory.

Are you querying the same sequences that you used for graph construction (i.e., all k-mers from the queried sequence are indexed in graph) or something with k-mers not indexed in graph?

The succinct graph representation is slower for k-mer misses, but we are currently working on improving this. For the case where all k-mers from the query are in graph, however, it's hard to speed up queries further. This is the cost we have to pay for storing k-mers in a compressed data structure.

chrisbarber commented 4 years ago

Thanks @trgibbons for providing the gfa. I don't think my experiment has clarified anything, so I will have to think what to do next, but here are the results nonetheless:

[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 10000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
        User time (seconds): 7.99
        Maximum resident set size (kbytes): 12364
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 20000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
        User time (seconds): 32.86
        Maximum resident set size (kbytes): 22976
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 30000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
        User time (seconds): 83.73
        Maximum resident set size (kbytes): 33588
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 40000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
        User time (seconds): 163.39
        Maximum resident set size (kbytes): 44176
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 50000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
        User time (seconds): 267.75
        Maximum resident set size (kbytes): 54796

In contrast to what might have been expected, the runtimes are super-linear but the memory seems fine.
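For what it's worth, a rough log-log fit over those five measurements quantifies that impression; a quick sanity-check script (not a rigorous fit):

```python
# Rough growth exponents from the gfak merge measurements above,
# comparing the 10k-line and 50k-line runs (x ~ n^e).
import math

lines = [10000, 20000, 30000, 40000, 50000]
secs  = [7.99, 32.86, 83.73, 163.39, 267.75]   # user time
kbyte = [12364, 22976, 33588, 44176, 54796]    # max resident set size

time_exp = math.log(secs[-1] / secs[0]) / math.log(lines[-1] / lines[0])
mem_exp  = math.log(kbyte[-1] / kbyte[0]) / math.log(lines[-1] / lines[0])
print(f"time ~ n^{time_exp:.2f}, memory ~ n^{mem_exp:.2f}")
```

The time exponent comes out near 2 (roughly quadratic) while the memory exponent is near 1 (linear), consistent with "runtimes are super-linear but the memory seems fine."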

trgibbons commented 4 years ago

@karasikov

What data do you use for querying? Is it the same 5 genomes that you indexed in graphs?

Are you querying the same sequences that you used for graph construction (i.e., all k-mers from the queried sequence are indexed in graph) or something with k-mers not indexed in graph?

For each set of analyses (yeast or maize), I have been putting all of my sequences of interest into a single FASTA file, storing the path to that file in a $SEQS variable, and then passing that variable to every step (build, annotate & query), because that is how it is shown in the Example in the documentation: https://github.com/ratschlab/projects2014-metagenome#example

@chrisbarber

Thanks for looking into this and sharing your results. The performance of the GFA --> vg conversion is, for us, secondary to the more fundamental problem of Metagraph currently not encoding path data. I worry now that I unintentionally diverted this conversation away from our primary goal.

Our primary goal(s)

The ultimate goal of our group is to create and visualize genome variation graphs for fully-assembled, large, complex genomes. We have received a grant to work on genome graph visualization, but have gotten stuck trying to create the graphs we want to visualize. The other tools we've tried can either handle small (< 1 Gbp) complex genomes, or large datasets from human resequencing projects that narrowly focus on SNPs and other small variants, but cannot handle large (> 1 Gbp) complex genomes.

Part of our team works on agricultural (mostly polyploid plant) genomics. The other part works on human cancer genomics. It is therefore necessary for us to identify large, complex, and often nested structural variants, in addition to SNPs, CNVs, and other small variants.

We are happy to use any tools or intermediate files necessary to obtain these genome graphs, so long as they:

  1. Can handle very large and very complex genomes
  2. Retain the path information for these genomes
  3. Can be imported into vg, which already supports a lot of the functionality we rely on for scalable visualization

trgibbons commented 4 years ago

@karasikov

Alternatively, you can split the genomes into multiple fasta files and build annotations from these with flag --anno-filename.

I realized only after reading this that it's even possible to provide the input sequences in separate files. The commands in the metagraph documentation used only a single FASTA file containing all sequences of interest, so I assumed that metagraph was like seqwish in requiring all of the sequences to be combined into a single FASTA file.

If this is not the correct/preferred way to create a graph (with path information), then it would be helpful if the README.md was updated to reflect the best practices.

I have now re-run the metagraph pipeline twice, once with separate and once with combined sequences, using the --anno-filename flag with annotate each time.

Separate sequences

The .dbg, .column.annodbg, and .edgemask files are respectively 42 MB, 66 MB, and 6.3 MB.

Statistics for graph yprp_metagraph.dbg
====================== GRAPH STATS =====================
k: 21
nodes (k): 40368796
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 21
nodes (k-1): 39526212
edges ( k ): 40369120
state: fast
W stats: {'$': 21, 'A': 12467772, 'C': 7694062, 'G': 7708496, 'T': 12498769}
F stats: {'$': 5, 'A': 12468867, 'C': 7703697, 'G': 7709491, 'T': 12487060}
========================================================
Statistics for annotation yprp_metagraph.column.annodbg
=================== ANNOTATION STATS ===================
labels:  12
objects: 40368796
density: 2.871733e-01
representation: column
========================================================

Combined sequences

The .dbg, .column.annodbg, and .edgemask files are respectively 42 MB, 306 B, and 6.3 MB.

Statistics for graph yprp_metagraph.dbg
====================== GRAPH STATS =====================
k: 21
nodes (k): 40368796
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 21
nodes (k-1): 39526212
edges ( k ): 40369120
state: fast
W stats: {'$': 21, 'A': 12467772, 'C': 7694062, 'G': 7708496, 'T': 12498769}
F stats: {'$': 5, 'A': 12468867, 'C': 7703697, 'G': 7709491, 'T': 12487060}
========================================================
Statistics for annotation yprp_metagraph.column.annodbg
=================== ANNOTATION STATS ===================
labels:  1
objects: 40368796
density: 1.000000e+00
representation: column
========================================================
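For what it's worth, the reported densities are consistent with density = set_bits / (labels Γ— objects) over the binary label matrix. This is just a sanity check on the numbers printed above, not metagraph code:

```python
OBJECTS = 40_368_796  # k-mers, from the ANNOTATION STATS above

def density(set_bits, labels, objects=OBJECTS):
    """Fraction of set bits in the labels x objects annotation matrix."""
    return set_bits / (labels * objects)

# Combined input: one label covering every k-mer -> density 1.0
assert density(OBJECTS, labels=1) == 1.0

# Separate inputs: 12 labels at density ~0.287 imply on average
# about 0.287 * 12 ~ 3.4 genomes sharing each k-mer.
avg_labels_per_kmer = 2.871733e-01 * 12
print(round(avg_labels_per_kmer, 2))  # -> 3.45
```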

Do the results look any better after separating the input sequences?

karasikov commented 4 years ago

Hi @trgibbons

Do the results look any better after separating the input sequences?

The graph is the same, which is expected as you constructed it from the same set of sequences. The annotation, however, did change. Now you indeed have 12 labels that mark the relation between the genomes and their k-mers.
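Conceptually (a toy sketch, not metagraph's actual data structures; the genome names are made up), a column annotation keeps one bit-column per label, so a k-mer shared between genomes simply appears in several columns:

```python
# One set of node ids per label (genome); names and ids are hypothetical.
columns = {
    "strain_A.fa": {0, 1, 2},
    "strain_B.fa": {1, 2, 3},
}

def labels_of(node_id):
    """Return all genome labels whose column has a bit set for this node."""
    return sorted(name for name, rows in columns.items() if node_id in rows)

print(labels_of(2))  # a k-mer shared by both genomes carries both labels
```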

If this is not the correct/preferred way to create a graph (with path information), then it would be helpful if the README.md was updated to reflect the best practices.

Thanks for the suggestion, we will indeed improve the readme as well as describe the usage and the most common use cases in wiki pages.

We are happy to use any tools or intermediate files necessary to obtain these genome graphs, so long as they:

  1. Can handle very large and very complex genomes
  2. Retain the path information for these genomes
  3. Can be imported into vg, which already supports a lot of the functionality we rely on for scalable visualization

I believe we can handle large genomes.

@chrisbarber, would it be possible to implement 3. in the near future (that is, add support for aligning sequences to the graph and extend the GFA file with paths/walks)?

trgibbons commented 4 years ago

We are happy to use any tools or intermediate files necessary to obtain these genome graphs, so long as they:

  1. Can handle very large and very complex genomes
  2. Retain the path information for these genomes
  3. Can be imported into vg, which already supports a lot of the functionality we rely on for scalable visualization

I believe we can handle large genomes.

Exactly πŸ‘ There are several tools capable of the second two points, but the first one is extraordinarily difficult. Metagraph is the first tool I've seen that addresses the first point, which is why I'm harassing you all the most πŸ˜‡

chrisbarber commented 4 years ago

@chrisbarber Thanks for looking into this and sharing your results. The performance of the GFA --> vg conversion is, for us, secondary to the more fundamental problem of Metagraph currently not encoding path data. I worry now that I unintentionally diverted this conversation away from our primary goal.

@trgibbons Good to know that you will not necessarily be blocked by the GFA --> vg performance issue. I am pretty confident now that it is not a problem with the metagraph GFA layout (which is quite simple to begin with). I'm happy to be looped in if you report an issue to them.
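For anyone reproducing this: the layout really is minimal, just tab-separated S (segment) and L (link) records as in GFA1. A throwaway counter (a sketch; the sample records are abbreviated versions of the ones shown earlier in this thread):

```python
def gfa_counts(lines):
    """Count segment (S) and link (L) records in GFA1 text."""
    counts = {"S": 0, "L": 0}
    for line in lines:
        tag = line.split("\t", 1)[0]
        if tag in counts:
            counts[tag] += 1
    return counts

sample = [
    "S\t351710\tTAACCCGG",          # sequence truncated for brevity
    "S\t333879\tGATCTCCT",
    "L\t351710\t+\t333879\t+\t0M",  # 0M overlap, as in the output above
]
print(gfa_counts(sample))  # {'S': 2, 'L': 1}
```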

@chrisbarber, would it be possible to implement 3. in the near future (that is, add support for aligning sequences to the graph and extend the GFA file with paths/walks)?

@karasikov Yes, I am glad to do this and I think it will be relatively straightforward. Opened #71 to track.