Closed: hmusta closed this issue 4 years ago
Example usage:
./metagraph assemble -o test_graph --unitigs --to-gfa test_graph.dbg
Outputs test_graph.gfa, which looks like:
S 351710 TAACCCGGCCGCTAGCGCGGGGGCGCTGGGCCCCGCTGGGATCGATGCGGGCGGCCGCGCCGGCTGGGCTCTGCGGGCTGGCACCCGGCCCGGGGCGGGACCCACCTCCGCTTTCGGATCTCCTGAGGGTCCGG
S 333879 GATCTCCTGAGGGTCCGGCAGGAGGTGGCGGCTGCAGCTCTGAGG
L 351708 + 333879 + 0M
L 351709 + 333879 + 0M
L 351710 + 333879 + 0M
S 315818 GCGGCTGCAGCTCTGAGGGGCCCCAGTGGCCTGGAAGCCCACCTGCCCTCCTCCACGGCAGG
L 333878 + 315818 + 0M
L 333879 + 315818 + 0M
S 508660 GCCCTCCTCCACGGCAGGCTACGGCTTCCTG
L 315818 + 508660 + 0M
S 435067 GCAGGCTACGGCTTCCTGC
L 508660 + 435067 + 0M
L 508661 + 435067 + 0M
...
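Beyond manual inspection and Bandage, a quick structural sanity check of such GFA output can be scripted. The sketch below (splitting on whitespace for readability; real GFA files are tab-separated) just counts segments and links and flags links that reference undeclared segments:

```python
def check_gfa_links(lines):
    """Return (num_segments, num_links, num_dangling_links) for a list
    of GFA1 record strings. A link is 'dangling' if either of the two
    segment ids it references was never declared by an S record."""
    # First pass: collect declared segment ids from S records.
    segs = {rec.split()[1] for rec in lines if rec.startswith("S")}
    # Second pass: check both endpoints of every L record.
    links = [rec.split() for rec in lines if rec.startswith("L")]
    dangling = sum(1 for l in links if l[1] not in segs or l[3] not in segs)
    return len(segs), len(links), dangling
```

On a truncated excerpt like the one above, links to segments not shown (e.g. 351708) would be reported as dangling, so this is only meaningful on a complete file.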
I only tested the gfa by manual inspection and by loading it in Bandage.
Thanks for implementing the GFA support! I'm sorry to trouble you again so soon, but could you please reopen this? We have two further requests/sub-issues:
I've opened a new issue (#69) for conversion to .vg.
We discovered that using vg to convert the metagraph GFA files into vg format used hundreds of gigabytes of RAM for relatively small files.
Currently, the segments in `gfa` are indexed according to their last k-mers and their indexes in `dbg`. Thus, in general, the segments are indexed with very large indexes (~10^9 instead of ~10^6). Can this be the cause of this issue? Could you check it for a `gfa` file that can be converted to `vg` (just add a large number to all indexes in the `gfa` and try to convert to `vg` again)?
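The suggested experiment (offsetting every segment id by a large constant to see whether vg's memory usage depends on id magnitude) could be scripted roughly like this. This is only a sketch, assuming tab-separated GFA1 records with purely numeric segment names as in the metagraph output above:

```python
def offset_gfa_ids(records, offset):
    """Add a fixed offset to every numeric segment id in GFA1 records.
    S records carry the id in field 2; L records reference ids in
    fields 2 and 4. All other records pass through unchanged."""
    out = []
    for rec in records:
        fields = rec.rstrip("\n").split("\t")
        if fields[0] == "S":
            fields[1] = str(int(fields[1]) + offset)
        elif fields[0] == "L":
            fields[1] = str(int(fields[1]) + offset)
            fields[3] = str(int(fields[3]) + offset)
        out.append("\t".join(fields))
    return out
```

Streaming a file through this function line by line and writing the result back out would produce a GFA identical except for the id magnitudes, isolating that one variable for the vg memory test.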
We noticed that the GFA files did not contain any path/walk information, only segments (nodes) and links (edges). This is crucial for using metagraph as a replacement for whole-genome alignment en route to variant graphs.
There is no metadata in pure de Bruijn graphs which could be used for deriving its paths.
We could define paths as unitigs, but they would be equivalent to the current segments in `gfa`. Does it make sense?
Besides the simple `dbg` -> `gfa` transform, we could map other sequences to the graph and define paths in `gfa` as their segments aligned to the graph (in the simplest case, with k-mer matching). Would this help?
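The k-mer-matching idea could look roughly like this toy sketch (a plain dict stands in for the succinct k-mer index, and the helper names are illustrative, not metagraph's API):

```python
def build_kmer_index(unitigs, k):
    """Map each k-mer to the id of the segment it occurs in.
    `unitigs` mirrors the GFA S records: {segment_id: sequence}."""
    index = {}
    for seg_id, seq in unitigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = seg_id
    return index

def path_for_sequence(seq, index, k):
    """Walk a query sequence k-mer by k-mer and return the ordered
    list of segments hit, collapsing consecutive repeats -- i.e.
    the body of a GFA 'P' (path) record."""
    path = []
    for i in range(len(seq) - k + 1):
        seg = index.get(seq[i:i + k])  # None for k-mers not in the graph
        if seg is not None and (not path or path[-1] != seg):
            path.append(seg)
    return path
```

The resulting list could then be serialized as a path record, e.g. `P <name> 10+,20+ *` (orientations would need real handling in a canonical graph).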
The algorithm would be:

1. `dbg` -> `gfa` as previously, without paths;
2. map the given sequences to `dbg` and extend `gfa` with paths.

Thanks for opening the new issue!
When we spoke with André about exporting the `dbg` files to `gfa`, I had hoped that it would be the annotated `dbg` files he mentioned, and that the resulting `gfa` files would contain the genome paths.
Any solution that allows us to obtain `vg` files (including genome paths/walks for large and complex genomes) is acceptable for us. It would be great if the solution were also fast, memory efficient, and didn't require the generation of an intermediate (`gfa`) file that was ~3x larger than either the input (`dbg`) or output (`vg`) files, but we can classify those as "nice to haves" for the moment 🙂
As for the "indexes" in the `gfa` file, I'm afraid that I'm not sure what you mean. Do you mean the segment/node identification number? (the numerical middle column in the lines beginning with `S`)
S 33758384 TAGCATTACATATTATACTGAAATGTCAAACGTAAAACAATACATCATTCAATTTATCCTTTAATAATAAGAAATAAACACATTGACTTGGGGTAGGAACACTATCATGATATTATTAACGTAAAAGTTCCTTAATATTGTCATGTACAGTACCATATAATCTTGCCATTTGCTTAAACAGATGC
S 51412402 TACAGTACCATATAATCTTGCCATTTGCTTAAACAGATGCTATTTT
L 33758383 + 51412402 + 0M
L 33758384 + 51412402 + 0M
S 10151683 ACCATATAATCTTGCCATTTGCTTAAACAGATGCTATTTTAAAATATTCAATACTTTCACAGATTCCACATTAGAA
L 51412402 + 10151683 + 0M
S 5976133 TTTTAAAATATTCAATACTTTCACAGATTCCACATTAGAACAT
L 10151683 + 5976133 + 0M
L 10151684 + 5976133 + 0M
S 21381844 TAAAATATTCAATACTTTCACAGATTCCACATTAGAACATATGTCATACCACTATGTTGAATCTGATAACCCAGAAGCA
The gfa will not be so large if the sequence data is kept in a separate compressed fasta, which is easy to implement, but I'm not sure vg or other tools actually support this. From a quick look at vg's gfa loading code, it doesn't appear to be supported. Without something like this, there is no way the gfa file size is going to be competitive with dbg, for example.
What was the size of the gfa you were trying to load into vg? If vg was using orders of magnitude more memory than the file itself, that would sound like an issue with their gfa loading. I will try to reproduce on my end, but it will take some time, since I want to build vg and troubleshoot in more detail rather than just running their precompiled binaries.
We observed this with the YPRP yeast strains, as well as some public maize genomes. The `dbg`, `gfa`, and `vg` files from the 12 yeast genomes are respectively 55, 178, and 46 MB. `metagraph` used 129 MB of memory to convert the `dbg` to `gfa`, while `vg view` used 4.45 GB to convert that `gfa` to `vg`.

This caused us to wonder if it might be possible to simultaneously skip both the large intermediate `gfa` file and the expensive `vg view` conversion step by converting directly from `dbg` to `vg` using `metagraph`.
I'm happy to share the data files, if that would be helpful.
@trgibbons, could you please quickly clarify your exact use case? What is the input to building the graphs? Raw sequencing data, draft assemblies, contigs? Depending on that context, what information should be encoded in the paths?
Thanks!
As a starting point, we are focused on comparing pseudochromosome-quality reference genomes. The paths should be the pseudochromosome sequences.
From what I can tell, we are doing exactly what was demonstrated in the Example from the README:
DATA="../tests/data/transcripts_1000.fa"
./metagraph build -k 12 -o transcripts_1000 $DATA
./metagraph annotate -i transcripts_1000 --anno-filename -o transcripts_1000 $DATA
./metagraph query -i transcripts_1000 -a transcripts_1000.column.annodbg $DATA
./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000
except that we are using the 12 YPRP yeast genomes instead of the transcripts_1000.fa sequences.
I have so far been unable to inspect the `dbg` or `column.annodbg` files, but I assume that the `column.annodbg` files contain the path information for the sequences used to create the `dbg`, and am hoping that these files can be combined to generate `gfa` and/or `vg` files that contain path information for the input sequences.
@trgibbons Conceptually the annotation is a 2-D binary matrix, kmer x annotation_label, where kmer is over the space of all kmers stored in the de Bruijn graph, and annotation labels are chosen by the user. So each annotation label (column) of the matrix implies a subset of kmers (rows)-- those rows marked with 1 as opposed to 0 in that column-- within the overall de Bruijn graph. Annotating a sequence in the graph just finds the matching kmers and marks those rows 1.
In the general case, one such set of rows can match an entirely arbitrary subset of the kmers/rows in the overall de Bruijn graph.
If the annotated sequence was also used when building the graph, as in your example, then all the kmers will match, and the annotated rows (marked as 1 in a column) will correspond to a contiguous part of the overall de Bruijn graph. But even then it could contain loops for example, meaning that even in this case we cannot recover the original sequence or path information from graph+annotation alone. Additional path information would need to be stored, which we are not doing.
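A toy model may make the lossiness described above concrete. Here Python sets stand in for the compressed bit columns, and the class and method names are illustrative only, not metagraph's actual API:

```python
def kmers(seq, k):
    """The set of k-mers of a sequence (order discarded)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

class ToyAnnotatedDBG:
    """Toy model of the annotation: rows = k-mers of the graph,
    columns = labels; each column is the set of rows marked 1."""
    def __init__(self, graph_seqs, k):
        self.k = k
        self.rows = set()                 # all k-mers in the graph
        for s in graph_seqs:
            self.rows |= kmers(s, k)
        self.columns = {}                 # label -> subset of rows

    def annotate(self, label, seq):
        # Only k-mers already present in the graph get marked 1;
        # k-mers absent from the graph are silently dropped.
        self.columns[label] = kmers(seq, self.k) & self.rows
```

As the `kmers` helper makes explicit, a column records only *membership* of k-mers, not their order, which is exactly why the original path cannot be recovered from graph+annotation alone.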
Note also that one can annotate a graph with sequences whose kmers are not all present in the graph (unlike your example). As in, `metagraph build` with one file or set of files, and `metagraph annotate` with another file or file(s). In this case the annotated rows could be arbitrarily dispersed in the overall de Bruijn graph, depending only on which kmers matched. Same as before there is no path information, but now also the annotated portion of the graph could be disconnected.
On to the selection of annotation labels-- I said these are chosen by the user. Some options can be seen in `metagraph annotate -h`. But `--anno-filename` (as in the example you have referenced) will create one annotation column per file passed on the command line (you can pass multiple), with the label for that column being the filename. If `$DATA` is multiple files then you will get multiple columns. The other major option is `--anno-header`, where you will get a new label and column for each sequence based on its name in the FASTA.
You can inspect the columns/labels using something like:
./metagraph stats --print-col-names -a test_graph.column.annodbg test_graph.dbg
Or you can omit `--print-col-names` and just see a summary of the annotation matrix.
Anyways, back to the question about path information and generating gfa/vg-- in the current metagraph, I think the best that can be generated from an existing graph+annotation is to dump contigs for each column in the annotation and call those paths. While this could retain a little more path information than unitigs from the full graph alone, it is still very lossy and also depends on how many sequences were fed to each annotation column.
Mentioned at the end of https://github.com/ratschlab/projects2014-metagenome/issues/47#issuecomment-541785010 is some functionality we could add pretty easily for generating path information in gfa/vg/other. But it would be another distinct processing step, just for generating files with path information. In this case the annotation matrix would not be used/needed. It would look like:
Still, the path information would not be encoded in our dbg or annotation files, it would just dump to gfa/vg. And in the naive case we would just include one path for every input sequence, rather than somehow deduplicating overlapping/consistent paths-- point being that even for this proposed functionality it still might not be quite what you expect/want; I'm not sure.
@karasikov We have briefly discussed this in our last meeting. From the description above, I still think that the paths could be encoded using the machinery that Jan developed during his masters. What do you think?
Well, if GFA with paths is all we need, I don't see the point of obfuscating the problem.
The point is that for metagraph to be more useful to those studying collections of reference genomes, there needs to be a way to store paths corresponding to reference chromosomes. Storing all of these paths as a large GFA is inefficient.
I think a better method would be to have all paths encoded succinctly using the method Jan developed, then provide the option to export the graph with a user-chosen subset of these paths as a GFA file.
But since his work isn't "production ready" yet, the quick solution for now would be to have a mode which maps a FASTA file to the graph and outputs the resulting paths in the GFA format.
The point is that for metagraph to be more useful to those studying collections of reference genomes, there needs to be a way to store paths corresponding to reference chromosomes. Storing all of these paths as a large GFA is inefficient.
Please tell me why it is inefficient.
To clarify, that stuff was designed for other purposes.
As far as I understood, this thread is about using Metagraph for building GFA files. This problem can be easily and more efficiently solved by simple mapping. Am I missing something?
It would depend on what's being stored. If a large collection of chromosomes that share similar blocks is being stored, then there would be lots of redundancy in the GFA if they're stored as paths.
The GFA standard does support storing these as variant calls, which would be even more efficient, but this opens its own can of worms if the inputs are reference chromosomes because one would have to compute an MSA to get this.
To clarify, that stuff was designed for other purposes.
As far as I understood, this thread is about using Metagraph for building GFA files. This problem can be easily and more efficiently solved by simple mapping. Am I missing something?
The points I'm bringing up are ideas for future work that would supersede this particular issue. We could solve this one with the simple mapping and open a new feature request.
Ok, I see the point that aligning back a given fasta to the graph to extract the path for GFA is the most immediate solution. Let's keep the construction, permanent representation and export of paths as an important feature to be developed.
Thanks to @chrisbarber for the detailed explanation and to everyone for taking the time to discuss this. I think all of the important points have already been made. I just want to add that the performance of `metagraph query` might not scale to hundreds or thousands of large genomes, but it's certainly good enough for us for now. The GFA intermediate is a much more restrictive bottleneck.
Here are more detailed performance statistics from our tests creating a single graph from five complete maize genomes. We ran the jobs on an older, smaller node. We gave them 40 threads, but most of the time they only used a handful of these. The memory usage is based on the maximum values I happened to observe in HTOP over the course of several hours. I did not closely watch HTOP the entire time, so it's possible that some steps used more memory.
Timing

k | Build | Annotate | Query |
---|---|---|---|
21 | 12m | 82m | 37m |
31 | 25m | 80m | 35m |
41 | 26m | 79m | 32m |
51 | 56m | 77m | 33m |
Memory

k | Build | Annotate | Query |
---|---|---|---|
21 | 80G | 90G | 105G |
31 | 150G | 90G | 200G |
41† | 80G | 40G | 200G |
51 | 300G | 95G | 220G |
† These are the values I observed, but I suspect they grew higher when I wasn't watching
Disk space

k | .dbg | .edgemask |
---|---|---|
21 | 1.3G | 188M |
31 | 1.8G | 272M |
41 | 2.3G | 347M |
51 | 2.8G | 413M |
Hi @trgibbons, I was just thinking I should give you an update (or "non-update") on the matter of metagraph GFA -> vg conversion:
I am still thinking that the gfa output by metagraph is okay. The segment ids that @karasikov mentioned (https://github.com/ratschlab/projects2014-metagenome/issues/47#issuecomment-541785010) are not as short as they could be, but the memory usage by vg that you were seeing (https://github.com/ratschlab/projects2014-metagenome/issues/47#issuecomment-542199393) is about 40x the size of the gfa file.
My inclination is to report this issue to vg, but I thought of one thing to try first. From taking a look at what I believe to be vg's loading code (https://github.com/vgteam/vg/blob/65cef16db927008bfbee50c072907c25a3853bf1/src/gfa.cpp#L143-L1075), it looks like they use https://github.com/edawson/gfakluge for loading. Something relatively easy to do would be to try gfakluge command line utilities with gfa's output by metagraph, and see how the memory usage looks. This information might help vg to troubleshoot, or perhaps it could reveal whether gfakluge itself is already using excess memory for a given gfa, in which case the issue could be reported directly to them.
I have gone ahead and done this with a small (~1MB) gfa file. I installed gfakluge via linuxbrew (`brew install brewsci/bio/gfakluge`), and generated this example gfa using metagraph via
./metagraph build -k 19 --graph succinct -o test_graph ../tests/data/transcripts_100*
./metagraph assemble -o test_graph --unitigs --to-gfa test_graph.dbg
Then I ran `gfak merge`, which among the various gfakluge commands appears to be one that actually loads the whole graph into memory (and it makes sense that it would need to, as opposed to some of the other commands), e.g. by "merging" just one gfa, or by merging the same gfa with itself:
\time -v gfak merge test_graph.gfa
\time -v gfak merge test_graph.gfa test_graph.gfa
In both cases with this ~1MB gfa, I get around 16MB peak memory. With such a small file, though, it is not entirely clear, so I would like to repeat the experiment with your gfa file. Do you have a way to share one of your large (e.g. 178MB) metagraph gfa files with me, so I can try this again? Or if it's easier, you could repeat the above steps (install gfakluge, run the merge command).
This information I think would be a good start in knowing if/where to report the issue upstream, and nice information for whoever ends up taking it further.
Also, just to reiterate regarding reducing the large gfa file size on disk: it is easy for me to implement a gfa output mode where the sequence data is stored in external compressed fasta, which should help a lot. From a quick look, it appears that gfakluge supports this (e.g. https://github.com/edawson/gfakluge/blob/master/examples.md#replacing-sequence-placeholders-with-fasta-records), which may indicate that vg likewise can load these, but I haven't confirmed this.
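The proposed output mode (sequence data moved to a side FASTA, with the GFA `*` placeholder left in the S records) could be sketched as a post-processing step like the following; whether vg accepts such files is, as noted above, unconfirmed:

```python
def externalize_sequences(records):
    """Replace each S record's sequence with the '*' placeholder and
    collect the sequences as FASTA records keyed by segment id.
    Returns (gfa_records, fasta_records); tab-separated GFA1 assumed."""
    gfa, fasta = [], []
    for rec in records:
        fields = rec.rstrip("\n").split("\t")
        if fields[0] == "S" and fields[2] != "*":
            fasta.append(f">{fields[1]}\n{fields[2]}")
            fields[2] = "*"
        gfa.append("\t".join(fields))
    return gfa, fasta
```

The companion FASTA could then be compressed separately, which is where the bulk of the file-size saving would come from, since the S records dominate metagraph's GFA output.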
Thanks to @chrisbarber for the detailed explanation and to everyone for taking the time to discuss this. I think all of the important points have already been made. I just want to add that the performance of `metagraph query` might not scale to hundreds or thousands of large genomes, but it's certainly good enough for us for now. The GFA intermediate is a much more restrictive bottleneck.

Here are more detailed performance statistics from our tests creating a single graph from five complete maize genomes. We ran the jobs on an older, smaller node. We gave them 40 threads, but most of the time they only used a handful of these. The memory usage is based on the maximum values I happened to observe in HTOP over the course of several hours. I did not closely watch HTOP the entire time, so it's possible that some steps used more memory.
Timing
k | Build | Annotate | Query |
---|---|---|---|
21 | 12m | 82m | 37m |
31 | 25m | 80m | 35m |
41 | 26m | 79m | 32m |
51 | 56m | 77m | 33m |
Memory
k | Build | Annotate | Query |
---|---|---|---|
21 | 80G | 90G | 105G |
31 | 150G | 90G | 200G |
41† | 80G | 40G | 200G |
51 | 300G | 95G | 220G |

† These are the values I observed, but I suspect they grew higher when I wasn't watching
Disk space
k | .dbg | .edgemask |
---|---|---|
21 | 1.3G | 188M |
31 | 1.8G | 272M |
41 | 2.3G | 347M |
51 | 2.8G | 413M |
Hi Ted,
It would be interesting to see the stats of the databases.
./metagraph stats graph.dbg
./metagraph stats -a annotation.-.annodbg
Could you please show these?
What data do you use for querying? Is it the same 5 genomes that you indexed in graphs?
@chrisbarber Here's a temporary Dropbox link to the 178 MB GFA file for the 12 YPRP yeast genomes:
https://www.dropbox.com/s/h0yccs7j37its8d/yprp_metagraph.gfa.gz?dl=0
@karasikov I started by copying the Example commands directly out of the README.md: https://github.com/ratschlab/projects2014-metagenome#example
I then increased `-k` and `-p`, and redirected `DATA="../tests/data/transcripts_1000.fa"` to one of two files I'm using for testing. The smaller file contains the 12 YPRP yeast genomes. The larger file contains 5 maize genomes (B73 + 4x Flint lines).
Looking at the statistics for the annotations (see below), I'm wondering if it would be better in this case to `annotate` the `.dbg` files using separate FASTA files for each genome, rather than the combined FASTA file used to `build` the graph. I also just noticed that the `.column.annodbg` files are all 352 bytes for maize, regardless of *k*-mer size, and only slightly larger than the 306-byte `.column.annodbg` file created for yeast.
Statistics for graph 10_yeast/10_YPRP/yprp_metagraph.dbg
====================== GRAPH STATS =====================
k: 41
nodes (k): 53059818
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 41
nodes (k-1): 52549133
edges ( k ): 53062693
state: fast
W stats: {'$': 102, 'A': 16425047, 'C': 10081878, 'G': 10114934, 'T': 16440732}
F stats: {'$': 5, 'A': 16420969, 'C': 10106727, 'G': 10096602, 'T': 16438390}
========================================================
Statistics for annotation 10_yeast/10_YPRP/yprp_metagraph.column.annodbg
=================== ANNOTATION STATS ===================
labels: 1
objects: 53059818
density: 1.000000e+00
representation: column
========================================================
Statistics for graph 20_maize/10_maize_five/k21/Zm_B73+Flint_pseudochromosomes_metagraph_k21.dbg
====================== GRAPH STATS =====================
k: 21
nodes (k): 1219948871
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 21
nodes (k-1): 1159980281
edges ( k ): 1220038973
state: fast
W stats: {'$': 7090, 'A': 329729111, 'C': 276264090, 'G': 281211323, 'T': 332827359}
F stats: {'$': 5, 'A': 329564341, 'C': 277138405, 'G': 281026823, 'T': 332309399}
========================================================
Statistics for annotation 20_maize/10_maize_five/k21/Zm_B73+Flint_pseudochromosomes_metagraph_k21.column.annodbg
=================== ANNOTATION STATS ===================
labels: 1
objects: 1219948871
density: 1.000000e+00
representation: column
========================================================
Statistics for graph 20_maize/10_maize_five/k31/Zm_B73+Flint_pseudochromosomes_metagraph_k31.dbg
====================== GRAPH STATS =====================
k: 31
nodes (k): 1761834791
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 31
nodes (k-1): 1710657378
edges ( k ): 1762139673
state: fast
W stats: {'$': 15870, 'A': 473500941, 'C': 403077400, 'G': 408838160, 'T': 476707302}
F stats: {'$': 5, 'A': 473376189, 'C': 404093110, 'G': 408271155, 'T': 476399214}
========================================================
Statistics for annotation 20_maize/10_maize_five/k31/Zm_B73+Flint_pseudochromosomes_metagraph_k31.column.annodbg
=================== ANNOTATION STATS ===================
labels: 1
objects: 1761834791
density: 1.000000e+00
representation: column
========================================================
Statistics for graph 20_maize/10_maize_five/k41/Zm_B73+Flint_pseudochromosomes_metagraph_k41.dbg
====================== GRAPH STATS =====================
k: 41
nodes (k): 2247469043
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 41
nodes (k-1): 2201842265
edges ( k ): 2248059202
state: fast
W stats: {'$': 20968, 'A': 601320298, 'C': 518268216, 'G': 524158892, 'T': 604290828}
F stats: {'$': 5, 'A': 600903365, 'C': 519621122, 'G': 523656742, 'T': 603877968}
========================================================
Statistics for annotation 20_maize/10_maize_five/k41/Zm_B73+Flint_pseudochromosomes_metagraph_k41.column.annodbg
=================== ANNOTATION STATS ===================
labels: 1
objects: 2247469043
density: 1.000000e+00
representation: column
========================================================
Statistics for graph 20_maize/10_maize_five/k51/Zm_B73+Flint_pseudochromosomes_metagraph_k51.dbg
====================== GRAPH STATS =====================
k: 51
nodes (k): 2680522906
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 51
nodes (k-1): 2640373893
edges ( k ): 2681443628
state: fast
W stats: {'$': 24296, 'A': 714622497, 'C': 621945712, 'G': 627287376, 'T': 717563747}
F stats: {'$': 5, 'A': 714269757, 'C': 623066425, 'G': 627044322, 'T': 717063119}
========================================================
Statistics for annotation 20_maize/10_maize_five/k51/Zm_B73+Flint_pseudochromosomes_metagraph_k51.column.annodbg
=================== ANNOTATION STATS ===================
labels: 1
objects: 2680522906
density: 1.000000e+00
representation: column
========================================================
@karasikov I started by copying the Example commands directly out of the README.md: https://github.com/ratschlab/projects2014-metagenome#example
I then increased `-k` and `-p`, and redirected `DATA="../tests/data/transcripts_1000.fa"` to one of two files I'm using for testing. The smaller file contains the 12 YPRP yeast genomes. The larger file contains 5 maize genomes (B73 + 4x Flint lines). Looking at the statistics for the annotations (see below), I'm wondering if it would be better in this case to `annotate` the `.dbg` files using separate FASTA files for each genome, rather than the combined FASTA file used to `build` the graph. I also just noticed that the `.column.annodbg` files are all 352 bytes for maize, regardless of *k*-mer size, and only slightly larger than the 306-byte `.column.annodbg` file created for yeast. ...
It looks like you annotated all k-mers in the graph with the same label. This is why it's easy to compress the annotation to less than 1 KB in size.
You should pass the flag `--anno-header` when annotating the graph with multiple genomes stored in a single fasta file like:
>genome1
ACGACACTAGTC...
>genome2
CATGACGATGCA...

Alternatively, you can split the genomes into multiple fasta files and build annotations from these with the flag `--anno-filename`.
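Splitting a combined FASTA into one record per genome (so that `--anno-filename` yields one label per genome) takes only a few lines; this sketch groups records by the first word of each header, and each group could then be written to its own file:

```python
def split_fasta(lines):
    """Group a multi-record FASTA into {record_name: record_text}, so
    each genome can be written to its own file for --anno-filename."""
    records, name = {}, None
    for line in lines:
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header
            records[name] = []
        elif name is None:
            continue                     # ignore junk before the first header
        records[name].append(line.rstrip("\n"))
    return {n: "\n".join(body) for n, body in records.items()}
```

Each value could then be written to `<name>.fa` and the resulting files passed together on the `metagraph annotate --anno-filename` command line.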
We have faster graph representations (e.g., `--graph hash`), but they use way more memory.
Are you querying the same sequences that you used for graph construction (i.e., all k-mers from the queried sequence are indexed in the graph) or something with k-mers not indexed in the graph?
The succinct graph representation is slower for k-mer misses, but we are currently working on improving this. For the case where all k-mers from the query are in the graph, however, it's hard to speed up queries further. This is the cost we have to pay for storing k-mers in a compressed data structure.
Thanks @trgibbons for providing the gfa. I don't think my experiment has clarified anything, so I will have to think what to do next, but here are the results nonetheless:
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 10000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
User time (seconds): 7.99
Maximum resident set size (kbytes): 12364
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 20000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
User time (seconds): 32.86
Maximum resident set size (kbytes): 22976
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 30000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
User time (seconds): 83.73
Maximum resident set size (kbytes): 33588
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 40000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
User time (seconds): 163.39
Maximum resident set size (kbytes): 44176
[barberc@lm-a2-043 build]$ \time -v gfak merge <(head -n 50000 ~/yprp_metagraph.gfa) 2>&1 | egrep "User|Maximum"
User time (seconds): 267.75
Maximum resident set size (kbytes): 54796
In contrast to what might have been expected, the runtimes are super-linear but the memory seems fine.
What data do you use for querying? Is it the same 5 genomes that you indexed in graphs?
Are you querying the same sequences that you used for graph construction (i.e., all k-mers from the queried sequence are indexed in graph) or something with k-mers not indexed in graph?
For each set of analyses (yeast or maize), I have been putting all of my sequences of interest into a single FASTA file, storing the path to that file in a `$SEQS` variable, and then passing that variable to every step (`build`, `annotate` & `query`), because that is how it is shown in the Example in the documentation: https://github.com/ratschlab/projects2014-metagenome#example
Thanks for looking into this and sharing your results. The performance of the GFA --> vg conversion is, for us, secondary to the more fundamental problem of Metagraph currently not encoding path data. I worry now that I unintentionally diverted this conversation away from our primary goal.
The ultimate goal of our group is to create and visualize genome variation graphs for fully-assembled, large, complex genomes. We have received a grant to work on genome graph visualization, but have gotten stuck trying to create the graphs we want to visualize. The other tools we've tried can either handle small (< 1 Gbp) complex genomes, or large datasets from human resequencing projects that narrowly focus on SNPs and other small variants, but cannot handle large (> 1 Gbp) complex genomes.
Part of our team works on agricultural (mostly polyploid plant) genomics. The other part works on human cancer genomics. It is therefore necessary for us to identify large, complex, and often nested structural variants, in addition to SNPs, CNVs, and other small variants.
We are happy to use any tools or intermediate files necessary to obtain these genome graphs, so long as they:
- Can handle very large and very complex genomes
- Retain the path information for these genomes
- Can be imported into vg, which already supports a lot of the functionality we rely on for scalable visualization
@karasikov
Alternatively, you can split the genomes into multiple fasta files and build annotations from these with flag --anno-filename.
I realized only after reading this that it's even possible to provide the input sequences in separate files. The commands in the `metagraph` documentation used only a single FASTA file containing all sequences of interest, so I assumed that `metagraph` was like `seqwish` in requiring all of the sequences to be combined into a single FASTA file.
If this is not the correct/preferred way to create a graph (with path information), then it would be helpful if the README.md were updated to reflect best practices.
I have now re-run the `metagraph` pipeline with separate and consolidated sequences, using the `--anno-filename` flag with `annotate` each time.
The `.dbg`, `.column.annodbg`, and `.edgemask` files are respectively 42 MB, 66 MB, and 6.3 MB.
Statistics for graph yprp_metagraph.dbg
====================== GRAPH STATS =====================
k: 21
nodes (k): 40368796
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 21
nodes (k-1): 39526212
edges ( k ): 40369120
state: fast
W stats: {'$': 21, 'A': 12467772, 'C': 7694062, 'G': 7708496, 'T': 12498769}
F stats: {'$': 5, 'A': 12468867, 'C': 7703697, 'G': 7709491, 'T': 12487060}
========================================================
Statistics for annotation yprp_metagraph.column.annodbg
=================== ANNOTATION STATS ===================
labels: 12
objects: 40368796
density: 2.871733e-01
representation: column
========================================================
The `.dbg`, `.column.annodbg`, and `.edgemask` files are respectively 42 MB, 306 B, and 6.3 MB.
Statistics for graph yprp_metagraph.dbg
====================== GRAPH STATS =====================
k: 21
nodes (k): 40368796
canonical mode: no
========================================================
====================== BOSS STATS ======================
k: 21
nodes (k-1): 39526212
edges ( k ): 40369120
state: fast
W stats: {'$': 21, 'A': 12467772, 'C': 7694062, 'G': 7708496, 'T': 12498769}
F stats: {'$': 5, 'A': 12468867, 'C': 7703697, 'G': 7709491, 'T': 12487060}
========================================================
Statistics for annotation yprp_metagraph.column.annodbg
=================== ANNOTATION STATS ===================
labels: 1
objects: 40368796
density: 1.000000e+00
representation: column
========================================================
Do the results look any better after separating the input sequences?
Hi @trgibbons
Do the results look any better after separating the input sequences?
The graph is the same, which is expected as you constructed it from the same set of sequences. The annotation, however, did change. Now you indeed have 12 labels that mark the relation between the genomes and their k-mers.
If this is not the correct/preferred way to create a graph (with path information), then it would be helpful if the README.md was updated to reflect the best practices.
Thanks for the suggestion, we will indeed improve the readme as well as describe the usage and the most common use cases in wiki pages.
We are happy to use any tools or intermediate files necessary to obtain these genome graphs, so long as they:
- Can handle very large and very complex genomes
- Retain the path information for these genomes
- Can be imported into vg, which already supports a lot of the functionality we rely on for scalable visualization
I believe we can handle large genomes.
@chrisbarber, would it be possible to implement 3. in the near future (that is, add support for aligning sequences to the graph and extending the GFA file with paths/walks)?
We are happy to use any tools or intermediate files necessary to obtain these genome graphs, so long as they:
- Can handle very large and very complex genomes
- Retain the path information for these genomes
- Can be imported into vg, which already supports a lot of the functionality we rely on for scalable visualization
I believe we can handle large genomes.
Exactly 🙂 There are several tools capable of the latter two points, but the first one is extraordinarily difficult. Metagraph is the first tool I've seen that addresses the first point, which is why I'm harassing you all the most 🙂
@chrisbarber Thanks for looking into this and sharing your results. The performance of the GFA --> vg conversion is, for us, secondary to the more fundamental problem of Metagraph currently not encoding path data. I worry now that I unintentionally diverted this conversation away from our primary goal.
@trgibbons Good to know that you will not necessarily be blocked by the GFA --> vg performance issue. I am pretty confident now that it is not a problem with the metagraph GFA layout (which is quite simple to begin with). I'm happy to be looped-in if you report an issue to them.
@chrisbarber, would it be possible to implement 3. in the nearest future (that is, add support for aligning sequences to graph and extend the GFA file with paths/walks)?
@karasikov Yes, I am glad to do this and I think it will be relatively straightforward. Opened #71 to track.
We've been doing some tests creating graphs from yeast and maize genomes using metagraph, minigraph, and seqwish. So far, metagraph is the only tool to successfully create a graph from maize sequences, but we aren't sure how to inspect or use the output files.
We would greatly appreciate support for one or more standard formats 🙂