Zebrafish SAT assembly - Githubissues

@jt8-sanger how's it going with this?

Here's a possible approach.

Make a graph of the reference: vg construct -r ref.fa -m 32 >ref.vg

This will result in a single-path graph equivalent to the reference.

Here I'm using a node max size 32. GCSA2 and a number of other functions in vg are designed to work with nodes of a fixed maximum size (<1024 I believe). This should be done by default and will be soon: #250.

Now index it for mapping vg index -x ref.xg -g ref.gcsa -k 16 ref.vg

Now align your contigs against it. vg map only reads FASTQ and SAM/BAM/CRAM, or one read per line format (see #251). So convert your contigs into FASTQ. You can use a dummy value for the quality. As soon as I can merge the MEM-like mapper (later today, provided tests complete), I'd recommend this:

vg map -x ref.xg -g ref.gcsa -t 16 -L 10 -f contigs.fq.gz >aln.gam

The alignments can be incorporated back into the reference graph with vg mod.

vg mod -i aln.gam ref.vg >ref+contigs.vg

This may take a lot of memory. If you can, break the task down by reference chromosome. Doing this efficiently is hard, but essential to vg's use, so let me know if you can't do this in something like 100G of RAM. There are some approaches that can make the breakdown by chromosome easier. For instance, it's possible to sort and store the contigs by mapping position using rocksdb vg index -a for storing and vg index -A for dumping out, which can make them easier to work with. The interface for querying them isn't very good though. In the medium term, we should have a solution to make this step of the assembly completely efficient and transparent to the user. There is no reason it can't be except the inefficient way that VG stores the graph in RAM.

Now the resulting graph should contain both the reference and the assembled contigs. You can do various experiments to understand the properties of the assembly. Right now the most-available one is mapping. For instance, taking a new sample and mapping it, or tracing the reads you made the contigs from through the assembly. We also have techniques to deconstruct the graph back into a VCF file, and call variants in the graph, but I'm not sure about these. @adamnovak @glennhickey and @edawson would know better.

Notes about mapper:

The only non-default setting is the -L aka --min-mem-length. It prevents the use of very sort matches for seeding the alignment location. It may been to be default, but I'm not sure yet.

The MEM-like mapper I just implemented finds "forward-maximal exact matches" starting at the beginning of the read. When it encounters a mismatch against the GCSA2 index it steps forward 1 character so as to try to skip over SNPs. For long reads like contigs you should do fine.

However, this is likely to change soon as GCSA2 gains parent() support, which will let us efficiently get true bidirectional maximal exact matches, and in turn focus our efforts on the supramaximal exact matches (SMEMs), basically generalizing bwa mem to work on the graph.

vgteam / vg

Zebrafish SAT assembly #242