Implement GBZ database backend

adamnovak commented 9 months ago

@jltsiren has a version of the GBZ that is actually a database (meaning it supports fast random access, potentially by URL), and he has code that can query coordinates and dump a subgraph and the local paths, with which one was the reference indicated.

We should add this as a backend for the tube map. Instead of vg chunk, we could call Jouni's implementation when we see our graph is a GBZ database. If the database is a URL, we should use Jouni's code against that URL, rather than downloading it to the server and running against it locally.

Ideally, we would do the extraction client-side in WebAssembly, but to start we might have to run Jouni's tool on the server, until we can get a WASM build of Jouni's implementation.

jltsiren commented 9 months ago

The current prototype is a separate repository, but I'm eventually going to make it an optional feature in the Rust GBWT.

jltsiren commented 9 months ago

If you want to try it, the prototype is starting to be somewhat functional. Things can change without warning, so you may have to update gbz-base and its dependencies and rebuild the database frequently. You can do that with the following commands, and even 8 GB memory should be enough with a HPRC v1.1 graph:

git pull
cargo update
cargo build --release
target/release/gbz2db --overwrite graph.gbz

You can query the database like this:

target/release/query --sample GRCh38 --contig chr1 --interval 1234500..1234600 \
    --context 100 --distinct graph.gbz.db > out.gfa

Only generic and reference paths are indexed for querying. You can replace --interval with --offset if necessary. Sequence offsets are 0-based and intervals are semiopen. All haplotypes except the queried one will have sample name unknown. Option --distinct outputs only distinct haplotypes and stores the weight (number of duplicates) using tag WT.

The above command should output something like this:

H       VN:Z:1.1        RS:Z:GRCh38
S       49698   TGCAGGTCACTGACCCACTGACCACAGATCACCACTCTCTCCCGCCCCCGTTTCTTCCTCCTGAGACCACTGCTGCCTTCAAA
S       49699   G
S       49700   T
S       49701   GTCCAAGAACAAACTGTGCACACACGGGCGGTCAGTCCTGGATTCAAGGTGGCCCATCGGAGCTCGAACTCAACGGCCTGACCTAGGGACGCCTCCAGTCCCACCCCTACCTTTTGGAGGTGGGAGAATCACC
S       49702   C
S       49703   T
S       49704   GGGTCCCCAGCCCTCGATCCTCACGAGTCCT
S       49705   G
S       49706   A
S       49707   TTCCTCAAGCAATTCCAAACTCCAGAGAAGACATCCCAGGTAAAATCGGATGGCAGGGACCTGGGACACCGCTGTGACGGGCAGCCCCTGGGCTCGAGTCCACAACTCCCACGGGAGGGCAGGGCCCCATCCTGCTCACACC
L       49698   +       49699   +       *
L       49698   +       49700   -       *
L       49699   +       49701   +       *
L       49700   -       49701   +       *
L       49701   +       49702   +       *
L       49701   +       49703   +       *
L       49702   +       49704   +       *
L       49703   +       49704   +       *
L       49704   +       49705   +       *
L       49704   +       49706   +       *
L       49705   +       49707   +       *
L       49706   +       49707   +       *
W       GRCh38  0       chr1    1234375 1234767 >49698>49699>49701>49702>49704>49705>49707      WT:i:79
W       unknown 1       chr1    0       392     >49698>49699>49701>49702>49704>49706>49707      WT:i:5
W       unknown 2       chr1    0       392     >49698>49699>49701>49703>49704>49705>49707      WT:i:5
W       unknown 3       chr1    0       392     >49698<49700>49701>49702>49704>49705>49707      WT:i:1

vgteam / sequenceTubeMap

Implement GBZ database backend #367