Open adamnovak opened 9 months ago
The current prototype is a separate repository, but I'm eventually going to make it an optional feature in the Rust GBWT.
If you want to try it, the prototype is starting to be somewhat functional. Things can change without warning, so you may have to update gbz-base and its dependencies and rebuild the database frequently. You can do that with the following commands, and even 8 GB memory should be enough with a HPRC v1.1 graph:
git pull
cargo update
cargo build --release
target/release/gbz2db --overwrite graph.gbz
You can query the database like this:
target/release/query --sample GRCh38 --contig chr1 --interval 1234500..1234600 \
--context 100 --distinct graph.gbz.db > out.gfa
Only generic and reference paths are indexed for querying. You can replace --interval
with --offset
if necessary. Sequence offsets are 0-based and intervals are semiopen. All haplotypes except the queried one will have sample name unknown
. Option --distinct
outputs only distinct haplotypes and stores the weight (number of duplicates) using tag WT
.
The above command should output something like this:
H VN:Z:1.1 RS:Z:GRCh38
S 49698 TGCAGGTCACTGACCCACTGACCACAGATCACCACTCTCTCCCGCCCCCGTTTCTTCCTCCTGAGACCACTGCTGCCTTCAAA
S 49699 G
S 49700 T
S 49701 GTCCAAGAACAAACTGTGCACACACGGGCGGTCAGTCCTGGATTCAAGGTGGCCCATCGGAGCTCGAACTCAACGGCCTGACCTAGGGACGCCTCCAGTCCCACCCCTACCTTTTGGAGGTGGGAGAATCACC
S 49702 C
S 49703 T
S 49704 GGGTCCCCAGCCCTCGATCCTCACGAGTCCT
S 49705 G
S 49706 A
S 49707 TTCCTCAAGCAATTCCAAACTCCAGAGAAGACATCCCAGGTAAAATCGGATGGCAGGGACCTGGGACACCGCTGTGACGGGCAGCCCCTGGGCTCGAGTCCACAACTCCCACGGGAGGGCAGGGCCCCATCCTGCTCACACC
L 49698 + 49699 + *
L 49698 + 49700 - *
L 49699 + 49701 + *
L 49700 - 49701 + *
L 49701 + 49702 + *
L 49701 + 49703 + *
L 49702 + 49704 + *
L 49703 + 49704 + *
L 49704 + 49705 + *
L 49704 + 49706 + *
L 49705 + 49707 + *
L 49706 + 49707 + *
W GRCh38 0 chr1 1234375 1234767 >49698>49699>49701>49702>49704>49705>49707 WT:i:79
W unknown 1 chr1 0 392 >49698>49699>49701>49702>49704>49706>49707 WT:i:5
W unknown 2 chr1 0 392 >49698>49699>49701>49703>49704>49705>49707 WT:i:5
W unknown 3 chr1 0 392 >49698<49700>49701>49702>49704>49705>49707 WT:i:1
@jltsiren has a version of the GBZ that is actually a database (meaning it supports fast random access, potentially by URL), and he has code that can query coordinates and dump a subgraph and the local paths, with which one was the reference indicated.
We should add this as a backend for the tube map. Instead of
vg chunk
, we could call Jouni's implementation when we see our graph is a GBZ database. If the database is a URL, we should use Jouni's code against that URL, rather than downloading it to the server and running against it locally.Ideally, we would do the extraction client-side in WebAssembly, but to start we might have to run Jouni's tool on the server, until we can get a WASM build of Jouni's implementation.