jermp / sshash

A compressed, associative, exact, and weighted dictionary for k-mers.
MIT License
83 stars 17 forks source link

Support larger k: up to 63 using __uint128_t #21

Closed jermp closed 1 year ago

jermp commented 1 year ago

We have already tested a prototype here: https://github.com/jermp/sshash/blob/statistics/include/util.hpp.

jermp commented 1 year ago

Now being under development in dev branch.

Some results on an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz.

uint64_t

./sshash build -i ../data/unitigs_stitched/se.ust.k31.fa.gz -k 31 -m 13 --verbose --bench -d tmp_dir
4.76524 [bits/kmer]
avg_nanosec_per_positive_lookup 266.929
avg_nanosec_per_negative_lookup 348.849
avg_nanosec_per_positive_lookup_advanced 268.053
avg_nanosec_per_negative_lookup_advanced 355.228
avg_nanosec_per_access 89.8868
iterator: avg_nanosec_per_kmer 20.672

__uint128_t

./sshash build -i ../data/unitigs_stitched/se.ust.k31.fa.gz -k 31 -m 13 --verbose --bench -d tmp_dir
4.76524 [bits/kmer]
avg_nanosec_per_positive_lookup 311.278
avg_nanosec_per_negative_lookup 405.819
avg_nanosec_per_positive_lookup_advanced 314.951
avg_nanosec_per_negative_lookup_advanced 406.78
avg_nanosec_per_access 93.2151
iterator: avg_nanosec_per_kmer 20.7024

./sshash build -i ../data/unitigs_stitched/se.ust.k47.fa.gz -k 47 -m 13 --verbose --bench -d tmp_dir
3.5449 [bits/kmer]
avg_nanosec_per_positive_lookup 384.655
avg_nanosec_per_negative_lookup 517.015
avg_nanosec_per_positive_lookup_advanced 388.076
avg_nanosec_per_negative_lookup_advanced 516.857
avg_nanosec_per_access 93.8862
iterator: avg_nanosec_per_kmer 20.5568

./sshash build -i ../data/unitigs_stitched/se.ust.k63.fa.gz -k 63 -m 13 --verbose --bench -d tmp_dir
3.07407 [bits/kmer]
avg_nanosec_per_positive_lookup 528.494
avg_nanosec_per_negative_lookup 776.829
avg_nanosec_per_positive_lookup_advanced 529.757
avg_nanosec_per_negative_lookup_advanced 777.09
avg_nanosec_per_access 100.48
iterator: avg_nanosec_per_kmer 19.76
jermp commented 1 year ago

More benchmarks for the pre-processed datasets from here https://zenodo.org/record/7239205#.Y-AM4-zMI-Q (datasets used in the LPHash paper).

Same processor use above: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz.

Build the datasets (regular index version).

#!/bin/bash

./sshash build -i /data2/DNA/lphash_datasets/celegans.k31.unitigs.fa.ust.fa.gz -k 31 -m 16 --verbose -o celegans.k31.sshash -d tmp_dir >> sshash.building_log
./sshash build -i /data2/DNA/lphash_datasets/celegans.k47.unitigs.fa.ust.fa.gz -k 47 -m 20 --verbose -o celegans.k47.sshash -d tmp_dir >> sshash.building_log
./sshash build -i /data2/DNA/lphash_datasets/celegans.k63.unitigs.fa.ust.fa.gz -k 63 -m 20 --verbose -o celegans.k63.sshash -d tmp_dir >> sshash.building_log

./sshash build -i /data2/DNA/lphash_datasets/cod.k31.unitigs.fa.ust.fa.gz -k 31 -m 20 --verbose -o cod.k31.sshash -d tmp_dir >> sshash.building_log
./sshash build -i /data2/DNA/lphash_datasets/cod.k47.unitigs.fa.ust.fa.gz -k 47 -m 22 --verbose -o cod.k47.sshash -d tmp_dir >> sshash.building_log
./sshash build -i /data2/DNA/lphash_datasets/cod.k63.unitigs.fa.ust.fa.gz -k 63 -m 24 --verbose -o cod.k63.sshash -d tmp_dir >> sshash.building_log

./sshash build -i /data2/DNA/lphash_datasets/kestrel.k31.unitigs.fa.ust.fa.gz -k 31 -m 20 --verbose -o kestrel.k31.sshash -d tmp_dir >> sshash.building_log
./sshash build -i /data2/DNA/lphash_datasets/kestrel.k47.unitigs.fa.ust.fa.gz -k 47 -m 22 --verbose -o kestrel.k47.sshash -d tmp_dir >> sshash.building_log
./sshash build -i /data2/DNA/lphash_datasets/kestrel.k63.unitigs.fa.ust.fa.gz -k 63 -m 24 --verbose -o kestrel.k63.sshash -d tmp_dir >> sshash.building_log

./sshash build -i /data2/DNA/lphash_datasets/human.k31.unitigs.fa.ust.fa.gz -k 31 -m 21 --verbose -o human.k31.sshash -d tmp_dir >> sshash.building_log
./sshash build -i /data2/DNA/lphash_datasets/human.k47.unitigs.fa.ust.fa.gz -k 47 -m 23 --verbose -o human.k47.sshash -d tmp_dir >> sshash.building_log
./sshash build -i /data2/DNA/lphash_datasets/human.k63.unitigs.fa.ust.fa.gz -k 63 -m 25 --verbose -o human.k63.sshash -d tmp_dir >> sshash.building_log

Benchmark random query time.

#!/bin/bash

./sshash bench -i celegans.k31.sshash >> sshash.benchmarking_log
./sshash bench -i celegans.k47.sshash >> sshash.benchmarking_log
./sshash bench -i celegans.k63.sshash >> sshash.benchmarking_log

./sshash bench -i cod.k31.sshash >> sshash.benchmarking_log
./sshash bench -i cod.k47.sshash >> sshash.benchmarking_log
./sshash bench -i cod.k63.sshash >> sshash.benchmarking_log

./sshash bench -i kestrel.k31.sshash >> sshash.benchmarking_log
./sshash bench -i kestrel.k47.sshash >> sshash.benchmarking_log
./sshash bench -i kestrel.k63.sshash >> sshash.benchmarking_log

./sshash bench -i human.k31.sshash >> sshash.benchmarking_log
./sshash bench -i human.k47.sshash >> sshash.benchmarking_log
./sshash bench -i human.k63.sshash >> sshash.benchmarking_log

Space and random query time results

C. Elegans:

k=31 -> total: 5.86104 [bits/kmer]
avg_nanosec_per_positive_lookup 672.56
avg_nanosec_per_negative_lookup 769.487
avg_nanosec_per_positive_lookup_advanced 673.869
avg_nanosec_per_negative_lookup_advanced 765.014
avg_nanosec_per_access 235.374
iterator: avg_nanosec_per_kmer 21.3172

k=47 -> total: 4.29059 [bits/kmer]
avg_nanosec_per_positive_lookup 632.762
avg_nanosec_per_negative_lookup 797.329
avg_nanosec_per_positive_lookup_advanced 632.278
avg_nanosec_per_negative_lookup_advanced 792.328
avg_nanosec_per_access 225.471
iterator: avg_nanosec_per_kmer 20.8365

k=63 -> total: 3.51322 [bits/kmer]
avg_nanosec_per_positive_lookup 773.315
avg_nanosec_per_negative_lookup 1003.97
avg_nanosec_per_positive_lookup_advanced 769.169
avg_nanosec_per_negative_lookup_advanced 999.04
avg_nanosec_per_access 222.741
iterator: avg_nanosec_per_kmer 20.7404

Cod:

k=31 -> total: 7.84112 [bits/kmer]
avg_nanosec_per_positive_lookup 898.067
avg_nanosec_per_negative_lookup 1092.76
avg_nanosec_per_positive_lookup_advanced 895.177
avg_nanosec_per_negative_lookup_advanced 1088.43
avg_nanosec_per_access 383.628
iterator: avg_nanosec_per_kmer 21.4225

k=47 -> total: 5.16982 [bits/kmer]
avg_nanosec_per_positive_lookup 966.791
avg_nanosec_per_negative_lookup 1112.06
avg_nanosec_per_positive_lookup_advanced 960.307
avg_nanosec_per_negative_lookup_advanced 1103.87
avg_nanosec_per_access 370.778
iterator: avg_nanosec_per_kmer 20.8347

k=63 -> total: 4.25746 [bits/kmer]
avg_nanosec_per_positive_lookup 1122.16
avg_nanosec_per_negative_lookup 1278.02
avg_nanosec_per_positive_lookup_advanced 1117.1
avg_nanosec_per_negative_lookup_advanced 1271.46
avg_nanosec_per_access 357.135
iterator: avg_nanosec_per_kmer 20.8644

Kestrel:

k=31 -> total: 7.53273 [bits/kmer]
avg_nanosec_per_positive_lookup 891.46
avg_nanosec_per_negative_lookup 1155.6
avg_nanosec_per_positive_lookup_advanced 886.603
avg_nanosec_per_negative_lookup_advanced 1151.08
avg_nanosec_per_access 337.217
iterator: avg_nanosec_per_kmer 21.2581

k=47 -> total: 4.6658 [bits/kmer]
avg_nanosec_per_positive_lookup 895.092
avg_nanosec_per_negative_lookup 1165.64
avg_nanosec_per_positive_lookup_advanced 887.133
avg_nanosec_per_negative_lookup_advanced 1160.92
avg_nanosec_per_access 298.793
iterator: avg_nanosec_per_kmer 20.7885

k=63 -> total: 3.76463 [bits/kmer]
avg_nanosec_per_positive_lookup 965.342
avg_nanosec_per_negative_lookup 1316.82
avg_nanosec_per_positive_lookup_advanced 961.407
avg_nanosec_per_negative_lookup_advanced 1314.21
avg_nanosec_per_access 301.807
iterator: avg_nanosec_per_kmer 20.7445

Human:

k=31 -> total: 8.69547 [bits/kmer]
avg_nanosec_per_positive_lookup 1278.75
avg_nanosec_per_negative_lookup 1521.29
avg_nanosec_per_positive_lookup_advanced 1275.41
avg_nanosec_per_negative_lookup_advanced 1515.07
avg_nanosec_per_access 530.692
iterator: avg_nanosec_per_kmer 21.3888

k=47 -> total: 5.64599 [bits/kmer]
avg_nanosec_per_positive_lookup 1321.27
avg_nanosec_per_negative_lookup 1493.59
avg_nanosec_per_positive_lookup_advanced 1314.92
avg_nanosec_per_negative_lookup_advanced 1487.37
avg_nanosec_per_access 452.39
iterator: avg_nanosec_per_kmer 20.9152

k=63 -> total: 4.6395 [bits/kmer]
avg_nanosec_per_positive_lookup 1408.48
avg_nanosec_per_negative_lookup 1597.89
avg_nanosec_per_positive_lookup_advanced 1403.96
avg_nanosec_per_negative_lookup_advanced 1593.5
avg_nanosec_per_access 427.162
iterator: avg_nanosec_per_kmer 20.8628
jermp commented 1 year ago

All benchmarks here: https://github.com/jermp/sshash/tree/5517a1030c63cdd8ced96d6012638c2094abcd57/benchmarks.

jermp commented 1 year ago

Done as of https://github.com/jermp/sshash/commit/5095ed06206b17dd4fe6ed420dc57cf7658cd744.