jermp / lphash

Fast and compact locality-preserving minimal perfect hashing for k-mer sets.

Building Time #21

Closed tbenavi1 closed 1 year ago

tbenavi1 commented 1 year ago

Hello,

The building time for me seems to be much higher than the results in Table 5 of the paper. What value of k corresponds to the results in Table 5? I downloaded both the k=31 and k=63 human files from Zenodo and ran lphash to build the database.

Specifically, I ran:

lphash build-p -i human.k63.unitigs.fa.ust.fa.gz -c 5 -k 63 -m 28 --check -o human_c5_k63_m28.lph --verbose -t 4

How long is this expected to take? Thank you.

jermp commented 1 year ago

Your command seems correct. Note that the "checking" time should not count towards the construction time: the benchmark in Table 5 of the paper does not include this time (which is only needed to verify the correctness of the function). Also, be sure you compiled and ran the code in Release mode.
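For example, to time just the construction you could drop --check from your command and wrap it with /usr/bin/time, e.g.:

/usr/bin/time lphash build-p -i human.k63.unitigs.fa.ust.fa.gz -c 5 -k 63 -m 28 -o human_c5_k63_m28.lph --verbose -t 4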

How long is this expected to take?

Approximately what we reported in Table 5. (The actual runtime obviously depends on the specific processor.)

Let me know!

tbenavi1 commented 1 year ago

It's been running for over 4 hours so far. The latest line in the log is:

2023-07-30 15:14:16: 1259379 buckets done in 8207 seconds (14.7426% of keys, 5% of buckets, 3.72143 trials per bucket, 3.72144 expected trials per bucket)

I also compiled in Release mode according to the instructions on GitHub. I'm running on a computing cluster, so the processors should be fast.

jermp commented 1 year ago

That's strange. I cannot replicate your issue on my end. I've just re-built an lphash (partitioned) function with

/usr/bin/time ./lphash build-p -i /data2/DNA/lphash_datasets/human.k63.unitigs.fa.ust.fa.gz -k 63 -m 28 -t 4 -c 5.0 -d tmp_dir -o human.k64.lphash --verbose

and got

1154.18user 17.66system 7:26.27elapsed

so exactly what I reported in the paper.

The line that you reported looks like this on my end:

2023-07-31 07:48:17: 1259379 buckets done in 0 seconds (14.7426% of keys, 5% of buckets, 3.72143 trials per bucket, 3.72144 expected trials per bucket) 

0 seconds!

I would do a fresh clone of the repo and recompile.

tbenavi1 commented 1 year ago

Hello,

Based on some testing, I think there may be several issues.

First, I think increasing the number of threads above some threshold leads to a performance degradation.

At one point, when I tried to run with 64 threads, I received the following error after it ran for some time:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  parallel search should use at most 36 threads

However, when I run with 32 threads I no longer receive the error, but the runtime is longer than when I run with 4 threads.

I also receive another error when running the following code on T2T human genome input:

lphash build-p -i chm13v2.0.k25.unitigs.fa.ust.fa.gz -k 25 -m 17 -t 4 -c 5.0 -d tmp3 -o chm13v2.0_c5_k25_m17.lph --verbose
2023-07-31 08:58:06: search ends
 == 1743363 empty buckets (2.71267%)
 == total trials = 13063020969
 == total expected trials = 13062410860
 == search took 304.757 seconds
Part 3: build inverted index
Percentage of maximal super-k-mers: 26.1134%
Percentage of left-maximal super-k-mers: 26.7672%
Percentage of right-maximal super-k-mers : 26.726%
Percentage of unclassified super-k-mers: 20.3935%
Total = 100%
Percentage of ambiguous minimizers: 16.9299%
Part 4: build fallback MPHF
c = 5
alpha = 0.94
num_keys = 955743096
table_size = 1016747974
num_buckets = 160187311
using 8 GB of RAM (0.127094 GB occupied by the bitmap)
using a peak of 28.0423 GB of disk space
 == processed 49% keys from input
terminate called after throwing an instance of 'std::runtime_error'
  what():  seed did not work
Aborted (core dumped)

Additionally, I am wondering if lphash gets confused when you start multiple runs in the same directory (and do not set the temp directory with -d). When I set the temp directory explicitly, lphash seemed to perform better.
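For example, something along these lines (the file names here are just placeholders) keeps the temporary files of concurrent runs separate:

lphash build-p -i input1.unitigs.fa.ust.fa.gz -k 63 -m 28 -c 5.0 -t 4 -d tmp_run1 -o run1.lph --verbose
lphash build-p -i input2.unitigs.fa.ust.fa.gz -k 63 -m 28 -c 5.0 -t 4 -d tmp_run2 -o run2.lph --verbose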

The only other possible difference I could see for the performance degradation is the order of the options between the command I ran and the command you ran, but I doubt that is the issue.

In summary: I can now get lphash to run quickly on your human dataset, though only with 4 threads. I cannot get lphash to run on the T2T human dataset I generated, due to a "seed did not work" error.

jermp commented 1 year ago

However, when I run with 32 threads I no longer receive the error, but the runtime is longer than when I run with 4 threads.

Yes, it could be, due to contention. PTHash (the general-purpose MPHF that LPHash uses for the minimizers) uses a complicated parallel search procedure. If you want to boost scalability, you should use a PTHash-HEM function. I can explain to you how to use it if you want.

I never tried LPHash on k=25. Could you try to use the uint64_t type here https://github.com/jermp/lphash/blob/main/include/compile_constants.tpd instead of __uint128_t?
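For example, to locate the definition to change (assuming a fresh clone of the repo), something like:

grep -n "__uint128_t" include/compile_constants.tpd

and then recompile in Release mode.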

tbenavi1 commented 1 year ago

I switched to uint64_t in compile_constants.tpd and recompiled. This time, it worked! I did not get a "seed did not work" error.

I would also be interested in the PTHash-HEM function. Thanks.

jermp commented 1 year ago

Ok, cool! I can explain via email if you prefer. In the meantime, refer to the HEM description here: https://arxiv.org/pdf/2106.02350.pdf.

tbenavi1 commented 1 year ago

Thanks. Now I am running into another issue. It seems like lphash is not generating a minimal perfect hash. Specifically, there are several different kmers which are all hashing to the same value. For example: TAACTGTGAGATGTAAGCACACATC, TTTTATTTTGCAGTAATGGCCAATA, and TTTTCCTGAAGATGTATCTATGACC all hash to 2391456533. As far as I'm aware, these three kmers were all part of the original input that was used to generate the lphash file. It seems like the hashes are close to a minimal perfect hash, but there are still some gaps and repeats.
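(For reference, one way to spot such repeats, assuming the hash values are dumped to a hypothetical file hashes.txt with one value per line:

sort -n hashes.txt | uniq -d

which prints every value that occurs more than once.)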

Please let me know if you would like me to send you any input files for testing. I am currently just running k=25 with the T2T human genome. I can also see if k=31 has the same problem.

jermp commented 1 year ago

We never tested any k < 31, so please send me your input file. CC: @yhhshb.

Also, why can't you use k=31?

tbenavi1 commented 1 year ago

I just emailed you my input file. I am also in the process of running with k=31. I will let you know my results. Thanks.

jermp commented 1 year ago

Hello, I tried your file with

/usr/bin/time -v ./lphash build-p -i chm13v2.0.k25.unitigs.fa.ust.fa.gz -k 25 -m 17 -t 8 -c 7.0 -d tmp_dir -o test.lphash --verbose --check

and everything looks good. No hash collisions: the --check flag enables the correctness checks, which did not report any problem.

I've also indexed all k-mers using SSHash and, again, I did not see any collisions. The SSHash command used was:

./sshash build -i ~/lphash/build/chm13v2.0.k25.unitigs.fa.ust.fa.gz -k 25 -m 17 -o chm13.sshash -d tmp --verbose --check

tbenavi1 commented 1 year ago

Thank you so much. I am able to get lphash to work properly on my files now. I believe my issue was the compilation constant as described above.

I did want to let you know of two other bugs I found: if you run the query command on a FASTA file containing a sequence that is shorter than the k-mer size k, lphash crashes. Also, I tried to install lphash on a different cluster with an older operating system, but I get the following error when trying to run lphash:

Illegal instruction (core dumped)

I'm not sure if there is some os/architecture requirement. lphash works for "Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz" running "Rocky Linux 8.4 (Green Obsidian)" but not "Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz" running "CentOS Linux 7 (Core)".

jermp commented 1 year ago

I did want to let you know of two other bugs I found: if you run the query command on a FASTA file containing a sequence that is shorter than the k-mer size k, lphash crashes.

Thanks for reporting. Yeah, most likely a trivial edge case that we did not handle. We will fix it.
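In the meantime, a possible workaround is to filter out the records shorter than k before querying. For a plain (uncompressed) FASTA file, something like this awk one-liner would do it (shown with k=25; input.fa and filtered.fa are placeholder names):

awk -v k=25 '/^>/{if(h!="" && length(s)>=k) print h"\n"s; h=$0; s=""; next} {s=s $0} END{if(h!="" && length(s)>=k) print h"\n"s}' input.fa > filtered.fa

Note that multi-line sequences are written back on a single line.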

The Illegal Instruction error is common nowadays, due to differences between CPU architectures. It could be caused, e.g., by missing support for PDEP or POPCNT.
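One way to check whether the older machine exposes those instructions (on Linux) is:

grep -m 1 flags /proc/cpuinfo | tr ' ' '\n' | grep -E -x 'popcnt|bmi2'

If popcnt or bmi2 (the extension that provides PDEP) is missing from the output, that would explain the crash.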

Let me know if you need further assistance.