jkbonfield / fqzcomp

Fastq compression tool

Inconsistent "-s9" setting #1

Closed · KirillKryukov closed this issue 4 years ago

KirillKryukov commented 4 years ago

I noticed that fqzcomp's -s9 setting is inconsistent with the rest of the range.

README.md:

-s Specifies the size of the sequence context. Increasing this will improve compression on large data sets, but each increment in level will quadruple the memory used by the sequence compression steps. Furthermore, increasing it too high may harm compression on small files.

Command line help:

    -s <level>     Sequence compression level. 1-9 [Def. 3]
                   Specifying '+' on the end (eg -s5+) will use
                   models of multiple sizes for improved compression.
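
For reference, the quadrupling follows directly from the context size: at 2 bits per base, a context of k bases needs 4^k model entries, so each extra base of context multiplies the table by 4. With -s<level> mapping to k = level + 7 (as the discussion below notes), -s8 implies 4^15 ≈ 1.1 × 10^9 entries, which at roughly 8 bytes per entry is consistent with the ~8 GB measured below.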

It works as expected from -s1 to -s8. However, -s9 behaves differently:

1. It consumes a tiny amount of RAM, less than -s1.
2. It is very fast, again faster than -s1.
3. Its compression strength is weaker than with -s1.

Essentially, -s9 works as if it was -s0. And the strongest compression is instead achieved by -s8 (for large enough inputs).

Data for a 2.76 GB bacterial dataset (source):

| Setting | Comp. memory (MB) | Dec. memory (MB) | Comp. time (s) | Dec. time (s) | Compressed size (B) |
|---------|------------------:|-----------------:|---------------:|--------------:|--------------------:|
| -s1     | 39.55             | 38.05            | 74.59          | 94.39         | 618,210,871         |
| -s2     | 41.16             | 39.85            | 77.34          | 100.00        | 605,055,279         |
| -s3     | 47.53             | 45.84            | 81.32          | 104.4         | 570,956,053         |
| -s4     | 72.41             | 70.51            | 104.4          | 135.7         | 503,756,475         |
| -s5     | 170.6             | 169.1            | 150.2          | 181.0         | 409,817,080         |
| -s6     | 563.9             | 562.0            | 179.2          | 198.6         | 319,920,024         |
| -s7     | 2,137             | 2,135            | 194.4          | 234.7         | 258,172,962         |
| -s8     | 8,428             | 8,426            | 243.9          | 312.4         | 226,192,829         |
| -s9     | 39.11             | 37.70            | 69.84          | 84.72         | 664,906,540         |

The relationship is similar with other test data. Compression strength and speed depend on the data, and are therefore less reliable indicators. However, allocated memory always shows the same relationship between the -s settings: e.g., even on the tiniest inputs fqzcomp will allocate 8 GB with -s8, yet with -s9 it consistently uses less RAM than with -s1. (Chart: compression memory on a range of data sizes.)

This is not a critical issue, just something I did not expect (and that the manual does not explain).

OS is Ubuntu 18.04.1 LTS, total RAM is 128 GB. Let me know if you need any details, or a more rigorous standalone repro script.

jkbonfield commented 4 years ago

It's probably wrapping around a data type, as the -s param is val+7. At 9 this hits 16 bases of context, which at 2 bits per base equates to 2^32. That was intended to be the maximum, but maybe a signed vs unsigned issue introduced a bug.
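
A minimal sketch of that arithmetic (illustrative only; the names and the 1 << 2k table-size computation are assumptions, not fqzcomp's actual code):

    #include <cstdio>
    #include <cstdint>

    int main() {
        for (int level = 1; level <= 9; level++) {
            int k    = level + 7;  // -s<level> maps to a context of level+7 bases
            int bits = 2 * k;      // 2 bits per base
            // Widening to 64 bits avoids the wrap; a 32-bit shift would
            // overflow exactly at level 9, where bits == 32.
            uint64_t entries = (uint64_t)1 << bits;  // 4^k model entries
            std::printf("-s%d: k=%2d, bits=%2d, entries=%llu%s\n",
                        level, k, bits, (unsigned long long)entries,
                        bits >= 32 ? "  <-- no longer fits in 32 bits" : "");
        }
        return 0;
    }

With a 32-bit (or signed) type the level-9 shift is undefined behaviour; on x86 the shift count is effectively taken mod 32, which could collapse the table to a single entry and would be consistent with the tiny memory use reported above.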

To be fair, it doesn't normally get much better beyond -s7 or so; it really needs a multi-length model (PPM style), and hash tables instead of direct array indexing, to cope with the larger k-mers.
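
As a rough illustration of the hash-table point (a sketch under assumed names, not fqzcomp code): direct indexing must allocate all 4^k slots up front, whereas a hash table only stores the contexts that actually occur in the data.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    int main() {
        const int k = 16;  // the context length that -s9 was meant to reach
        const uint64_t mask = (((uint64_t)1) << (2 * k)) - 1;

        // Direct indexing would need 4^16 (~4.3e9) slots regardless of input;
        // a hash map grows only with the number of distinct contexts seen.
        std::unordered_map<uint64_t, uint32_t> counts;

        const char *seq = "ACGTACGTACGTACGTACGT";
        uint64_t ctx = 0;
        for (int i = 0; seq[i]; i++) {
            uint64_t base = (seq[i] >> 1) & 3;  // 2-bit code: A=0,C=1,T=2,G=3
            ctx = ((ctx << 2) | base) & mask;   // rolling 2k-bit context
            if (i >= k - 1)
                counts[ctx]++;                  // count only observed contexts
        }
        std::printf("distinct %d-base contexts stored: %zu\n", k, counts.size());
        return 0;
    }

For real data the map stays far below 4^k at large k, at the cost of hashing overhead per symbol.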

I can fix the trivial issue though. Thanks for reporting it.

jkbonfield commented 4 years ago

Now fixed. As expected, it was a signed vs unsigned bug. Evidently I'd never checked -s9. My apologies.

KirillKryukov commented 4 years ago

@jkbonfield, thanks for the quick fix!

KirillKryukov commented 4 years ago

@jkbonfield, does -s9 work for you? I finally started testing it and found that it always crashes (#2).