luizirber / niffler

Simple and transparent support for compressed files.
Apache License 2.0

performance opening gzip with from_path() and serde_json::from_reader() #55

Closed · dholth closed this issue 2 years ago

dholth commented 2 years ago

Describe the bug

I was surprised that serde_json::from_reader over a niffler::from_path reader was slower than Python's json.load(gzip.open("repodata.json.gz"))

$ time python3 baseline.py 

real    0m2.863s
user    0m2.648s
sys     0m0.209s

$ time pypy3 baseline.py 

real    0m1.680s
user    0m1.500s
sys     0m0.180s

$ cargo build -r ; time ./target/release/nifftime 
   Compiling nifftime v0.1.0 (/home/dholth/prog/nifftime)
    Finished release [optimized] target(s) in 0.71s

real    0m10.187s
user    0m9.917s
sys     0m0.270s

It helps if you wrap the niffler reader in a BufReader (let mut buffered = BufReader::new(reader);), which brings the time down to real 0m3.448s.
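
For concreteness, a minimal sketch of that workaround, assuming the same repodata.json.gz file and a plain serde_json::Value target (the attached nifftime program may differ in detail):

use std::io::BufReader;

fn main() {
    // niffler::from_path returns (reader, detected compression format).
    let (reader, _format) =
        niffler::from_path("repodata.json.gz").expect("could not open file");
    // Wrapping the *decompressed* stream in a BufReader is the workaround:
    // serde_json issues many small reads, and the buffer keeps them cheap.
    let buffered = BufReader::new(reader);
    let value: serde_json::Value =
        serde_json::from_reader(buffered).expect("could not parse");
    println!("{} top-level keys", value.as_object().map_or(0, |o| o.len()));
}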

On the other hand, decompressing to std::io::sink is very fast.

nifftime.zip.zip

natir commented 2 years ago

@dholth first of all, thank you for your interest in the niffler project.

I reproduced your result. I am working on a benchmark comparing reads of a plain (uncompressed) file against every compression format available in niffler (the plain file seems too fast to compare against the compressed files).

I think this poor performance is not due to niffler itself, which does very little, but probably to the backend library it uses. A change to the build system (#54) is planned to let the user choose which backend features are activated; I think this could help performance.

natir commented 2 years ago

We already have a benchmark comparing parsing through niffler with parsing directly through the backend crate; niffler is a little slower, but not by the factor of 10 your benchmark shows.

dholth commented 2 years ago

I compared pypy (Python with a JIT) deserializing a gzipped ~170 MB https://conda.anaconda.org/conda-forge/linux-64/repodata.json against niffler not wrapped in a BufReader.

I understand that the pure-Rust zlib is a bit slow. I would like to try the cloudflare zlib.

pypy  1.6s               value = json.load(gzip.open("repodata.json.gz"))
 py3  2.9s               value = json.load(gzip.open("repodata.json.gz"))
rust  3.4s (buffered)    let _value: Value = serde_json::from_reader(BufReader::new(niffler_reader)).expect("could not parse");
rust 10.2s (unbuffered)  let _value: Value = serde_json::from_reader(&mut niffler_reader).expect("could not parse");
natir commented 2 years ago

The difference between buffered and unbuffered makes sense. I agree the default zlib isn't the best one; if you use the transitive_feature branch, you have access to features that control which zlib is used. I will try to finish pull request #54 quickly.

dholth commented 2 years ago

I think serde's access pattern is harder on the unbuffered reader than simply copying to, e.g., https://doc.rust-lang.org/std/io/fn.sink.html
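
As a rough illustration of that difference (a sketch only, assuming the same file name): draining the niffler reader into std::io::sink() with io::copy pulls large sequential chunks, whereas serde_json::from_reader issues many tiny reads against the decoder.

use std::io;

fn main() {
    let (mut reader, _format) =
        niffler::from_path("repodata.json.gz").expect("could not open file");
    // io::copy reads through a large internal buffer, so the gzip decoder is
    // asked for big blocks and stays fast even without an extra BufReader.
    let bytes = io::copy(&mut reader, &mut io::sink()).expect("decompression failed");
    println!("decompressed {} bytes", bytes);
}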

natir commented 2 years ago

OK, I wrote a small benchmark that actually "uses" the JSON value, in the json_parse branch.

wget https://conda.anaconda.org/conda-forge/linux-64/repodata.json
gzip -k repodata.json
bzip2 -k repodata.json
xz -k repodata.json
cargo build --release --bin json
hyperfine "./target/release/json 1 repodata.json" "./target/release/json 0 repodata.json" "./target/release/json 1 repodata.json.bz2" "./target/release/json 0 repodata.json.bz2" "./target/release/json 1 repodata.json.gz" "./target/release/json 0 repodata.json.gz" "./target/release/json 1 repodata.json.xz" "./target/release/json 0 repodata.json.xz"

I get these results:

Benchmark 1: ./target/release/json 1 repodata.json
  Time (mean ± σ):      2.779 s ±  0.015 s    [User: 2.573 s, System: 0.194 s]
  Range (min … max):    2.762 s …  2.800 s    10 runs

Benchmark 2: ./target/release/json 0 repodata.json
  Time (mean ± σ):     39.408 s ±  0.116 s    [User: 6.217 s, System: 33.056 s]
  Range (min … max):   39.250 s … 39.656 s    10 runs

Benchmark 3: ./target/release/json 1 repodata.json.bz2
  Time (mean ± σ):      6.112 s ±  0.040 s    [User: 5.881 s, System: 0.206 s]
  Range (min … max):    6.055 s …  6.171 s    10 runs

Benchmark 4: ./target/release/json 0 repodata.json.bz2
  Time (mean ± σ):      6.094 s ±  0.072 s    [User: 5.860 s, System: 0.208 s]
  Range (min … max):    6.021 s …  6.241 s    10 runs

Benchmark 5: ./target/release/json 1 repodata.json.gz
  Time (mean ± σ):      6.489 s ±  0.033 s    [User: 6.276 s, System: 0.197 s]
  Range (min … max):    6.450 s …  6.560 s    10 runs

Benchmark 6: ./target/release/json 0 repodata.json.gz
  Time (mean ± σ):      6.533 s ±  0.085 s    [User: 6.305 s, System: 0.204 s]
  Range (min … max):    6.470 s …  6.763 s    10 runs

Benchmark 7: ./target/release/json 1 repodata.json.xz
  Time (mean ± σ):     11.254 s ±  0.054 s    [User: 11.003 s, System: 0.214 s]
  Range (min … max):   11.165 s … 11.341 s    10 runs

Benchmark 8: ./target/release/json 0 repodata.json.xz
  Time (mean ± σ):     11.243 s ±  0.037 s    [User: 10.985 s, System: 0.228 s]
  Range (min … max):   11.198 s … 11.335 s    10 runs

Summary
  './target/release/json 1 repodata.json' ran
    2.19 ± 0.03 times faster than './target/release/json 0 repodata.json.bz2'
    2.20 ± 0.02 times faster than './target/release/json 1 repodata.json.bz2'
    2.33 ± 0.02 times faster than './target/release/json 1 repodata.json.gz'
    2.35 ± 0.03 times faster than './target/release/json 0 repodata.json.gz'
    4.05 ± 0.03 times faster than './target/release/json 0 repodata.json.xz'
    4.05 ± 0.03 times faster than './target/release/json 1 repodata.json.xz'
   14.18 ± 0.09 times faster than './target/release/json 0 repodata.json'

Except for the uncompressed file, the buffered version (1) and the unbuffered version (0) take almost the same time. That makes sense to me, because the streaming decompression algorithms already use an internal buffer.

This doesn't solve our performance trouble compared to Python, but it's a base to work from.

natir commented 2 years ago

Comparison against the Python script:

Benchmark 1: ./target/release/json 1 repodata.json.gz
  Time (mean ± σ):      6.581 s ±  0.114 s    [User: 6.356 s, System: 0.200 s]
  Range (min … max):    6.488 s …  6.830 s    10 runs

Benchmark 2: ./target/release/json 0 repodata.json.gz
  Time (mean ± σ):      6.536 s ±  0.054 s    [User: 6.323 s, System: 0.194 s]
  Range (min … max):    6.483 s …  6.674 s    10 runs

Benchmark 3: ./script.py repodata.json.gz
  Time (mean ± σ):      1.990 s ±  0.019 s    [User: 1.711 s, System: 0.272 s]
  Range (min … max):    1.970 s …  2.030 s    10 runs

Benchmark 4: pypy script.py repodata.json.gz
  Time (mean ± σ):      1.219 s ±  0.017 s    [User: 0.951 s, System: 0.251 s]
  Range (min … max):    1.200 s …  1.244 s    10 runs

Summary
  'pypy script.py repodata.json.gz' ran
    1.63 ± 0.03 times faster than './script.py repodata.json.gz'
    5.36 ± 0.09 times faster than './target/release/json 0 repodata.json.gz'
    5.40 ± 0.12 times faster than './target/release/json 1 repodata.json.gz'

Using the "zlib" feature:

➜ hyperfine "./target/release/json 1 repodata.json.gz" "./target/release/json 0 repodata.json.gz" "./script.py repodata.json.gz" "pypy script.py repodata.json.gz"
Benchmark 1: ./target/release/json 1 repodata.json.gz
  Time (mean ± σ):      8.114 s ±  0.047 s    [User: 7.884 s, System: 0.206 s]
  Range (min … max):    8.029 s …  8.195 s    10 runs

Benchmark 2: ./target/release/json 0 repodata.json.gz
  Time (mean ± σ):      8.139 s ±  0.025 s    [User: 7.906 s, System: 0.201 s]
  Range (min … max):    8.102 s …  8.175 s    10 runs

Benchmark 3: ./script.py repodata.json.gz
  Time (mean ± σ):      1.988 s ±  0.015 s    [User: 1.704 s, System: 0.276 s]
  Range (min … max):    1.967 s …  2.016 s    10 runs

Benchmark 4: pypy script.py repodata.json.gz
  Time (mean ± σ):      1.236 s ±  0.014 s    [User: 0.974 s, System: 0.247 s]
  Range (min … max):    1.220 s …  1.260 s    10 runs

Summary
  'pypy script.py repodata.json.gz' ran
    1.61 ± 0.02 times faster than './script.py repodata.json.gz'
    6.56 ± 0.08 times faster than './target/release/json 1 repodata.json.gz'
    6.58 ± 0.08 times faster than './target/release/json 0 repodata.json.gz'

The other features, zlib-ng-compat and cloudfare_zlib, perform horribly (more than 3 minutes for one run).

dholth commented 2 years ago

This is what I meant about buffering.

} else if buffer == "2" {
    // Wrap the decompressed niffler reader in a BufReader before parsing.
    let value: serde_json::Value = serde_json::from_reader(std::io::BufReader::new(
        niffler::from_path(json_path).unwrap().0,
    ))
    .unwrap();
    count += value.as_object().unwrap().len();
}

On my slow laptop,

$ time target/release/json 2 ../repodata-000.json.gz 
2 ../repodata-000.json.gz
5

real    0m4.732s
user    0m4.201s
sys     0m0.530s

$ time target/release/json 1 ../repodata-000.json.gz 
1 ../repodata-000.json.gz
5

real    0m13.151s
user    0m12.368s
sys     0m0.782s

Thank you for spending so much time looking into this issue.

natir commented 2 years ago

niffler::from_path already uses a BufReader. I expected that adding a BufReader on top of another BufReader would have no impact, but your example shows one. Maybe it's because the first buffer contains compressed data and the second contains decompressed data, so the decompressed buffer is read from far more often than the compressed one.
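
To make the two buffer layers explicit, here is the same pipeline written by hand with flate2 instead of niffler (only an illustration of the layering; niffler's internals may differ):

use std::fs::File;
use std::io::BufReader;

use flate2::read::GzDecoder;

fn main() {
    let file = File::open("repodata.json.gz").expect("could not open file");
    // First buffer: compressed bytes, read from disk in large chunks.
    let compressed = BufReader::new(file);
    let decoder = GzDecoder::new(compressed);
    // Second buffer: decompressed bytes, so serde_json's small reads are
    // served from memory instead of going through the decoder each time.
    let decompressed = BufReader::new(decoder);
    let value: serde_json::Value =
        serde_json::from_reader(decompressed).expect("could not parse");
    println!("{} top-level keys", value.as_object().map_or(0, |o| o.len()));
}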

Did you use the default gzip compression level?

dholth commented 2 years ago

Yes, I used the default level.

natir commented 2 years ago

OK, adding a buffer on the decompressed reader seems to be a very good idea:

Benchmark 1: ./target/release/json 2 repodata.json
  Time (mean ± σ):      2.352 s ±  0.020 s    [User: 2.135 s, System: 0.204 s]
  Range (min … max):    2.321 s …  2.385 s    10 runs

Benchmark 2: ./target/release/json 1 repodata.json
  Time (mean ± σ):      2.843 s ±  0.028 s    [User: 2.607 s, System: 0.219 s]
  Range (min … max):    2.805 s …  2.886 s    10 runs

Benchmark 3: ./target/release/json 0 repodata.json
  Time (mean ± σ):     32.659 s ±  0.132 s    [User: 6.204 s, System: 26.310 s]
  Range (min … max):   32.418 s … 32.874 s    10 runs

Benchmark 4: ./target/release/json 2 repodata.json.bz2
  Time (mean ± σ):      5.072 s ±  0.077 s    [User: 4.813 s, System: 0.224 s]
  Range (min … max):    4.973 s …  5.191 s    10 runs

Benchmark 5: ./target/release/json 1 repodata.json.bz2
  Time (mean ± σ):      6.306 s ±  0.103 s    [User: 6.038 s, System: 0.226 s]
  Range (min … max):    6.150 s …  6.479 s    10 runs

Benchmark 6: ./target/release/json 0 repodata.json.bz2
  Time (mean ± σ):      6.318 s ±  0.084 s    [User: 6.041 s, System: 0.227 s]
  Range (min … max):    6.207 s …  6.485 s    10 runs

Benchmark 7: ./target/release/json 2 repodata.json.gz
  Time (mean ± σ):      2.596 s ±  0.029 s    [User: 2.362 s, System: 0.214 s]
  Range (min … max):    2.555 s …  2.666 s    10 runs

Benchmark 8: ./target/release/json 1 repodata.json.gz
  Time (mean ± σ):      6.691 s ±  0.080 s    [User: 6.414 s, System: 0.226 s]
  Range (min … max):    6.598 s …  6.885 s    10 runs

Benchmark 9: ./target/release/json 0 repodata.json.gz
  Time (mean ± σ):      6.663 s ±  0.053 s    [User: 6.409 s, System: 0.216 s]
  Range (min … max):    6.597 s …  6.742 s    10 runs

Benchmark 10: ./target/release/json 2 repodata.json.xz
  Time (mean ± σ):      3.592 s ±  0.150 s    [User: 3.352 s, System: 0.217 s]
  Range (min … max):    3.466 s …  3.779 s    10 runs

Benchmark 11: ./target/release/json 1 repodata.json.xz
  Time (mean ± σ):     11.597 s ±  0.055 s    [User: 11.294 s, System: 0.230 s]
  Range (min … max):   11.505 s … 11.652 s    10 runs

Benchmark 12: ./target/release/json 0 repodata.json.xz
  Time (mean ± σ):     11.673 s ±  0.088 s    [User: 11.364 s, System: 0.229 s]
  Range (min … max):   11.544 s … 11.781 s    10 runs

pypy is still about 2 times faster than Rust:

Benchmark 1: ./target/release/json 2 repodata.json.gz
  Time (mean ± σ):      2.600 s ±  0.021 s    [User: 2.369 s, System: 0.208 s]
  Range (min … max):    2.571 s …  2.640 s    10 runs

Benchmark 2: ./target/release/json 1 repodata.json.gz
  Time (mean ± σ):      6.643 s ±  0.045 s    [User: 6.378 s, System: 0.211 s]
  Range (min … max):    6.567 s …  6.725 s    10 runs

Benchmark 3: ./target/release/json 0 repodata.json.gz
  Time (mean ± σ):      6.652 s ±  0.047 s    [User: 6.390 s, System: 0.223 s]
  Range (min … max):    6.592 s …  6.750 s    10 runs

Benchmark 4: ./script.py repodata.json.gz
  Time (mean ± σ):      2.188 s ±  0.032 s    [User: 1.874 s, System: 0.298 s]
  Range (min … max):    2.143 s …  2.244 s    10 runs

Benchmark 5: pypy script.py repodata.json.gz
  Time (mean ± σ):      1.316 s ±  0.021 s    [User: 1.003 s, System: 0.286 s]
  Range (min … max):    1.298 s …  1.365 s    10 runs

Summary
  'pypy script.py repodata.json.gz' ran
    1.66 ± 0.04 times faster than './script.py repodata.json.gz'
    1.98 ± 0.03 times faster than './target/release/json 2 repodata.json.gz'
    5.05 ± 0.09 times faster than './target/release/json 1 repodata.json.gz'
    5.05 ± 0.09 times faster than './target/release/json 0 repodata.json.gz'

zlib-ng and cloudfare_zlib improve the previous result a little.

dholth commented 2 years ago

Thanks again. ./target/release/json 2 is pretty good. pypy's JSON parser is probably not comparable to serde (it is very different, and does optimizations that Rust would not consider).