Closed dholth closed 2 years ago
@dholth first of all, thank you for your interest in the niffler project.
I reproduced your result. I am working on a benchmark that compares reading data from a flat file against every compression format available in niffler (the flat file seems too fast compared to the compressed files).
I don't think this bad performance is due to niffler itself, which does very little, but rather to the libraries used in the backend. A change to the build system (#54) is planned to let the user choose which backend features are activated; I think this could help performance.
We already have benchmarks comparing parsing through niffler and directly with the backend crates; niffler is a little bit slower, but not by a factor of 10 as your benchmark showed.
I compared pypy (Python with a JIT) deserializing a gzipped 170 MB https://conda.anaconda.org/conda-forge/linux-64/repodata.json
against niffler not wrapped in a BufReader.
I understand that the pure-Rust zlib is a bit slow. I would like to try the cloudflare zlib.
pypy 1.6s value = json.load(gzip.open("repodata.json.gz"))
py3 2.9s value = json.load(gzip.open("repodata.json.gz"))
rust 3.4s (buffered) let _value: Value = serde_json::from_reader(&mut BufReader::new(niffler_reader)).expect("could not parse");
rust 10.2s (unbuffered) let _value: Value = serde_json::from_reader(&mut niffler_reader).expect("could not parse");
The difference between buffered and unbuffered makes sense. I agree the default zlib isn't the best one. If you use the transitive_feature branch, you have access to features that control which zlib is used. I will try to finish pull request #54 quickly.
I think the serde access pattern is harder on the unbuffered reader than simply copying to e.g. https://doc.rust-lang.org/std/io/fn.sink.html
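The two access patterns can be contrasted with plain std types. This is only a sketch: `count_bytes` and the in-memory `Cursor` are my stand-ins for the niffler reader, not its real API.

```rust
use std::io::{self, BufReader, Read};

// Read a stream one byte at a time, the way a streaming JSON parser
// issues many small reads. `R: Read` stands in for the niffler reader.
fn count_bytes<R: Read>(reader: R) -> io::Result<u64> {
    // Wrapping in BufReader turns each 1-byte read into a cheap copy
    // from an internal 8 KiB buffer instead of a call into the source.
    let mut buffered = BufReader::new(reader);
    let mut byte = [0u8; 1];
    let mut count = 0u64;
    while buffered.read(&mut byte)? == 1 {
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let data = vec![b'x'; 1 << 20];

    // Bulk path: io::copy pulls large chunks, so buffering barely matters.
    let copied = io::copy(&mut io::Cursor::new(&data), &mut io::sink())?;

    // Parser-like path: many tiny reads, where buffering is essential.
    let counted = count_bytes(io::Cursor::new(&data))?;

    assert_eq!(copied, counted);
    println!("{copied} bytes via io::copy, {counted} bytes via 1-byte reads");
    Ok(())
}
```

`io::copy` reads in large chunks regardless of the source, while a streaming parser makes many tiny reads, so only the second pattern benefits noticeably from a `BufReader`.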
OK, I wrote a small benchmark that makes some "usage" of the JSON value in the json_parse branch.
wget https://conda.anaconda.org/conda-forge/linux-64/repodata.json
gzip -k repodata.json
bzip2 -k repodata.json
xz -k repodata.json
cargo build --release --bin json
hyperfine "./target/release/json 1 repodata.json" "./target/release/json 0 repodata.json" "./target/release/json 1 repodata.json.bz2" "./target/release/json 0 repodata.json.bz2" "./target/release/json 1 repodata.json.gz" "./target/release/json 0 repodata.json.gz" "./target/release/json 1 repodata.json.xz" "./target/release/json 0 repodata.json.xz"
I get this result:
Benchmark 1: ./target/release/json 1 repodata.json
Time (mean ± σ): 2.779 s ± 0.015 s [User: 2.573 s, System: 0.194 s]
Range (min … max): 2.762 s … 2.800 s 10 runs
Benchmark 2: ./target/release/json 0 repodata.json
Time (mean ± σ): 39.408 s ± 0.116 s [User: 6.217 s, System: 33.056 s]
Range (min … max): 39.250 s … 39.656 s 10 runs
Benchmark 3: ./target/release/json 1 repodata.json.bz2
Time (mean ± σ): 6.112 s ± 0.040 s [User: 5.881 s, System: 0.206 s]
Range (min … max): 6.055 s … 6.171 s 10 runs
Benchmark 4: ./target/release/json 0 repodata.json.bz2
Time (mean ± σ): 6.094 s ± 0.072 s [User: 5.860 s, System: 0.208 s]
Range (min … max): 6.021 s … 6.241 s 10 runs
Benchmark 5: ./target/release/json 1 repodata.json.gz
Time (mean ± σ): 6.489 s ± 0.033 s [User: 6.276 s, System: 0.197 s]
Range (min … max): 6.450 s … 6.560 s 10 runs
Benchmark 6: ./target/release/json 0 repodata.json.gz
Time (mean ± σ): 6.533 s ± 0.085 s [User: 6.305 s, System: 0.204 s]
Range (min … max): 6.470 s … 6.763 s 10 runs
Benchmark 7: ./target/release/json 1 repodata.json.xz
Time (mean ± σ): 11.254 s ± 0.054 s [User: 11.003 s, System: 0.214 s]
Range (min … max): 11.165 s … 11.341 s 10 runs
Benchmark 8: ./target/release/json 0 repodata.json.xz
Time (mean ± σ): 11.243 s ± 0.037 s [User: 10.985 s, System: 0.228 s]
Range (min … max): 11.198 s … 11.335 s 10 runs
Summary
'./target/release/json 1 repodata.json' ran
2.19 ± 0.03 times faster than './target/release/json 0 repodata.json.bz2'
2.20 ± 0.02 times faster than './target/release/json 1 repodata.json.bz2'
2.33 ± 0.02 times faster than './target/release/json 1 repodata.json.gz'
2.35 ± 0.03 times faster than './target/release/json 0 repodata.json.gz'
4.05 ± 0.03 times faster than './target/release/json 0 repodata.json.xz'
4.05 ± 0.03 times faster than './target/release/json 1 repodata.json.xz'
14.18 ± 0.09 times faster than './target/release/json 0 repodata.json'
Except for the uncompressed file, the buffered version (1) and the unbuffered version (0) take almost the same time. That makes sense to me, because the streaming decompression algorithms already use an internal buffer.
It doesn't solve our performance trouble compared to Python, but it's a base to work from.
Comparison against python script:
Benchmark 1: ./target/release/json 1 repodata.json.gz
Time (mean ± σ): 6.581 s ± 0.114 s [User: 6.356 s, System: 0.200 s]
Range (min … max): 6.488 s … 6.830 s 10 runs
Benchmark 2: ./target/release/json 0 repodata.json.gz
Time (mean ± σ): 6.536 s ± 0.054 s [User: 6.323 s, System: 0.194 s]
Range (min … max): 6.483 s … 6.674 s 10 runs
Benchmark 3: ./script.py repodata.json.gz
Time (mean ± σ): 1.990 s ± 0.019 s [User: 1.711 s, System: 0.272 s]
Range (min … max): 1.970 s … 2.030 s 10 runs
Benchmark 4: pypy script.py repodata.json.gz
Time (mean ± σ): 1.219 s ± 0.017 s [User: 0.951 s, System: 0.251 s]
Range (min … max): 1.200 s … 1.244 s 10 runs
Summary
'pypy script.py repodata.json.gz' ran
1.63 ± 0.03 times faster than './script.py repodata.json.gz'
5.36 ± 0.09 times faster than './target/release/json 0 repodata.json.gz'
5.40 ± 0.12 times faster than './target/release/json 1 repodata.json.gz'
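`script.py` itself isn't shown in the thread; based on the one-liner quoted earlier, it is presumably close to the following (the `load_gz_json` name is mine, not from the repository):

```python
#!/usr/bin/env python3
"""Presumed contents of script.py: parse a gzipped JSON file and touch the result."""
import gzip
import json
import sys


def load_gz_json(path):
    # gzip.open returns a buffered file object, so json.load's many small
    # reads are cheap -- the same effect BufReader gives the Rust version.
    with gzip.open(path, "rt") as fh:
        return json.load(fh)


if __name__ == "__main__":
    value = load_gz_json(sys.argv[1])
    # Touch the parsed value so the work cannot be optimized away.
    print(len(value))
```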
Using the "zlib" feature:
➜ hyperfine "./target/release/json 1 repodata.json.gz" "./target/release/json 0 repodata.json.gz" "./script.py repodata.json.gz" "pypy script.py repodata.json.gz"
Benchmark 1: ./target/release/json 1 repodata.json.gz
Time (mean ± σ): 8.114 s ± 0.047 s [User: 7.884 s, System: 0.206 s]
Range (min … max): 8.029 s … 8.195 s 10 runs
Benchmark 2: ./target/release/json 0 repodata.json.gz
Time (mean ± σ): 8.139 s ± 0.025 s [User: 7.906 s, System: 0.201 s]
Range (min … max): 8.102 s … 8.175 s 10 runs
Benchmark 3: ./script.py repodata.json.gz
Time (mean ± σ): 1.988 s ± 0.015 s [User: 1.704 s, System: 0.276 s]
Range (min … max): 1.967 s … 2.016 s 10 runs
Benchmark 4: pypy script.py repodata.json.gz
Time (mean ± σ): 1.236 s ± 0.014 s [User: 0.974 s, System: 0.247 s]
Range (min … max): 1.220 s … 1.260 s 10 runs
Summary
'pypy script.py repodata.json.gz' ran
1.61 ± 0.02 times faster than './script.py repodata.json.gz'
6.56 ± 0.08 times faster than './target/release/json 1 repodata.json.gz'
6.58 ± 0.08 times faster than './target/release/json 0 repodata.json.gz'
The other features, zlib-ng-compat and cloudfare_zlib, perform horribly (more than 3 minutes for one run).
This is what I meant about buffering.
} else if buffer == "2" {
    let value: serde_json::Value = serde_json::from_reader(
        std::io::BufReader::new(niffler::from_path(json_path).unwrap().0),
    )
    .unwrap();
    count += value.as_object().unwrap().len();
}
On my slow laptop,
$ time target/release/json 2 ../repodata-000.json.gz
2 ../repodata-000.json.gz
5
real 0m4.732s
user 0m4.201s
sys 0m0.530s
$ time target/release/json 1 ../repodata-000.json.gz
1 ../repodata-000.json.gz
5
real 0m13.151s
user 0m12.368s
sys 0m0.782s
Thank you for spending so much time looking into this issue.
niffler::from_path already uses a BufReader. I thought adding a BufReader on top of another BufReader wouldn't have any impact, but your example shows one. Maybe that's because the first buffer contains compressed data and the second buffer uncompressed data, so the uncompressed buffer is consumed faster than the compressed one.
Did you use the default gzip compression level?
Yes, I used the default level.
OK, adding a buffer on the uncompressed reader seems to be a very good idea:
Benchmark 1: ./target/release/json 2 repodata.json
Time (mean ± σ): 2.352 s ± 0.020 s [User: 2.135 s, System: 0.204 s]
Range (min … max): 2.321 s … 2.385 s 10 runs
Benchmark 2: ./target/release/json 1 repodata.json
Time (mean ± σ): 2.843 s ± 0.028 s [User: 2.607 s, System: 0.219 s]
Range (min … max): 2.805 s … 2.886 s 10 runs
Benchmark 3: ./target/release/json 0 repodata.json
Time (mean ± σ): 32.659 s ± 0.132 s [User: 6.204 s, System: 26.310 s]
Range (min … max): 32.418 s … 32.874 s 10 runs
Benchmark 4: ./target/release/json 2 repodata.json.bz2
Time (mean ± σ): 5.072 s ± 0.077 s [User: 4.813 s, System: 0.224 s]
Range (min … max): 4.973 s … 5.191 s 10 runs
Benchmark 5: ./target/release/json 1 repodata.json.bz2
Time (mean ± σ): 6.306 s ± 0.103 s [User: 6.038 s, System: 0.226 s]
Range (min … max): 6.150 s … 6.479 s 10 runs
Benchmark 6: ./target/release/json 0 repodata.json.bz2
Time (mean ± σ): 6.318 s ± 0.084 s [User: 6.041 s, System: 0.227 s]
Range (min … max): 6.207 s … 6.485 s 10 runs
Benchmark 7: ./target/release/json 2 repodata.json.gz
Time (mean ± σ): 2.596 s ± 0.029 s [User: 2.362 s, System: 0.214 s]
Range (min … max): 2.555 s … 2.666 s 10 runs
Benchmark 8: ./target/release/json 1 repodata.json.gz
Time (mean ± σ): 6.691 s ± 0.080 s [User: 6.414 s, System: 0.226 s]
Range (min … max): 6.598 s … 6.885 s 10 runs
Benchmark 9: ./target/release/json 0 repodata.json.gz
Time (mean ± σ): 6.663 s ± 0.053 s [User: 6.409 s, System: 0.216 s]
Range (min … max): 6.597 s … 6.742 s 10 runs
Benchmark 10: ./target/release/json 2 repodata.json.xz
Time (mean ± σ): 3.592 s ± 0.150 s [User: 3.352 s, System: 0.217 s]
Range (min … max): 3.466 s … 3.779 s 10 runs
Benchmark 11: ./target/release/json 1 repodata.json.xz
Time (mean ± σ): 11.597 s ± 0.055 s [User: 11.294 s, System: 0.230 s]
Range (min … max): 11.505 s … 11.652 s 10 runs
Benchmark 12: ./target/release/json 0 repodata.json.xz
Time (mean ± σ): 11.673 s ± 0.088 s [User: 11.364 s, System: 0.229 s]
Range (min … max): 11.544 s … 11.781 s 10 runs
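One way to see why mode 2 wins so clearly for compressed input is to count how many read calls actually reach the inner (decompressing) reader. This sketch uses an in-memory Cursor as a stand-in for niffler's decompressor, and `CountingReader` is my own illustration, not niffler code:

```rust
use std::io::{self, BufReader, Read};

// A wrapper that counts how many read() calls reach the inner reader.
// The inner reader stands in for niffler's decompressor, where every
// call does real decompression work.
struct CountingReader<R> {
    inner: R,
    calls: u64,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

fn main() {
    let data = vec![b'{'; 64 * 1024];
    let mut byte = [0u8; 1];

    // Mode "1"-like: every 1-byte read hits the inner reader directly.
    let mut direct = CountingReader { inner: io::Cursor::new(&data), calls: 0 };
    while direct.read(&mut byte).unwrap() == 1 {}

    // Mode "2"-like: BufReader batches the 1-byte reads into 8 KiB refills.
    let mut counted = CountingReader { inner: io::Cursor::new(&data), calls: 0 };
    {
        let mut buffered = BufReader::new(&mut counted);
        while buffered.read(&mut byte).unwrap() == 1 {}
    }

    println!("direct: {} inner reads, buffered: {}", direct.calls, counted.calls);
    assert!(counted.calls < direct.calls / 1000);
}
```

With 64 KiB of data, the direct loop makes one inner call per byte, while the BufReader version makes only a handful of 8 KiB refills; each avoided call is per-read overhead the decompressor never pays.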
pypy is still 2 times faster than Rust:
Benchmark 1: ./target/release/json 2 repodata.json.gz
Time (mean ± σ): 2.600 s ± 0.021 s [User: 2.369 s, System: 0.208 s]
Range (min … max): 2.571 s … 2.640 s 10 runs
Benchmark 2: ./target/release/json 1 repodata.json.gz
Time (mean ± σ): 6.643 s ± 0.045 s [User: 6.378 s, System: 0.211 s]
Range (min … max): 6.567 s … 6.725 s 10 runs
Benchmark 3: ./target/release/json 0 repodata.json.gz
Time (mean ± σ): 6.652 s ± 0.047 s [User: 6.390 s, System: 0.223 s]
Range (min … max): 6.592 s … 6.750 s 10 runs
Benchmark 4: ./script.py repodata.json.gz
Time (mean ± σ): 2.188 s ± 0.032 s [User: 1.874 s, System: 0.298 s]
Range (min … max): 2.143 s … 2.244 s 10 runs
Benchmark 5: pypy script.py repodata.json.gz
Time (mean ± σ): 1.316 s ± 0.021 s [User: 1.003 s, System: 0.286 s]
Range (min … max): 1.298 s … 1.365 s 10 runs
Summary
'pypy script.py repodata.json.gz' ran
1.66 ± 0.04 times faster than './script.py repodata.json.gz'
1.98 ± 0.03 times faster than './target/release/json 2 repodata.json.gz'
5.05 ± 0.09 times faster than './target/release/json 1 repodata.json.gz'
5.05 ± 0.09 times faster than './target/release/json 0 repodata.json.gz'
zlib-ng and cloudfare_zlib improve the previous result a little.
Thanks again. ./target/release/json 2 is pretty good. pypy's JSON parser is probably not comparable to serde (very different, and doing optimizations that Rust would not consider).
Describe the bug
I was surprised that serde_json::from_reader(niffler from_path) was slower than Python's json.load(gzip.open("repodata.json.gz")).
It helps if you wrap the niffler reader in a let mut buffered = BufReader::new(reader); for a time of real 0m3.448s.
On the other hand, decompressing to std::io::sink is very fast.
nifftime.zip.zip