@nknize we cannot add new features in a patch release (i.e., 2.9.1), only bug fixes and security patches, per SemVer guidelines. We can definitely get the bug fix in, but I am not sure about adding the experimental flag since that might be considered a feature, so we can target those changes only for the next minor release.
...we cannot add new features in a patch release (i.e., 2.9.1) ....per SemVer guidelines. ...I am not sure about adding the experimental flag since that would be considered a feature,...
@bbarani Unfortunately we can't allow ZStd to remain available by default because rolling this back requires users to reindex if they chose these codecs. We have to revert this feature ASAP before users start indexing. This was mistakenly released GA, hence the patch.
Regarding the option to revert opensearch-project/OpenSearch#7908, I just want to note that will leave any users who indexed data with zstd using 2.9.0 in a bad spot. Their only option would be to reindex that data with 2.9.0 before upgrading to 2.9.1, since the 2.9.1 release will not contain the sandbox plugin functionality.
Great point @andrross! I'm adding context from the follow-on discussion on Slack just so we don't lose it.
Barani Bikshandi (Barani): Yeah but can this be remediated with "revert" rather than adding experimental flag for 2.9.1? Basically remove support for zstd in 2.9.1? (edited)
I added this as an alternative to discuss.
One of those reasons is the hard dependency on native compiled code, which often leads to portability issues due to glibc differences
@nknize - What is the general guideline for any native dependency? Are you saying OpenSearch will never support native dependencies going forward?
Move the ZstdCodec and ZstdNoDictCodec out from being a default module into an optional location (e.g., either an optional plugin or library - note that the BlobStoreRepository compression is already in as an optional library, but it's packaged and included by default so we still need to figure out how to make that optional).
What is the intent of this? By moving to optional, are you suggesting users use Zstd at their own risk? Zstd provides a 30% reduction in storage size with latencies as good as best_speed. What is the alternative to this?
Are you saying OpenSearch will never support native dependencies going forward?
As plugins, sure. But likely not as default modules. Plugins are just like modules except users have to opt in by installing them.
By moving to optional, are you suggesting users use Zstd at their own risk?
Depends on how you define "risk". ldd errors due to libc issues are certainly a "risk". An alternative to the jni implementation would be to use the pure java solution. Then we could include it as a module without portability risks.
Zstd provides a 30% reduction in storage size with latencies as good as best_speed.
That's only part of the picture. Retrieval time per 10k docs is nearly five times slower than BEST_SPEED. So it's not without its tradeoffs.
@nknize @backslasht sharing some search numbers from my runs with NYC Taxis
Cluster Configuration:

Server Setup
a. One data node, r5.2xlarge, backed by an EBS GP3 volume of size 100 GB.
b. Three master nodes, r5.xlarge instance type.

Benchmarking Setup
The benchmark was run on a single node of type c5.4xlarge backed by an EBS GP3 volume of size 500 GB.
a. Number of clients: 16; bulk size: 1024.
b. Workload: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis

Index Setup
a. Number of shards: 1
b. Number of replicas: 0
Benchmarks
@sarthakaggarwal97 - Can you add details about the cluster configuration for the run?
Depends on how you define "risk". ldd errors due to libc issues are certainly a "risk". An alternative to the jni implementation would be to use the pure java solution. Then we could include it as a module without portability risks.
Fair point!
Can you add details about the cluster configuration for the run?
I've updated it in my previous comment alongside the numbers, thanks @backslasht
@sarthakaggarwal97 sadly the latency charts do not reflect memory leaks (especially the native one)
An alternative to the jni implementation would be to use the pure java solution.
@nknize afaik the pure Java implementation exists and was evaluated with terrible performance (compared to ZSTD-jni)
...sharing some search numbers from my runs with NYC Taxis
@sarthakaggarwal97 can you add two things to your nyc_taxi benchmark run?

1. this `term` -> `date_histogram` -> `top_hits` agg in #1647
2. `"store": true` to the `vendor_name` field

The reasoning is: regressions have historically occurred when decompressing stored fields and I don't see any stored fields in the nyc_taxis workload. We should make this part of the regular benchmark to prevent us from missing another regression.
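For clarity, here's a rough sketch of the shape of that aggregation using OpenSearch's `AggregationBuilders` (the field names are taken from nyc_taxis; the benchmark workload would define this in its own JSON, so treat this as illustrative only):

```java
import org.opensearch.search.aggregations.AggregationBuilders;
import org.opensearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.opensearch.search.builder.SearchSourceBuilder;

public class TopHitsAggSketch {
    // term -> date_histogram -> top_hits, which forces the fetch phase (and stored
    // field decompression) for every bucket's top hits.
    public static SearchSourceBuilder build() {
        return new SearchSourceBuilder()
            .size(0)
            .aggregation(
                AggregationBuilders.terms("by_vendor").field("vendor_name")
                    .subAggregation(
                        AggregationBuilders.dateHistogram("per_day")
                            .field("pickup_datetime")
                            .calendarInterval(DateHistogramInterval.DAY)
                            .subAggregation(AggregationBuilders.topHits("hits").size(100))));
    }
}
```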
...afaik the pure Java implementation exists and was evaluated with terrible performance (comparing to ZSTD-jni)
@reta Do you have that benchmark handy? I haven't dug around for it and the AirCompressor README touts "...typically 10-40% faster than the JNI wrapper for the native libraries." The only thing I quickly dug up was the comment justifying the purpose of the library, which aligns with our purpose of avoiding strict native dependencies.
@sarthakaggarwal97 It's also worth running luceneutil's `StoredFieldsBenchmark` with the `ZStdCodec`. We could do this in a separate repo and even pull in the AirCompressor pure java implementation for comparison.
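In case it helps whoever picks this up, here's a minimal round-trip sketch against aircompressor's pure java zstd (method names per the library's `Compressor`/`Decompressor` interfaces as I recall them; worth double checking against the repo):

```java
import io.airlift.compress.zstd.ZstdCompressor;
import io.airlift.compress.zstd.ZstdDecompressor;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AirCompressorZstdDemo {
    public static void main(String[] args) {
        byte[] input = "the quick brown fox jumps over the lazy dog".getBytes(StandardCharsets.UTF_8);

        // Compress into a buffer sized by the library's worst-case estimate.
        ZstdCompressor compressor = new ZstdCompressor();
        byte[] compressed = new byte[compressor.maxCompressedLength(input.length)];
        int compressedLength = compressor.compress(input, 0, input.length, compressed, 0, compressed.length);

        // Decompress and verify the round trip.
        ZstdDecompressor decompressor = new ZstdDecompressor();
        byte[] restored = new byte[input.length];
        int restoredLength = decompressor.decompress(compressed, 0, compressedLength, restored, 0, restored.length);

        System.out.println(Arrays.equals(input, Arrays.copyOf(restored, restoredLength)));
    }
}
```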
@reta Do you have that benchmark handy?
Hm ... I've definitely seen it somewhere; there are mentions here https://github.com/apache/lucene/issues/9784#issuecomment-1223937568, and here https://github.com/opensearch-project/OpenSearch/issues/3354 for LZ4 (not applicable to ZSTD).
There are benchmarks available in the repo, I will run them: https://github.com/airlift/aircompressor/blob/master/src/test/java/io/airlift/compress/benchmark/CompressionBenchmark.java
The reasoning is: regressions have historically occurred (https://github.com/opensearch-project/OpenSearch/issues/1647#issuecomment-986964918) when decompressing stored fields and I don't see any stored fields in the nyc_taxis workload. We should make this part of the regular benchmark to prevent us from missing another regression.
@nknize Shouldn't `_source` cover the stored field as it is returned for the configured default/range queries in nyc_taxis?
Sorry this issue was closed automatically after https://github.com/opensearch-project/OpenSearch/pull/9431
Shouldn't `_source` cover the stored field as it is returned for the configured default/range queries in nyc_taxis?
@mgodwan For queries, `_source` is only fetched for the first 10 results. `TopHitsAggregator` is particularly of interest (and adversarial) because it executes the `fetchPhase` for `topDocs.size()` documents when the aggregation is built. This is why the regression linked above wasn't caught before the release: the benchmark workloads didn't test for these potentially "evil" aggregations.
@nknize I've run the aircompressor benchmarks locally on my machine (i7-10750H × 12, 64 GB, Linux).
Here are the compression throughput numbers:
compress airlift_zstd canterbury/alice29.txt 56,746 101.3MB/s ± 796.4kB/s ( 0.77%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/asyoulik.txt 50,753 76.0MB/s ± 9315.6kB/s (11.97%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/cp.html 8,566 93.3MB/s ± 15.5MB/s (16.64%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/fields.c 3,427 102.8MB/s ± 25.6MB/s (24.94%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/grammar.lsp 1,327 85.0MB/s ± 12.5MB/s (14.73%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/kennedy.xls 112,690 184.5MB/s ± 5800.4kB/s ( 3.07%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/lcet10.txt 140,687 104.2MB/s ± 3184.5kB/s ( 2.99%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/plrabn12.txt 191,230 79.8MB/s ± 2585.3kB/s ( 3.16%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/ptt5 54,581 262.0MB/s ± 20.9MB/s ( 7.97%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/sum 13,408 99.3MB/s ± 4520.0kB/s ( 4.45%) (N = 30, α = 99.9%)
compress airlift_zstd canterbury/xargs.1 1,838 92.5MB/s ± 6027.6kB/s ( 6.36%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/dickens 3,666,832 99.5MB/s ± 2951.9kB/s ( 2.90%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/mozilla 18,577,375 124.3MB/s ± 7819.5kB/s ( 6.14%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/mr 3,560,793 115.2MB/s ± 4011.2kB/s ( 3.40%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/nci 2,889,641 400.0MB/s ± 15.0MB/s ( 3.75%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/ooffice 3,147,867 92.3MB/s ± 8095.0kB/s ( 8.56%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/osdb 3,515,524 131.7MB/s ± 4193.8kB/s ( 3.11%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/reymont 1,958,308 139.9MB/s ± 1997.9kB/s ( 1.39%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/samba 5,097,907 187.4MB/s ± 7248.4kB/s ( 3.78%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/sao 5,591,044 62.2MB/s ± 10.1MB/s (16.25%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/webster 12,108,729 105.5MB/s ± 9211.8kB/s ( 8.53%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/x-ray 6,229,560 57.9MB/s ± 7637.8kB/s (12.87%) (N = 30, α = 99.9%)
compress airlift_zstd silesia/xml 641,486 240.9MB/s ± 21.1MB/s ( 8.74%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/bib 37,329 98.0MB/s ± 10154.2kB/s (10.12%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/book1 307,645 85.5MB/s ± 6029.9kB/s ( 6.89%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/book2 205,814 106.1MB/s ± 5918.6kB/s ( 5.45%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/geo 69,249 61.4MB/s ± 1123.1kB/s ( 1.79%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/news 139,728 89.0MB/s ± 8312.1kB/s ( 9.12%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/obj1 10,786 98.6MB/s ± 4930.9kB/s ( 4.88%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/obj2 83,275 116.6MB/s ± 3592.6kB/s ( 3.01%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/paper1 19,694 106.4MB/s ± 3150.5kB/s ( 2.89%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/paper2 30,900 88.2MB/s ± 8110.2kB/s ( 8.98%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/paper3 18,932 90.5MB/s ± 5466.5kB/s ( 5.90%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/paper4 5,760 96.8MB/s ± 1023.9kB/s ( 1.03%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/paper5 5,246 98.5MB/s ± 3130.5kB/s ( 3.10%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/paper6 14,268 105.9MB/s ± 2828.0kB/s ( 2.61%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/pic 54,581 315.1MB/s ± 3718.4kB/s ( 1.15%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/progc 14,421 94.3MB/s ± 8889.6kB/s ( 9.20%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/progl 17,877 147.8MB/s ± 10.5MB/s ( 7.12%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/progp 12,362 160.3MB/s ± 2177.4kB/s ( 1.33%) (N = 30, α = 99.9%)
compress airlift_zstd calgary/trans 20,790 186.2MB/s ± 3830.5kB/s ( 2.01%) (N = 30, α = 99.9%)
compress airlift_zstd artificial/a.txt 14 1050.6kB/s ± 38.1kB/s ( 3.63%) (N = 30, α = 99.9%)
compress airlift_zstd artificial/aaa.txt 26 1813.0MB/s ± 88.1MB/s ( 4.86%) (N = 30, α = 99.9%)
compress airlift_zstd artificial/alphabet.txt 50 1797.3MB/s ± 58.5MB/s ( 3.25%) (N = 30, α = 99.9%)
compress airlift_zstd artificial/random.txt 75,421 317.3MB/s ± 3277.7kB/s ( 1.01%) (N = 30, α = 99.9%)
compress airlift_zstd artificial/uniform_ascii.bin 8,842 226.9MB/s ± 20.2MB/s ( 8.91%) (N = 30, α = 99.9%)
compress airlift_zstd large/bible.txt 1,183,732 129.2MB/s ± 6002.6kB/s ( 4.54%) (N = 30, α = 99.9%)
compress airlift_zstd large/E.coli 1,413,593 142.7MB/s ± 2199.7kB/s ( 1.51%) (N = 30, α = 99.9%)
compress airlift_zstd large/world192.txt 659,802 142.2MB/s ± 3032.7kB/s ( 2.08%) (N = 30, α = 99.9%)
compress airlift_zstd geo.protodata 14,096 344.6MB/s ± 14.6MB/s ( 4.24%) (N = 30, α = 99.9%)
compress airlift_zstd house.jpg 126,974 259.0MB/s ± 9267.6kB/s ( 3.49%) (N = 30, α = 99.9%)
compress airlift_zstd html 14,928 276.9MB/s ± 8902.3kB/s ( 3.14%) (N = 30, α = 99.9%)
compress airlift_zstd kppkn.gtb 41,647 152.9MB/s ± 8482.8kB/s ( 5.42%) (N = 30, α = 99.9%)
compress airlift_zstd mapreduce-osdi-1.pdf 75,523 351.6MB/s ± 5871.6kB/s ( 1.63%) (N = 30, α = 99.9%)
compress airlift_zstd urls.10K 185,565 146.7MB/s ± 1657.5kB/s ( 1.10%) (N = 30, α = 99.9%)
vs
compress zstd_jni canterbury/alice29.txt 56,271 160.6MB/s ± 3078.0kB/s ( 1.87%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/asyoulik.txt 50,363 154.6MB/s ± 1365.1kB/s ( 0.86%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/cp.html 8,465 232.3MB/s ± 8557.1kB/s ( 3.60%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/fields.c 3,379 280.2MB/s ± 10.6MB/s ( 3.79%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/grammar.lsp 1,290 264.9MB/s ± 4227.8kB/s ( 1.56%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/kennedy.xls 111,742 415.8MB/s ± 6385.0kB/s ( 1.50%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/lcet10.txt 139,324 166.4MB/s ± 6885.7kB/s ( 4.04%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/plrabn12.txt 190,276 115.1MB/s ± 15.7MB/s (13.62%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/ptt5 54,446 521.6MB/s ± 44.8MB/s ( 8.59%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/sum 13,375 228.5MB/s ± 6798.6kB/s ( 2.91%) (N = 30, α = 99.9%)
compress zstd_jni canterbury/xargs.1 1,800 231.7MB/s ± 4311.6kB/s ( 1.82%) (N = 30, α = 99.9%)
compress zstd_jni silesia/dickens 3,631,607 164.6MB/s ± 1239.1kB/s ( 0.74%) (N = 30, α = 99.9%)
compress zstd_jni silesia/mozilla 18,487,478 254.5MB/s ± 8180.9kB/s ( 3.14%) (N = 30, α = 99.9%)
compress zstd_jni silesia/mr 3,547,800 194.6MB/s ± 2049.6kB/s ( 1.03%) (N = 30, α = 99.9%)
compress zstd_jni silesia/nci 2,843,995 690.5MB/s ± 7290.3kB/s ( 1.03%) (N = 30, α = 99.9%)
compress zstd_jni silesia/ooffice 3,143,604 153.3MB/s ± 8214.5kB/s ( 5.23%) (N = 30, α = 99.9%)
compress zstd_jni silesia/osdb 3,506,444 221.6MB/s ± 7585.6kB/s ( 3.34%) (N = 30, α = 99.9%)
compress zstd_jni silesia/reymont 1,942,004 200.3MB/s ± 9344.7kB/s ( 4.56%) (N = 30, α = 99.9%)
compress zstd_jni silesia/samba 4,976,446 309.1MB/s ± 30.8MB/s ( 9.97%) (N = 30, α = 99.9%)
compress zstd_jni silesia/sao 5,551,154 123.5MB/s ± 14.9MB/s (12.10%) (N = 30, α = 99.9%)
compress zstd_jni silesia/webster 11,981,183 189.5MB/s ± 7174.5kB/s ( 3.70%) (N = 30, α = 99.9%)
compress zstd_jni silesia/x-ray 6,085,291 124.5MB/s ± 2117.3kB/s ( 1.66%) (N = 30, α = 99.9%)
compress zstd_jni silesia/xml 639,134 502.7MB/s ± 9095.5kB/s ( 1.77%) (N = 30, α = 99.9%)
compress zstd_jni calgary/bib 37,046 192.5MB/s ± 4947.9kB/s ( 2.51%) (N = 30, α = 99.9%)
compress zstd_jni calgary/book1 305,543 144.0MB/s ± 3083.5kB/s ( 2.09%) (N = 30, α = 99.9%)
compress zstd_jni calgary/book2 203,941 178.5MB/s ± 4210.9kB/s ( 2.30%) (N = 30, α = 99.9%)
compress zstd_jni calgary/geo 69,219 114.7MB/s ± 4773.0kB/s ( 4.06%) (N = 30, α = 99.9%)
compress zstd_jni calgary/news 138,021 166.0MB/s ± 5224.0kB/s ( 3.07%) (N = 30, α = 99.9%)
compress zstd_jni calgary/obj1 10,772 198.7MB/s ± 7577.9kB/s ( 3.72%) (N = 30, α = 99.9%)
compress zstd_jni calgary/obj2 83,359 182.4MB/s ± 5750.3kB/s ( 3.08%) (N = 30, α = 99.9%)
compress zstd_jni calgary/paper1 19,511 176.0MB/s ± 5514.1kB/s ( 3.06%) (N = 30, α = 99.9%)
compress zstd_jni calgary/paper2 30,618 168.0MB/s ± 3833.8kB/s ( 2.23%) (N = 30, α = 99.9%)
compress zstd_jni calgary/paper3 18,736 151.9MB/s ± 6142.8kB/s ( 3.95%) (N = 30, α = 99.9%)
compress zstd_jni calgary/paper4 5,679 184.6MB/s ± 4219.1kB/s ( 2.23%) (N = 30, α = 99.9%)
compress zstd_jni calgary/paper5 5,152 195.2MB/s ± 2426.9kB/s ( 1.21%) (N = 30, α = 99.9%)
compress zstd_jni calgary/paper6 14,029 185.4MB/s ± 5817.8kB/s ( 3.07%) (N = 30, α = 99.9%)
compress zstd_jni calgary/pic 54,446 576.1MB/s ± 7905.9kB/s ( 1.34%) (N = 30, α = 99.9%)
compress zstd_jni calgary/progc 14,167 194.4MB/s ± 11.7MB/s ( 6.02%) (N = 30, α = 99.9%)
compress zstd_jni calgary/progl 17,581 271.3MB/s ± 8018.9kB/s ( 2.89%) (N = 30, α = 99.9%)
compress zstd_jni calgary/progp 12,149 289.7MB/s ± 3417.5kB/s ( 1.15%) (N = 30, α = 99.9%)
compress zstd_jni calgary/trans 20,595 331.0MB/s ± 3422.0kB/s ( 1.01%) (N = 30, α = 99.9%)
compress zstd_jni artificial/a.txt 10 1307.6kB/s ± 38.1kB/s ( 2.91%) (N = 30, α = 99.9%)
compress zstd_jni artificial/aaa.txt 22 4320.9MB/s ± 389.5MB/s ( 9.01%) (N = 30, α = 99.9%)
compress zstd_jni artificial/alphabet.txt 46 5031.7MB/s ± 358.4MB/s ( 7.12%) (N = 30, α = 99.9%)
compress zstd_jni artificial/random.txt 75,048 979.2MB/s ± 40.3MB/s ( 4.11%) (N = 30, α = 99.9%)
compress zstd_jni artificial/uniform_ascii.bin 8,838 614.6MB/s ± 4906.5kB/s ( 0.78%) (N = 30, α = 99.9%)
compress zstd_jni large/bible.txt 1,173,315 214.6MB/s ± 2139.8kB/s ( 0.97%) (N = 30, α = 99.9%)
compress zstd_jni large/E.coli 1,392,186 208.4MB/s ± 4167.2kB/s ( 1.95%) (N = 30, α = 99.9%)
compress zstd_jni large/world192.txt 650,815 217.8MB/s ± 6117.0kB/s ( 2.74%) (N = 30, α = 99.9%)
compress zstd_jni geo.protodata 14,079 682.8MB/s ± 29.8MB/s ( 4.37%) (N = 30, α = 99.9%)
compress zstd_jni house.jpg 126,970 881.7MB/s ± 28.2MB/s ( 3.20%) (N = 30, α = 99.9%)
compress zstd_jni html 14,798 512.1MB/s ± 16.3MB/s ( 3.18%) (N = 30, α = 99.9%)
compress zstd_jni kppkn.gtb 40,850 258.4MB/s ± 15.3MB/s ( 5.94%) (N = 30, α = 99.9%)
compress zstd_jni mapreduce-osdi-1.pdf 75,594 515.8MB/s ± 35.0MB/s ( 6.78%) (N = 30, α = 99.9%)
compress zstd_jni urls.10K 183,492 235.6MB/s ± 5904.0kB/s ( 2.45%) (N = 30, α = 99.9%)
Same for decompression:
decompress airlift_zstd canterbury/alice29.txt 56,746 456.4MB/s ± 21.9MB/s ( 4.80%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/asyoulik.txt 50,753 490.3MB/s ± 21.7MB/s ( 4.43%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/cp.html 8,566 634.3MB/s ± 4931.3kB/s ( 0.76%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/fields.c 3,427 551.0MB/s ± 13.6MB/s ( 2.46%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/grammar.lsp 1,327 405.1MB/s ± 11.3MB/s ( 2.79%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/kennedy.xls 112,690 663.5MB/s ± 4931.5kB/s ( 0.73%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/lcet10.txt 140,687 559.4MB/s ± 24.5MB/s ( 4.38%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/plrabn12.txt 191,230 481.3MB/s ± 15.7MB/s ( 3.27%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/ptt5 54,581 1135.9MB/s ± 16.2MB/s ( 1.43%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/sum 13,408 590.6MB/s ± 21.0MB/s ( 3.56%) (N = 30, α = 99.9%)
decompress airlift_zstd canterbury/xargs.1 1,838 372.8MB/s ± 13.3MB/s ( 3.56%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/dickens 3,666,832 503.4MB/s ± 12.1MB/s ( 2.40%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/mozilla 18,577,375 562.5MB/s ± 18.5MB/s ( 3.29%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/mr 3,560,793 525.6MB/s ± 29.3MB/s ( 5.58%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/nci 2,889,641 1141.0MB/s ± 42.9MB/s ( 3.76%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/ooffice 3,147,867 404.2MB/s ± 22.0MB/s ( 5.45%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/osdb 3,515,524 659.0MB/s ± 19.0MB/s ( 2.88%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/reymont 1,958,308 593.1MB/s ± 26.1MB/s ( 4.40%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/samba 5,097,907 755.9MB/s ± 25.1MB/s ( 3.32%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/sao 5,591,044 430.6MB/s ± 17.3MB/s ( 4.02%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/webster 12,108,729 564.4MB/s ± 30.1MB/s ( 5.33%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/x-ray 6,229,560 374.6MB/s ± 15.2MB/s ( 4.06%) (N = 30, α = 99.9%)
decompress airlift_zstd silesia/xml 641,486 1064.9MB/s ± 27.1MB/s ( 2.55%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/bib 37,329 606.5MB/s ± 12.7MB/s ( 2.09%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/book1 307,645 468.3MB/s ± 15.2MB/s ( 3.25%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/book2 205,814 554.2MB/s ± 20.1MB/s ( 3.62%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/geo 69,249 420.8MB/s ± 24.7MB/s ( 5.88%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/news 139,728 580.7MB/s ± 4045.2kB/s ( 0.68%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/obj1 10,786 512.7MB/s ± 15.8MB/s ( 3.08%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/obj2 83,275 563.4MB/s ± 1953.8kB/s ( 0.34%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/paper1 19,694 573.3MB/s ± 3471.3kB/s ( 0.59%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/paper2 30,900 534.0MB/s ± 21.7MB/s ( 4.07%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/paper3 18,932 515.5MB/s ± 11.8MB/s ( 2.30%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/paper4 5,760 437.3MB/s ± 3951.0kB/s ( 0.88%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/paper5 5,246 433.8MB/s ± 7494.2kB/s ( 1.69%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/paper6 14,268 534.5MB/s ± 19.4MB/s ( 3.63%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/pic 54,581 1140.7MB/s ± 14.2MB/s ( 1.24%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/progc 14,421 584.6MB/s ± 5636.1kB/s ( 0.94%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/progl 17,877 693.9MB/s ± 26.3MB/s ( 3.79%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/progp 12,362 762.3MB/s ± 13.3MB/s ( 1.74%) (N = 30, α = 99.9%)
decompress airlift_zstd calgary/trans 20,790 824.6MB/s ± 9032.4kB/s ( 1.07%) (N = 30, α = 99.9%)
decompress airlift_zstd artificial/a.txt 14 42.3MB/s ± 1725.0kB/s ( 3.99%) (N = 30, α = 99.9%)
decompress airlift_zstd artificial/aaa.txt 26 4627.3MB/s ± 69.2MB/s ( 1.50%) (N = 30, α = 99.9%)
decompress airlift_zstd artificial/alphabet.txt 50 4338.0MB/s ± 15.6MB/s ( 0.36%) (N = 30, α = 99.9%)
decompress airlift_zstd artificial/random.txt 75,421 494.6MB/s ± 21.7MB/s ( 4.38%) (N = 30, α = 99.9%)
decompress airlift_zstd artificial/uniform_ascii.bin 8,842 504.1MB/s ± 5961.5kB/s ( 1.15%) (N = 30, α = 99.9%)
decompress airlift_zstd large/bible.txt 1,183,732 620.7MB/s ± 6002.5kB/s ( 0.94%) (N = 30, α = 99.9%)
decompress airlift_zstd large/E.coli 1,413,593 525.6MB/s ± 15.6MB/s ( 2.96%) (N = 30, α = 99.9%)
decompress airlift_zstd large/world192.txt 659,802 677.7MB/s ± 12.6MB/s ( 1.86%) (N = 30, α = 99.9%)
decompress airlift_zstd geo.protodata 14,096 1342.8MB/s ± 8880.6kB/s ( 0.65%) (N = 30, α = 99.9%)
decompress airlift_zstd house.jpg 126,974 9324.2MB/s ± 482.0MB/s ( 5.17%) (N = 30, α = 99.9%)
decompress airlift_zstd html 14,928 1163.7MB/s ± 90.9MB/s ( 7.81%) (N = 30, α = 99.9%)
decompress airlift_zstd kppkn.gtb 41,647 536.1MB/s ± 45.3MB/s ( 8.45%) (N = 30, α = 99.9%)
decompress airlift_zstd mapreduce-osdi-1.pdf 75,523 988.7MB/s ± 164.2MB/s (16.61%) (N = 30, α = 99.9%)
decompress airlift_zstd urls.10K 185,565 538.9MB/s ± 115.6MB/s (21.45%) (N = 30, α = 99.9%)
vs
decompress zstd_jni canterbury/alice29.txt 56,271 855.6MB/s ± 48.4MB/s ( 5.66%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/asyoulik.txt 50,363 1019.2MB/s ± 32.6MB/s ( 3.20%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/cp.html 8,465 1122.5MB/s ± 46.1MB/s ( 4.11%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/fields.c 3,379 958.6MB/s ± 23.5MB/s ( 2.45%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/grammar.lsp 1,290 671.6MB/s ± 14.1MB/s ( 2.10%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/kennedy.xls 111,742 1150.7MB/s ± 46.3MB/s ( 4.02%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/lcet10.txt 139,324 1120.4MB/s ± 32.3MB/s ( 2.88%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/plrabn12.txt 190,276 930.2MB/s ± 31.1MB/s ( 3.34%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/ptt5 54,446 2138.5MB/s ± 52.5MB/s ( 2.45%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/sum 13,375 1273.0MB/s ± 9259.5kB/s ( 0.71%) (N = 30, α = 99.9%)
decompress zstd_jni canterbury/xargs.1 1,800 682.0MB/s ± 10101.9kB/s ( 1.45%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/dickens 3,631,607 882.2MB/s ± 41.8MB/s ( 4.74%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/mozilla 18,487,478 1220.1MB/s ± 46.6MB/s ( 3.82%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/mr 3,547,800 1101.9MB/s ± 26.3MB/s ( 2.38%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/nci 2,843,995 1832.1MB/s ± 98.4MB/s ( 5.37%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/ooffice 3,143,604 905.3MB/s ± 44.9MB/s ( 4.96%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/osdb 3,506,444 1351.4MB/s ± 93.7MB/s ( 6.93%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/reymont 1,942,004 1040.3MB/s ± 49.9MB/s ( 4.80%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/samba 4,976,446 1717.6MB/s ± 15.4MB/s ( 0.90%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/sao 5,551,154 1019.3MB/s ± 3488.9kB/s ( 0.33%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/webster 11,981,183 1034.2MB/s ± 39.7MB/s ( 3.83%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/x-ray 6,085,291 904.8MB/s ± 8415.4kB/s ( 0.91%) (N = 30, α = 99.9%)
decompress zstd_jni silesia/xml 639,134 2177.3MB/s ± 11.8MB/s ( 0.54%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/bib 37,046 1164.0MB/s ± 41.6MB/s ( 3.57%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/book1 305,543 990.4MB/s ± 9123.6kB/s ( 0.90%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/book2 203,941 1133.0MB/s ± 5780.3kB/s ( 0.50%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/geo 69,219 1149.4MB/s ± 32.7MB/s ( 2.85%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/news 138,021 1220.7MB/s ± 7415.9kB/s ( 0.59%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/obj1 10,772 1274.9MB/s ± 4696.8kB/s ( 0.36%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/obj2 83,359 1008.0MB/s ± 39.7MB/s ( 3.94%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/paper1 19,511 1042.7MB/s ± 66.6MB/s ( 6.39%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/paper2 30,618 910.4MB/s ± 124.5MB/s (13.67%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/paper3 18,736 912.9MB/s ± 108.4MB/s (11.87%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/paper4 5,679 728.4MB/s ± 70.5MB/s ( 9.67%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/paper5 5,152 758.3MB/s ± 65.2MB/s ( 8.60%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/paper6 14,029 1112.4MB/s ± 7842.4kB/s ( 0.69%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/pic 54,446 1922.9MB/s ± 206.2MB/s (10.72%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/progc 14,167 664.5MB/s ± 71.1MB/s (10.70%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/progl 17,581 987.9MB/s ± 124.1MB/s (12.56%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/progp 12,149 1029.5MB/s ± 129.4MB/s (12.57%) (N = 30, α = 99.9%)
decompress zstd_jni calgary/trans 20,595 1297.8MB/s ± 145.4MB/s (11.20%) (N = 30, α = 99.9%)
decompress zstd_jni artificial/a.txt 10 1952.4kB/s ± 228.0kB/s (11.68%) (N = 30, α = 99.9%)
decompress zstd_jni artificial/aaa.txt 22 6132.1MB/s ± 174.4MB/s ( 2.84%) (N = 30, α = 99.9%)
decompress zstd_jni artificial/alphabet.txt 46 3642.8MB/s ± 113.3MB/s ( 3.11%) (N = 30, α = 99.9%)
decompress zstd_jni artificial/random.txt 75,048 1281.0MB/s ± 86.5MB/s ( 6.76%) (N = 30, α = 99.9%)
decompress zstd_jni artificial/uniform_ascii.bin 8,838 1206.7MB/s ± 45.6MB/s ( 3.78%) (N = 30, α = 99.9%)
decompress zstd_jni large/bible.txt 1,173,315 1041.1MB/s ± 49.5MB/s ( 4.75%) (N = 30, α = 99.9%)
decompress zstd_jni large/E.coli 1,392,186 995.2MB/s ± 37.7MB/s ( 3.79%) (N = 30, α = 99.9%)
decompress zstd_jni large/world192.txt 650,815 1241.6MB/s ± 56.8MB/s ( 4.57%) (N = 30, α = 99.9%)
decompress zstd_jni geo.protodata 14,079 2804.8MB/s ± 107.9MB/s ( 3.85%) (N = 30, α = 99.9%)
decompress zstd_jni house.jpg 126,970 30.9GB/s ± 1466.7MB/s ( 4.64%) (N = 30, α = 99.9%)
decompress zstd_jni html 14,798 2444.8MB/s ± 78.2MB/s ( 3.20%) (N = 30, α = 99.9%)
decompress zstd_jni kppkn.gtb 40,850 948.1MB/s ± 52.7MB/s ( 5.56%) (N = 30, α = 99.9%)
decompress zstd_jni mapreduce-osdi-1.pdf 75,594 6662.2MB/s ± 108.4MB/s ( 1.63%) (N = 30, α = 99.9%)
decompress zstd_jni urls.10K 183,492 1503.3MB/s ± 53.2MB/s ( 3.54%) (N = 30, α = 99.9%)
As per these numbers, aircompressor looks significantly slower.
@reta what is the fourth column?
@reta what is the fourth column?
You mean the percentage (like `( 3.54%)`)? That is `stats.getMeanErrorAt(0.999) * 100 / stats.getMean()`; the whole output comes from this:
int compressSize = compressSize(algorithm, name);
System.out.printf(" %-10s %-22s %-25s %,11d %10s ± %11s (%5.2f%%) (N = %d, \u03B1 = 99.9%%)\n",
        result.getPrimaryResult().getLabel(),
        algorithm,
        name,
        compressSize,
        Util.toHumanReadableSpeed((long) stats.getMean()),
        Util.toHumanReadableSpeed((long) stats.getMeanErrorAt(0.999)),
        stats.getMeanErrorAt(0.999) * 100 / stats.getMean(),
        stats.getN());
Ah.. it's `compressSize`. I dropped an issue on the Aircompressor repo about these numbers. Hopefully we can reconcile this and see if there are any corners missed that might squeeze out better speed. I have to imagine SIMD could help? Maybe better parallelize the compression through ByteVector usage? I'm naively optimistic about that.
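To make the ByteVector idea concrete, here's a tiny sketch of the kind of kernel the Panama Vector API enables (a vectorized common-prefix/match-length scan, the sort of loop LZ-style match finders spend time in). This is purely illustrative, not aircompressor code, and needs `--add-modules jdk.incubator.vector`:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MatchLengthSketch {
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Length of the common prefix of a and b, comparing SPECIES.length() bytes per step.
    static int commonPrefix(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        int i = 0;
        for (int bound = SPECIES.loopBound(len); i < bound; i += SPECIES.length()) {
            ByteVector va = ByteVector.fromArray(SPECIES, a, i);
            ByteVector vb = ByteVector.fromArray(SPECIES, b, i);
            VectorMask<Byte> neq = va.compare(VectorOperators.NE, vb);
            if (neq.anyTrue()) {
                return i + neq.firstTrue();  // index of the first mismatching lane
            }
        }
        while (i < len && a[i] == b[i]) {    // scalar tail
            i++;
        }
        return i;
    }
}
```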
@nknize @reta @backslasht @shwetathareja
With a cluster configuration similar to the one mentioned above, alongside Nick's recommendations:
- this term -> date_histogram -> top_hits agg in https://github.com/opensearch-project/OpenSearch/issues/1647#issue-1069897761?
- "store": true to the vendor_name field
These were the search numbers I got from the benchmarks with the NYC Taxis dataset:
LZ4
50th percentile service time,top_agg_hits,1256.558088120073,ms
90th percentile service time,top_agg_hits,1278.5834005102515,ms
100th percentile service time,top_agg_hits,1801.7356134951115,ms
ZSTD NO DICT
50th percentile service time,top_agg_hits,2341.6134444996715,ms
90th percentile service time,top_agg_hits,2377.105840295553,ms
100th percentile service time,top_agg_hits,2634.960106573999,ms
ZSTD
50th percentile service time,top_agg_hits,3103.5107178613544,ms
90th percentile service time,top_agg_hits,3153.532434348017,ms
100th percentile service time,top_agg_hits,3601.2025149539113,ms
ZLIB
50th percentile service time,top_agg_hits,3411.0017553903162,ms
90th percentile service time,top_agg_hits,3443.0927176959813,ms
100th percentile service time,top_agg_hits,3785.180849954486,ms
In the profiles, around 4% of the CPU was consumed for decompression with LZ4 (Best Speed), 11% for ZSTD_NO_DICT, 16% for ZSTD, and 29% for ZLIB (Best Compression).
Thanks @sarthakaggarwal97. Can you please create a separate issue to track the performance of Zstd decompress?
@nknize - Given that we have decided to move zstd codec as a plugin in next version (2.10) and we already have fixes for the memory leak opensearch-project/OpenSearch#9403 and upgraded zstd to 1.5.5-5 opensearch-project/OpenSearch#9431, do you still think that zstd needs to be marked as experimental?
@backslasht added opensearch-project/OpenSearch#9456 to track performance of Zstd decompress
...do you still think that zstd needs to be marked as experimental?
Yes, even moving to a plugin I think we should start with an experimental flag. It's easy to remove experimental; it's hard to revert (as this situation has shown). I think we should keep it experimental until either 1. there's more evidence the JNI is reliable and won't melt a node (this includes determining whether we should add an expert setting to select the direct/indirect ctor), or 2. the AirCompressor pure java implementation proves to be reasonably competitive.
Re 2: I think we could potentially contribute to that project and explore SIMD optimizations. This would have two major benefits: 1. we get a pure java zstd for OpenSearch that would move back to a module (since performance looks promising), and 2. we could contribute this upstream to Lucene and not have to carry a custom ZStdCodec implementation in OpenSearch.
With all of this uncertainty, we absolutely should keep experimental for a while.
Updating this issue from some recent slack discussion. Per the issue description, we do not plan to demote this to a plugin in a 2.9.1 patch release. We need to feature flag the `zstd` and `zstd_no_dict` codec options (default disabled) in a 2.9.1 patch, annotate the options with `@ExperimentalApi`, and continue the "demotion to plugin" discussion for the 2.10.0 minor release. This keeps the 2.9.1 change minimal while protecting users from inadvertently using `zstd` (explicit opt in required using the feature flag) and buys us runway to decide whether this remains a module or moves to a plugin in a followup minor release. I'm happy to open a PR to feature flag it (per our typical RTC PR approval process)... but I don't want to take that contribution opportunity from anyone else, so here's a call to the community to pick up that PR and earn some merit!
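To make the proposal concrete, a hypothetical sketch of the opt-in guard (the flag name comes from the issue description; the class and the CodecService wiring are illustrative, not the actual implementation):

```java
import org.opensearch.common.settings.Settings;

public final class ZstdExperimentalFlag {
    /** Flag name proposed in the issue description; disabled by default. */
    public static final String ZSTD_COMPRESSION_EXPERIMENTAL =
        "opensearch.experimental.feature.compression.zstd.enabled";

    private ZstdExperimentalFlag() {}

    /** True only when the user has explicitly opted in. */
    public static boolean isZstdAllowed(Settings settings) {
        return settings.getAsBoolean(ZSTD_COMPRESSION_EXPERIMENTAL, false);
    }
}
```

The CodecService constructor (and `CompressionProvider.getCompressors()`) would then skip registering `zstd`/`zstd_no_dict` unless this returns true, and emit a deprecation warning pointing at the planned plugin when it does.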
Release issue for 2.9.1 version - https://github.com/opensearch-project/opensearch-build/issues/3820 CC: @peterzhuamazon
Some thoughts on native dependencies.
First, AFAIK the rules discussed above are not spelled out anywhere. When zstd came along we said yes to the native dependencies in those PRs. @nknize, since you have strong opinions about the topic, maybe add them to the developer guide so we can all have some kind of agreement on what is ok and what is not? I don't see why a native plugin would be ok but a native module would not, for example. (I agree that there's portability risk; however, I don't know any specific examples where it's an actual problem.)
I'd add that since Lucene is a hard "no" on native components, in general accepting native code in OpenSearch creates a lot of opportunities for including code such as this one that has no other place to go in the ecosystem, but can improve performance, which means more happy users.
Then, we've had native libraries in plugins that are included by default in the OpenSearch distribution since the beginning (k-nn is the obvious example). And while we distribute the -min version, the product that most users download is the one with those plugins. It's also the product being run through batteries of tests. Finally, there's plenty of discussions about merging some of those plugins, including k-nn, into core, so we should have guidelines on native libraries before we have those discussions.
...since you have strong opinions about the topic, maybe add them to the developer guide...
+1. I agree we should probably codify this to prevent future traps.
...I agree that there's portability risk, however I don't know any specific examples where it's an actual problem
Two examples are provided in the description where linking native code as an OpenSearch module (modules are installed by default) would've broken the min distribution. This may not always be the case though; Project Panama's improvements to HotSpot's interaction with native libraries are coming along and may provide enough improvement in JDK 22+ to reconsider this stance. In the meantime we should bias toward risk aversion by using plugins.
...in general accepting native code in OpenSearch creates a lot of opportunities
Absolutely! Having them as optional plugins prevents unexpected ldd or runtime behaviors beyond just linking failures (e.g., performance regressions due to native .so differences across platforms, the same portability reasons Elastic still uses their own JNA).
...we've had native libraries in plugins that are included by default in the OpenSearch distribution since the beginning (k-nn is the obvious example).
These OpenSearch plugins are also still optional plugins (not installed by default), not modules.
...It's also the product being run through batteries of tests.
Progress has been made but there are still many gaps, e.g., we were unaware of incompatibilities installing the KNN Engine with the CCR Engine until I saw it during an unrelated issue, and even then we only hacked a patch by creating a CodecService extension point. You still can't create more than one engine and this doesn't seem to be "battle tested" anywhere. There's also a large handful of plugins still not using qa testing for mixed cluster or comprehensive bwc testing. There are a lot of opportunities to improve here, which is why we need to bias toward plugins and not modules until we can get broader integration and cluster test coverage.
...discussions about merging some of those plugins, including k-nn, into core,
Those discussions consist of leveraging Lucene's pure java HNSW implementation as the default in core and keeping the FAISS and nmslib native builds in the bundled plugin as an optional "performance boost". Tradeoffs of using native code should not be taken lightly.
Maybe we can consider a roll-forward option that is lower risk in some ways and contains fewer changes in the patch:
The effect of this approach is that a user who is benefitting from zstd will not be ambushed by a disable-via-feature-flag in a patch.
It is also a tie-ourselves-to-the-mast approach to introducing a superior compression algorithm. We have testing in place to ensure that it works, and while there were holes this time around, that doesn't mean we are incapable of introducing this feature, which is around a 15% throughput improvement for some workloads. It may be work to maintain native code across platforms, but it's work worth doing for anyone who benefits from that kind of speedup.
Thanks @wbeckler for bringing up very valid points. I like the idea of not moving ZSTD behind the experimental flag but releasing a patch version with the fixes in.
For ZSTD as a codec, exhaustive testing was already done as part of opensearch-project/OpenSearch#7908. Considering the above points, I don't see a strong reason to move it to experimental as part of the patch release.
Per semver, AFAIK we cannot be adding/removing a feature in a patch release, as @bbarani points out; we can only make security fixes. It's important we earn users' trust with semver consistently. Occasionally we do ship critical bug fixes in a patch release, so the memory leak fix may qualify. I agree with @backslasht and @wbeckler and propose we "do nothing else" and close this issue.
If we strongly believe that we should have never shipped a feature that relies on a native library (I don't agree, but open to discussing and changing my mind in a separate PR that spells it out), we can make any breaking change we want in 3.0.
I agree that the reliance on a native lib is not a good reason to revert the GA status of the feature. If we get consensus that having a native lib in a module is bad, then I think we should live with it until the 3.0 release for the reasons cited above.
To revert the GA status of the feature, I think the criteria needs to be that we think the current zstd feature is unsuitable for production use even after the two fixes (#9403 and opensearch-project/OpenSearch#9431) are released in a patch. I'm not convinced that is the case so I'm currently in favor of not moving ZSTD behind an experimental flag in the patch.
+1 We shouldn't move to experimental unless we are aware of any known issues. It should be good for production usage with zstd version upgrade and memory leak issue fix. Moving it to plugin in 2.x would break existing users. It should be done in 3.0.
From the conversations above, I see there is consensus to keep the zstd codec in the core for 2.x versions and revisit via a separate issue for 3.0. Please call out here with valid justification if you think otherwise.
...there is consensus to keep the zstd codec in the core for 2.x versions and revisit via a separate issue for 3.0.
Strong -1 (veto).
This change is codifying a Lucene codec in a minor release that Lucene themselves have rejected upstream for valid technical reasons I won't copy/paste here. OpenSearch should similarly hold a higher quality bar here and do the opposite of what is being suggested: discuss codifying this native codec in 3.0 and make it optional in 2.x.

From a pure technical perspective, we have not quantified the native memory impact this codec carries, nor do we have adequate circuit breakers to trip and prevent the OS from killing the node if/when the machine's direct memory is exhausted due to off heap buffer allocation. Additionally, this was an accidental premature promotion of a feature to GA in a minor. It does not fall in the definition of semver; in fact the opposite: it introduces an untested feature that affects on-disk representation and cannot be backed out in a rolling upgrade. It should be treated similar to a CVE and we should hot patch in 2.9.1 and reassess for 3.0.

Choosing to keep this as a top level feature is not only reckless, it sets a precedent that top level mandatory core features can introduce native library dependencies with little to no performance analysis or bar raising. We owe our OpenSearch users better than that.
I'm not sure many, if any, non-AWSers are tracking here (likely because that top of funnel is still not there, for reasons like this), but I'd like to hear from them on this issue. It seems I'm in the minority of the Amazonians when it comes to the problems introduced by shipping a core with a mandatory native library integration. Interestingly enough, I'm also one of the only ones that personally went through this on Elasticsearch, and it's not something I desire to have happen to OpenSearch. Lucene doesn't prohibit native library linking because they feel like it and want to make things more difficult. There is a long tail technical history of direct memory allocation killing nodes, as I've described here.
Update:
...that doesn't mean we are incapable of introducing this feature, which is around a 15% throughput improvement for some workloads. It may be work to maintain native code across platforms, but it's work worth doing for anyone who benefits from that kind of speedup.
The intentions here are good; however, I'll point out that moving to an optional plugin is still releasing the feature... but with the added benefit that users HAVE to opt in. What happens if a downstream takes a hard dependency on ZStd because they see it's a module installed by default? Then a libc difference breaks their plugin and they don't know why and can't roll back? Moving to an optional plugin doesn't stop them from taking a dependency, but it will fail if they don't explicitly install the plugin. That's a good guard rail to have for something as trappy as this.
Additionally this was an accidental premature promotion of a feature to GA in a minor. It does not fall in the definition of Semver, in fact the opposite: it introduces an untested feature that affects on-disk representation that cannot be backed out in a rolling upgrade.
As a non-AWS guy, that to me is the core of the problem. We should not have promoted this feature from sandbox to GA (with zero community feedback). I do share the guilt and don't feel good that we have to take a few steps back, but we should fix the hiccup while it is not too late.
Strong -1 (veto).
@nknize We've never had a veto-type discussion in this project since the fork. So I am not sure how we move forward.
The rollback problem is a real one; open an issue (also, how does it work for other new codecs?). On breaking semver, two wrongs don't make a right, so I don't agree it's worth doing another wrong for an issue that is not a security problem (the comparison to a CVE is baseless).
We should not have promoted this feature from sandbox to GA (with zero community feedback).
@reta Maybe, but we did. We can call it a mistake, but it's too late, because now we would have to break semver. I don't understand how we can claim this doesn't break semver when a user that enabled this feature cannot upgrade without breaking their system.
Finally, I don't agree that arguments in https://issues.apache.org/jira/browse/LUCENE-8739 apply here at all, because the majority of users download the complete distribution of OpenSearch that includes plugins, and many of those plugins include native code (e.g. k-nn). AFAIK that has had no serious consequences. I think allowing native code in core is an advantage, and the core engine is nothing special vs. these "default" plugins.
We've never had a veto-type discussion in this project since the fork. So I am not sure how we move forward.
We use them on nominations. We also use them in our Review then Commit policy on PRs, where we require at least one approval and no changes / unresolved conversations (blocking vetoes). The ASF has a good definition of Review then Commit that I think we should lean into.
The rollback problem is a real one, open an issue
This is the rollback issue. Do you mean open a PR?
..also how does it work for other new codecs?
Depends on if there is a native dependency.
On breaking semver, two wrongs don't make a right, so I don't agree it's worth doing another wrong...
Guard rails to protect a user's node from dying don't seem like a wrong? Maybe that's just me. I prefer to bias toward opt in if there's a chance a node could be killed due to unbounded native mallocs.
...it's too late, because now we would have to break semver.
Semver (which, as discussed many times before, we don't even enforce) isn't a hill worth dying on when it comes to the probability of real cluster crashes... we need to consider what our priorities are.
...that includes plugins, and many of those plugins include native code
Again... even those plugins... are not installed by default, so the guard rails are up and OpenSearch users have to explicitly opt in.
Again... even those plugins... are not installed by default, so the guard rails are up and OpenSearch users have to explicitly opt in.
The k-nn plugin is installed by default when you download it from either the distribution on opensearch.org or Docker; try this:
docker pull opensearchproject/opensearch:latest
docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:latest
[2023-08-28T22:03:01,705][INFO ][o.o.p.PluginsService ] [08af92867f17] loaded module [geo]
...
[2023-08-28T22:03:01,707][INFO ][o.o.p.PluginsService ] [08af92867f17] loaded plugin [opensearch-knn]
We use them on nominations. We also use them in our Review then Commit policy on PRs, where we require at least one approval and no changes / unresolved conversations (blocking vetoes). The ASF has a good definition of Review then Commit that I think we should lean into.
If it's not obvious yet, I have a strong (veto, -1) opinion about breaking semver. We absolutely enforce it with functionality, external APIs (REST), and more. So this poses a certain dilemma about how to decide, no?
We absolutely enforce it with functionality, external APIs (REST), and more. Where did we not?
We try to support it with DSL (as long as a developer remembers to put `.onOrAfter()`) and we get it as part of Lucene's index file metadata checks. Outside that we have little to no support for semver beyond maintainers reviewing PRs for any violations. That's clearly not working, as otherwise this wouldn't have happened. But it did, and because it slipped through we need to hot patch.
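For anyone following along, this is the kind of guard being referred to; a minimal sketch only, with a hypothetical optional field, and the `StreamOutput` package location has moved around across 2.x releases:

```java
import java.io.IOException;

import org.opensearch.Version;
import org.opensearch.common.io.stream.StreamOutput;

public class ExampleRequest {
    private final String newOptionalField; // hypothetical field added in 2.9.0

    public ExampleRequest(String newOptionalField) {
        this.newOptionalField = newOptionalField;
    }

    public void writeTo(StreamOutput out) throws IOException {
        // Only send the new field to nodes that are new enough to read it;
        // forgetting this check is how mixed-cluster (rolling upgrade) breaks slip in.
        if (out.getVersion().onOrAfter(Version.V_2_9_0)) {
            out.writeOptionalString(newOptionalField);
        }
    }
}
```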
@reta Maybe, but we did. We can call it a mistake, but it's too late, because now we would have to break semver. I don't understand how we can claim it doesn't when a user that enabled this feature cannot upgrade without breaking their system.
We do break semver and I feel bad about it. Why I think it is not too late: we made just one release with ZSTD, and yes, it will affect some users, but in the end it will reward more users: the feature will be promoted to GA when it is at least well tested and verified. Just my few cents, but I share your sentiment - I don't like that we have to do that, but there are possible risks we should better be ready for.
...maintainers reviewing PRs for any violations. That's clearly not working as this wouldn't have happened. But it did and because it slipped through we need to hot patch.
I don't think this was a case of accidental promotion and I don't think the promotion itself was a violation of semver, as we do have precedent for launching new opt-in features in minor versions that previous versions would not be able to handle.
However, we have absolutely identified gaps since the launch of this feature. Some have been mitigated, but to summarize the outstanding technical issues from the comments above:
That brings me back to my question: is this feature safe for production usage after the patch? @nknize obviously feels strongly that the answer is no, and I think @reta is in the "no" camp as well. @backslasht / @shwetathareja do you have any counter points to mitigate these concerns?
I don't think this was a case of accidental promotion.
The promotion in and of itself is not a violation of semver. The decision to promote based on long tail semver requirements of deprecation and removal absolutely would have been raised and prevented the premature promotion. Now we're in the position of hot patching to demote. Which is not uncharted territory in the software world, nor is it that complicated. So I don't understand the position of using semver as an argument to carry a trappy untested feature that never should've been promoted in the first place. It was a mistake to promote this untested feature, mistakes happen. Hot patch to fix the mistake and let users that care opt in and move on. I'm amazed it's this controversial.
One option to consider to prevent the OS from force killing the JVM due to direct memory OOM would be to add a default `-XX:MaxDirectMemorySize=` at startup. What default is chosen matters (e.g., at Elastic we chose `0.5 * max heap`, which was recommended to be no more than 50% of total machine memory; this caused some user bug reports when it was first introduced, however). The side effect of that would be increased page cache misses on MMapDirectory, which is now in the critical path for any future Lucene files not explicitly set to use NIOFS. Paper cuts add up...
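As a small aside on observability: the JDK does expose the NIO direct buffer pool via JMX, which could feed a future circuit breaker. A minimal probe (note this only sees direct ByteBuffers; memory a JNI library allocates with plain malloc is invisible here, which is part of the problem):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        // The "direct" pool tracks direct ByteBuffers, which -XX:MaxDirectMemorySize bounds.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("direct buffers: used=%,d bytes, capacity=%,d bytes, count=%,d%n",
                    pool.getMemoryUsed(), pool.getTotalCapacity(), pool.getCount());
            }
        }
    }
}
```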
I agree with @andrross, this was not accidentally promoted. There was no guideline on the native usage within the core when the original PR was approved and subsequently released in 2.9. As a community, I'm ready to accept new restrictions (no native usage in core), provided we have a discussion around that.
Taking a step back, there are at least two different things being discussed here.
On #1, I agree with @dblock that this breaks semver and we shouldn't do it unless we strongly feel there should be no native code in core. On #2, there are valid concerns on the memory leak, which is fixed now. Beyond that, there are efforts in opensearch-project/OpenSearch#9502 to measure the leak, which are still in progress. I think we should wait for that to complete and call it experimental only if we still see memory related issues. If you think we need to do more tests, let's add it to opensearch-project/OpenSearch#9502.
Working backward from users, the ZSTD codec is not enabled by default and is still optional for users to enable per index. And in case a user is going to try it out, at that point it doesn't matter whether the code was distributed as core or a plugin; it will run in the same process. What is important (as others also called out) is: how do we ensure it is production ready? Why do we think we will not have the same conversation in 3.0 if we mark it experimental now? We should run long running tests with different workloads to catch any memory leak issues. Also, brainstorm what kind of testing we should do that provides us sufficient confidence.
My 2 cents as an OpenSearch user that runs multiple clusters in production. It's surprising to hear that this feature was promoted accidentally since there was a blog post where zstd was clearly highlighted. To me it seems like this feature will benefit many users, as it is both a storage saving and a performance improvement feature. We enabled the zstd codec as soon as we upgraded to 2.9; although we reverted the codec as it caused issues for us due to the memory leak, we are still looking forward to using it once a patch version is released.
Hindsight is of course 20/20, but....:
Unfortunately this code was prematurely released GA when it was migrated as a top level module a short time later in opensearch-project/OpenSearch#7908 without any feature flags.
Besides discussions about native code in core, and semver (which we already don't succeed in testing/achieving/enforcing today?) somehow weirdly being able to block fixing a clear mistake, this seems quite reckless: making such a major, index-format-impacting, native-code-wrapping change so shortly after the feature was birthed into Sandbox, without gating it with a feature flag and marking the Codec as experimental?
We should step back and understand why this happened and how to be more careful/explicit in the future. Things start and live and mature in Sandbox for a good reason: serious issues like memory leaks and severe bugs get ferreted out there. When things get promoted out of Sandbox, they are marked experimental and gated with opt-in feature flags. When they are adopted and loved by users, and there are not serious issues, the experimental caveat can be removed and the feature lives on.
It also sounds like we have poor testing around Zstd. Why didn't our tests find these issues? Does OpenSearch randomly rotate all Codecs in when running tests, like Lucene's tests? We shouldn't promote something so fundamental as Zstd out of Sandbox without strong test coverage. That needs to be addressed too? What other issues are lurking with this Zstd implementation? We cannot claim this is production ready if we have not tested it properly. It is still experimental and we should tell our users so.
What about backwards compatibility? Does OpenSearch guarantee this? The release policy description does not seem to say one way or another about the search index, or perhaps implies that major releases can break searching of indices from the last major release? But if so, how is a rolling upgrade done ... if a user builds a Zstd compressed index today, can they continue to search it in future major OpenSearch releases? For how long? (Lucene has strict policies here... the index written by the default Codec will still be readable in the next major release series, but not beyond).
Clearly Zstd is an important feature! Better and faster compression. Of course all users would choose this :) I hope Lucene will adopt it by default, eventually. But it's clearly not there yet, and OpenSearch users should know they are explicitly opting into an experimental feature in the meantime.
Anyway, in addition to fixing the root cause here ("why did we so recklessly promote a feature out of Sandbox w/o proper testing, gating with a feature flag, marking experimental"), +1 to @nknize's opening proposal -- gate Zstd behind an opt-in feature flag, make it clear that it's still experimental, no back compat promise, etc. in a quick bug fix release. With such severe issues (memory leaks, exceptions) users cannot possibly succeed in using this in production anyways, so the semver argument against correcting this mistake makes zero sense to me.
Is your feature request related to a problem? Please describe.
ZStd was introduced as an experimental compression option for Lucene indexes in opensearch-project/OpenSearch#3577. This brought the implementation in as a `module` (installed by default) under the `sandbox` directory. However, the feature would remain "expert" optional, as it would only be installed if users built the distribution themselves and passed `sandbox.enabled=true` at JVM startup.

Unfortunately this code was prematurely released GA when it was migrated as a top level module a short time later in opensearch-project/OpenSearch#7908 without any feature flags. This has now led to users reporting memory leaks along with several other bugs in the zstd jni library. Additionally, Lucene has a long running discussion on the pros/cons of Zstd as an index compression option, including the reasons it's not provided as a core capability. One of those reasons is the hard dependency on native compiled code, which often leads to portability issues due to glibc differences. These issues have been realized several times both in the OpenSearch bundle (e.g., see the KNN issue where users can't compile on M1 Mac) and the legacy codebase.
For these reasons, we need to address the premature promotion of the zstd library as a GA / LTS feature in core, and (at minimum) release a patched bundle that fixes the critical performance issues and bugs.
Describe the solution you'd like
Quickly build and release a 2.9.1 bundle distribution with five patches.

1. Add a `ZSTD_COMPRESSION_EXPERIMENTAL = "opensearch.experimental.feature.compression.zstd.enabled"` feature flag that is set to `false` by default (forcing users to opt in).
2. Apply the `ZSTD_COMPRESSION_EXPERIMENTAL` feature flag both to the CodecService constructor and the `CompressionProvider.getCompressors()` (used for BlobStoreRepository compression).
3. Add a `DeprecationLogger` message that the zstd feature will be moved to a plugin in the next release.
4. Upgrade zstd-jni from `1.5.5-3` to `1.5.5-5` opensearch-project/OpenSearch#9431 (NEEDS TO BE BACKPORTED).

In 2.10.0 (or later even) we should decide the following:
Move the `ZstdCodec` and `ZstdNoDictCodec` out from being a default module into an optional location (e.g., either an optional plugin or library - note that the BlobStoreRepository compression is already in as an optional library, but it's packaged and included by default so we still need to figure out how to make that optional).

Describe alternatives you've considered