opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Demote] ZStd compression from GA / LTS to experimental and release a 2.9.1 patch #9422

Closed nknize closed 8 months ago

nknize commented 9 months ago

Is your feature request related to a problem? Please describe.

ZStd was introduced as an experimental compression option for Lucene indexes in opensearch-project/OpenSearch#3577. This brought the implementation as a module (installed by default) under the sandbox directory. However, the feature would remain "expert" optional as it would only be installed if users built the distribution themselves and passed sandbox.enabled=true at JVM startup.

Unfortunately this code was prematurely released GA when it was migrated as a top level module a short time later in opensearch-project/OpenSearch#7908 without any feature flags. This has now led to users reporting memory leaks along with several other bugs in the zstd jni library. Additionally, Lucene has a long-running discussion on the pros/cons of Zstd as an index compression option, including the reasons it's not provided as a core capability. One of those reasons is the hard dependency on natively compiled code, which often leads to portability issues due to glibc differences. These issues have been realized several times, both in the OpenSearch bundle (e.g., see the KNN issue where users can't compile on M1 Mac) and in the legacy codebase.

For these reasons, we need to address the premature promotion of the zstd library as a GA / LTS feature in core, and (at minimum) release a patched bundle that fixes the critical performance issues and bugs.

Describe the solution you'd like

Quickly build and release a 2.9.1 bundle distribution with five patches.

  1. Zstd-jni memory leak fix: opensearch-project/OpenSearch#9403
  2. Add a new ZSTD_COMPRESSION_EXPERIMENTAL = "opensearch.experimental.feature.compression.zstd.enabled" feature flag that defaults to false (forcing users to opt in).
  3. Apply the ZSTD_COMPRESSION_EXPERIMENTAL feature flag to both the CodecService constructor and CompressionProvider.getCompressors() (used for BlobStoreRepository compression).
  4. Add a DeprecationLogger message that the zstd feature will be moved to a plugin in the next release (items 2-4 are sketched after this list).
  5. Bump the zstd library dependency from 1.5.5-3 to 1.5.5-5 opensearch-project/OpenSearch#9431 (NEEDS TO BE BACKPORTED)
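
Loosely, items 2-4 might be wired as follows. This is a rough sketch, not the final implementation: FeatureFlags and DeprecationLogger are existing core utilities, but the registration code and the "zstd_codec_module" deprecation key below are assumptions for illustration.

    // Proposed flag constant (name taken from item 2 above).
    public static final String ZSTD_COMPRESSION_EXPERIMENTAL =
        "opensearch.experimental.feature.compression.zstd.enabled";

    // Hypothetical gating inside the CodecService constructor: the zstd codecs
    // are registered only when users explicitly opt in via the feature flag.
    if (FeatureFlags.isEnabled(ZSTD_COMPRESSION_EXPERIMENTAL)) {
        codecs.put("zstd", new ZstdCodec());
        codecs.put("zstd_no_dict", new ZstdNoDictCodec());
        // Item 4: warn that the feature is slated to move to a plugin.
        deprecationLogger.deprecate(
            "zstd_codec_module",
            "The zstd and zstd_no_dict codecs are experimental and will be moved to a plugin in a future release"
        );
    }
    // The same flag check would guard CompressionProvider.getCompressors()
    // for BlobStoreRepository compression (item 3).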

In 2.10.0 (or later even) we should decide the following:

  1. Move ZstdCodec and ZstdNoDictCodec out of the default module set into an optional location (e.g., an optional plugin or library). Note that the BlobStoreRepository compression is already an optional library, but it's packaged and included by default, so we still need to figure out how to make that optional.
  2. Decide whether to switch to direct memory or introduce an expert setting that gives users the option to use direct or heap memory when using ZStd compression.

Describe alternatives you've considered

  1. Move ZStd codec and BlobStore compression to a module in a patch release.
  2. Revert opensearch-project/OpenSearch#7908 in a 2.9.1 patch release.

bbarani commented 9 months ago

@nknize we cannot add new features in a patch release (i.e., 2.9.1), only bug fixes and security patches, as per SemVer guidelines. We can definitely get the bug fix in, but I am not sure about adding an experimental flag since that might be considered a feature, so we can target those changes only for the next minor release.

nknize commented 9 months ago

...we cannot add new features in a patch release (i.e., 2.9.1) ...per SemVer guidelines. ...I am not sure about adding an experimental flag since that might be considered a feature...

@bbarani Unfortunately we can't allow ZStd to remain available by default because rolling this back requires users to reindex if they chose these codecs. We have to revert this feature ASAP before users start indexing. This was mistakenly released GA, hence the patch.

andrross commented 9 months ago

Regarding the option to revert opensearch-project/OpenSearch#7908, I just want to note that it will leave any users who indexed data with zstd on 2.9.0 in a bad spot. Their only option would be to reindex that data with 2.9.0 before upgrading to 2.9.1, since the 2.9.1 release will not contain the sandbox plugin functionality.

nknize commented 9 months ago

Great point @andrross! I'm adding context from the follow-on Slack discussion just so we don't lose it.

Barani Bikshandi (Barani): Yeah but can this be remediated with "revert" rather than adding an experimental flag for 2.9.1? Basically remove support for zstd in 2.9.1?

I added this as an alternative to discuss.

backslasht commented 9 months ago

One of those reasons is the hard dependency on native compiled code, which often leads to portability issues due to glibc differences

@nknize - What is the general guideline for any native dependency? Are you saying OpenSearch will never support native dependencies going forward?

Move the ZstdCodec and ZstdNoDictCodec out from being a default module into an optional location (e.g., either an optional plugin or library - note that the BlobStoreRepository compression is already in as an optional library, but its packaged and included by default so we still need to figure out how to make that optional).

What is the intent of this? By moving it to optional, are you suggesting users use Zstd at their own risk? Zstd provides a 30% reduction in storage size with latencies as good as best_speed. What is the alternative?

nknize commented 9 months ago

Are you saying OpenSearch will never support native dependencies going forward?

As plugins, sure. But likely not as default modules. Plugins are just like modules except users have to opt in by installing them.

By moving to optional are you suggesting users to use Zstd at their own risk?

Depends on how you define "risk". ldd errors due to libc issues are certainly a "risk". An alternative to the JNI implementation would be to use the pure Java solution. Then we could include it as a module without portability risks.
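
For concreteness, a minimal round trip with aircompressor's pure Java zstd might look like the sketch below; this assumes the io.airlift:aircompressor dependency is on the classpath and is an illustration, not a proposal for the codec wiring:

    import io.airlift.compress.zstd.ZstdCompressor;
    import io.airlift.compress.zstd.ZstdDecompressor;

    import java.nio.charset.StandardCharsets;

    public class PureJavaZstdRoundTrip {
        public static void main(String[] args) {
            byte[] input = "example stored field payload".getBytes(StandardCharsets.UTF_8);

            // Compress: no JNI, no native .so, so no glibc portability concerns.
            ZstdCompressor compressor = new ZstdCompressor();
            byte[] compressed = new byte[compressor.maxCompressedLength(input.length)];
            int compressedLength = compressor.compress(
                input, 0, input.length, compressed, 0, compressed.length);

            // Decompress back into a heap buffer of the original size.
            ZstdDecompressor decompressor = new ZstdDecompressor();
            byte[] restored = new byte[input.length];
            int restoredLength = decompressor.decompress(
                compressed, 0, compressedLength, restored, 0, restored.length);

            assert restoredLength == input.length;
        }
    }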

Zstd is providing 30% reduction in storage size and latencies as good as best_speed.

That's only part of the picture. Retrieval time per 10k docs is nearly five times slower than BEST_SPEED. So it's not without its tradeoffs.

sarthakaggarwal97 commented 9 months ago

@nknize @backslasht sharing some search numbers from my runs with NYC Taxis

Cluster Configuration:

  1. Server Setup
     a. One data node, r5.2xlarge, backed by an EBS GP3 volume of size 100gb.
     b. Three master nodes, r5.xlarge instance type.

  2. Benchmarking Setup
     The benchmark was run on a single node of type c5.4xlarge backed by an EBS GP3 volume of size 500gb.
     a. Number of clients: 16, bulk size: 1024.
     b. Workload: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis

  3. Index Setup
     a. Number of shards: 1
     b. Number of replicas: 0

  4. Benchmarks

    [image: benchmark results]

backslasht commented 9 months ago

@sarthakaggarwal97 - Can you add details about the cluster configuration for the run?

backslasht commented 9 months ago

Depends on how you define "risk". ldd errors due to libc issues are certainly a "risk". An alternative to the JNI implementation would be to use the pure Java solution. Then we could include it as a module without portability risks.

Fair point!

sarthakaggarwal97 commented 9 months ago

Can you add details about the cluster configuration for the run?

I've updated it in my previous comment alongside the numbers, thanks @backslasht

reta commented 9 months ago

@sarthakaggarwal97 sadly the latency charts do not reflect memory leaks (especially the native one)

An alternative to the jni implementation would be to use the pure java solution.

@nknize afaik the pure Java implementation exists and was evaluated, with terrible performance (compared to ZSTD-jni)

nknize commented 9 months ago

...sharing some search numbers from my runs with NYC Taxis

@sarthakaggarwal97 can you add two things to your nyc_taxi benchmark run:

  1. the term -> date_histogram -> top_hits agg in #1647
  2. "store": true to the vendor_name field

The reasoning is: regressions have historically occurred when decompressing stored fields and I don't see any stored fields in the nyc_taxis workload. We should make this part of the regular benchmark to prevent us from missing another regression.

...afaik the pure Java implementation exists and was evaluated with terrible performance (comparing to ZSTD-jni)

@reta Do you have that benchmark handy? I haven't dug around for it and the AirCompressor README touts "...typically 10-40% faster than the JNI wrapper for the native libraries." The only thing I quickly dug up was the comment justifying the purpose of the library, which aligns with our purpose of avoiding strict native dependencies.

@sarthakaggarwal97 It's also worth running luceneutil's StoredFieldsBenchmark with the ZStdCodec. We could do this in a separate repo and even pull in the AirCompressor pure java implementation for comparison.

reta commented 9 months ago

@reta Do you have that benchmark handy?

Hm ... I've definitely seen it somewhere; there are mentions here https://github.com/apache/lucene/issues/9784#issuecomment-1223937568, and here https://github.com/opensearch-project/OpenSearch/issues/3354 for LZ4 (not applicable to ZSTD).

There are benchmarks available in the repo, I will run them: https://github.com/airlift/aircompressor/blob/master/src/test/java/io/airlift/compress/benchmark/CompressionBenchmark.java

mgodwan commented 9 months ago

The reasoning is: regressions have historically occurred (https://github.com/opensearch-project/OpenSearch/issues/1647#issuecomment-986964918) when decompressing stored fields and I don't see any stored fields in the nyc_taxis workload. We should make this part of the regular benchmark to prevent us from missing another regression.

@nknize Shouldn't _source cover the stored fields case, as it is returned for the configured default/range queries in nyc_taxis?

reta commented 9 months ago

Sorry, this issue was closed automatically after https://github.com/opensearch-project/OpenSearch/pull/9431

nknize commented 9 months ago

Shouldn't _source cover the stored field as it is returned for the configured default/range queries in nyc taxis?

@mgodwan For queries, _source is only fetched for the first 10 results. TopHitsAggregator is particularly of interest (and adversarial) because it executes the fetchPhase for topDocs.size() when the aggregation is built. This is why the regression linked above wasn't caught before the release: the benchmark workloads didn't test for these potentially "evil" aggregations.

reta commented 9 months ago

@nknize I've run the aircompressor benchmarks locally on my machine (i7-10750H × 12, 64GB, Linux):

Here is the compression throughput:

  compress    airlift_zstd            canterbury/alice29.txt          56,746   101.3MB/s ±   796.4kB/s ( 0.77%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/asyoulik.txt         50,753    76.0MB/s ±  9315.6kB/s (11.97%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/cp.html               8,566    93.3MB/s ±    15.5MB/s (16.64%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/fields.c              3,427   102.8MB/s ±    25.6MB/s (24.94%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/grammar.lsp           1,327    85.0MB/s ±    12.5MB/s (14.73%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/kennedy.xls         112,690   184.5MB/s ±  5800.4kB/s ( 3.07%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/lcet10.txt          140,687   104.2MB/s ±  3184.5kB/s ( 2.99%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/plrabn12.txt        191,230    79.8MB/s ±  2585.3kB/s ( 3.16%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/ptt5                 54,581   262.0MB/s ±    20.9MB/s ( 7.97%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/sum                  13,408    99.3MB/s ±  4520.0kB/s ( 4.45%) (N = 30, α = 99.9%)
  compress    airlift_zstd            canterbury/xargs.1               1,838    92.5MB/s ±  6027.6kB/s ( 6.36%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/dickens              3,666,832    99.5MB/s ±  2951.9kB/s ( 2.90%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/mozilla             18,577,375   124.3MB/s ±  7819.5kB/s ( 6.14%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/mr                   3,560,793   115.2MB/s ±  4011.2kB/s ( 3.40%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/nci                  2,889,641   400.0MB/s ±    15.0MB/s ( 3.75%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/ooffice              3,147,867    92.3MB/s ±  8095.0kB/s ( 8.56%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/osdb                 3,515,524   131.7MB/s ±  4193.8kB/s ( 3.11%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/reymont              1,958,308   139.9MB/s ±  1997.9kB/s ( 1.39%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/samba                5,097,907   187.4MB/s ±  7248.4kB/s ( 3.78%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/sao                  5,591,044    62.2MB/s ±    10.1MB/s (16.25%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/webster             12,108,729   105.5MB/s ±  9211.8kB/s ( 8.53%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/x-ray                6,229,560    57.9MB/s ±  7637.8kB/s (12.87%) (N = 30, α = 99.9%)
  compress    airlift_zstd            silesia/xml                    641,486   240.9MB/s ±    21.1MB/s ( 8.74%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/bib                     37,329    98.0MB/s ± 10154.2kB/s (10.12%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/book1                  307,645    85.5MB/s ±  6029.9kB/s ( 6.89%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/book2                  205,814   106.1MB/s ±  5918.6kB/s ( 5.45%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/geo                     69,249    61.4MB/s ±  1123.1kB/s ( 1.79%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/news                   139,728    89.0MB/s ±  8312.1kB/s ( 9.12%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/obj1                    10,786    98.6MB/s ±  4930.9kB/s ( 4.88%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/obj2                    83,275   116.6MB/s ±  3592.6kB/s ( 3.01%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/paper1                  19,694   106.4MB/s ±  3150.5kB/s ( 2.89%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/paper2                  30,900    88.2MB/s ±  8110.2kB/s ( 8.98%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/paper3                  18,932    90.5MB/s ±  5466.5kB/s ( 5.90%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/paper4                   5,760    96.8MB/s ±  1023.9kB/s ( 1.03%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/paper5                   5,246    98.5MB/s ±  3130.5kB/s ( 3.10%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/paper6                  14,268   105.9MB/s ±  2828.0kB/s ( 2.61%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/pic                     54,581   315.1MB/s ±  3718.4kB/s ( 1.15%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/progc                   14,421    94.3MB/s ±  8889.6kB/s ( 9.20%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/progl                   17,877   147.8MB/s ±    10.5MB/s ( 7.12%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/progp                   12,362   160.3MB/s ±  2177.4kB/s ( 1.33%) (N = 30, α = 99.9%)
  compress    airlift_zstd            calgary/trans                   20,790   186.2MB/s ±  3830.5kB/s ( 2.01%) (N = 30, α = 99.9%)
  compress    airlift_zstd            artificial/a.txt                    14  1050.6kB/s ±    38.1kB/s ( 3.63%) (N = 30, α = 99.9%)
  compress    airlift_zstd            artificial/aaa.txt                  26  1813.0MB/s ±    88.1MB/s ( 4.86%) (N = 30, α = 99.9%)
  compress    airlift_zstd            artificial/alphabet.txt             50  1797.3MB/s ±    58.5MB/s ( 3.25%) (N = 30, α = 99.9%)
  compress    airlift_zstd            artificial/random.txt           75,421   317.3MB/s ±  3277.7kB/s ( 1.01%) (N = 30, α = 99.9%)
  compress    airlift_zstd            artificial/uniform_ascii.bin        8,842   226.9MB/s ±    20.2MB/s ( 8.91%) (N = 30, α = 99.9%)
  compress    airlift_zstd            large/bible.txt              1,183,732   129.2MB/s ±  6002.6kB/s ( 4.54%) (N = 30, α = 99.9%)
  compress    airlift_zstd            large/E.coli                 1,413,593   142.7MB/s ±  2199.7kB/s ( 1.51%) (N = 30, α = 99.9%)
  compress    airlift_zstd            large/world192.txt             659,802   142.2MB/s ±  3032.7kB/s ( 2.08%) (N = 30, α = 99.9%)
  compress    airlift_zstd            geo.protodata                   14,096   344.6MB/s ±    14.6MB/s ( 4.24%) (N = 30, α = 99.9%)
  compress    airlift_zstd            house.jpg                      126,974   259.0MB/s ±  9267.6kB/s ( 3.49%) (N = 30, α = 99.9%)
  compress    airlift_zstd            html                            14,928   276.9MB/s ±  8902.3kB/s ( 3.14%) (N = 30, α = 99.9%)
  compress    airlift_zstd            kppkn.gtb                       41,647   152.9MB/s ±  8482.8kB/s ( 5.42%) (N = 30, α = 99.9%)
  compress    airlift_zstd            mapreduce-osdi-1.pdf            75,523   351.6MB/s ±  5871.6kB/s ( 1.63%) (N = 30, α = 99.9%)
  compress    airlift_zstd            urls.10K                       185,565   146.7MB/s ±  1657.5kB/s ( 1.10%) (N = 30, α = 99.9%)

vs

  compress    zstd_jni                canterbury/alice29.txt          56,271   160.6MB/s ±  3078.0kB/s ( 1.87%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/asyoulik.txt         50,363   154.6MB/s ±  1365.1kB/s ( 0.86%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/cp.html               8,465   232.3MB/s ±  8557.1kB/s ( 3.60%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/fields.c              3,379   280.2MB/s ±    10.6MB/s ( 3.79%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/grammar.lsp           1,290   264.9MB/s ±  4227.8kB/s ( 1.56%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/kennedy.xls         111,742   415.8MB/s ±  6385.0kB/s ( 1.50%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/lcet10.txt          139,324   166.4MB/s ±  6885.7kB/s ( 4.04%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/plrabn12.txt        190,276   115.1MB/s ±    15.7MB/s (13.62%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/ptt5                 54,446   521.6MB/s ±    44.8MB/s ( 8.59%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/sum                  13,375   228.5MB/s ±  6798.6kB/s ( 2.91%) (N = 30, α = 99.9%)
  compress    zstd_jni                canterbury/xargs.1               1,800   231.7MB/s ±  4311.6kB/s ( 1.82%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/dickens              3,631,607   164.6MB/s ±  1239.1kB/s ( 0.74%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/mozilla             18,487,478   254.5MB/s ±  8180.9kB/s ( 3.14%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/mr                   3,547,800   194.6MB/s ±  2049.6kB/s ( 1.03%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/nci                  2,843,995   690.5MB/s ±  7290.3kB/s ( 1.03%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/ooffice              3,143,604   153.3MB/s ±  8214.5kB/s ( 5.23%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/osdb                 3,506,444   221.6MB/s ±  7585.6kB/s ( 3.34%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/reymont              1,942,004   200.3MB/s ±  9344.7kB/s ( 4.56%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/samba                4,976,446   309.1MB/s ±    30.8MB/s ( 9.97%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/sao                  5,551,154   123.5MB/s ±    14.9MB/s (12.10%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/webster             11,981,183   189.5MB/s ±  7174.5kB/s ( 3.70%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/x-ray                6,085,291   124.5MB/s ±  2117.3kB/s ( 1.66%) (N = 30, α = 99.9%)
  compress    zstd_jni                silesia/xml                    639,134   502.7MB/s ±  9095.5kB/s ( 1.77%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/bib                     37,046   192.5MB/s ±  4947.9kB/s ( 2.51%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/book1                  305,543   144.0MB/s ±  3083.5kB/s ( 2.09%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/book2                  203,941   178.5MB/s ±  4210.9kB/s ( 2.30%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/geo                     69,219   114.7MB/s ±  4773.0kB/s ( 4.06%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/news                   138,021   166.0MB/s ±  5224.0kB/s ( 3.07%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/obj1                    10,772   198.7MB/s ±  7577.9kB/s ( 3.72%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/obj2                    83,359   182.4MB/s ±  5750.3kB/s ( 3.08%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/paper1                  19,511   176.0MB/s ±  5514.1kB/s ( 3.06%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/paper2                  30,618   168.0MB/s ±  3833.8kB/s ( 2.23%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/paper3                  18,736   151.9MB/s ±  6142.8kB/s ( 3.95%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/paper4                   5,679   184.6MB/s ±  4219.1kB/s ( 2.23%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/paper5                   5,152   195.2MB/s ±  2426.9kB/s ( 1.21%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/paper6                  14,029   185.4MB/s ±  5817.8kB/s ( 3.07%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/pic                     54,446   576.1MB/s ±  7905.9kB/s ( 1.34%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/progc                   14,167   194.4MB/s ±    11.7MB/s ( 6.02%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/progl                   17,581   271.3MB/s ±  8018.9kB/s ( 2.89%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/progp                   12,149   289.7MB/s ±  3417.5kB/s ( 1.15%) (N = 30, α = 99.9%)
  compress    zstd_jni                calgary/trans                   20,595   331.0MB/s ±  3422.0kB/s ( 1.01%) (N = 30, α = 99.9%)
  compress    zstd_jni                artificial/a.txt                    10  1307.6kB/s ±    38.1kB/s ( 2.91%) (N = 30, α = 99.9%)
  compress    zstd_jni                artificial/aaa.txt                  22  4320.9MB/s ±   389.5MB/s ( 9.01%) (N = 30, α = 99.9%)
  compress    zstd_jni                artificial/alphabet.txt             46  5031.7MB/s ±   358.4MB/s ( 7.12%) (N = 30, α = 99.9%)
  compress    zstd_jni                artificial/random.txt           75,048   979.2MB/s ±    40.3MB/s ( 4.11%) (N = 30, α = 99.9%)
  compress    zstd_jni                artificial/uniform_ascii.bin        8,838   614.6MB/s ±  4906.5kB/s ( 0.78%) (N = 30, α = 99.9%)
  compress    zstd_jni                large/bible.txt              1,173,315   214.6MB/s ±  2139.8kB/s ( 0.97%) (N = 30, α = 99.9%)
  compress    zstd_jni                large/E.coli                 1,392,186   208.4MB/s ±  4167.2kB/s ( 1.95%) (N = 30, α = 99.9%)
  compress    zstd_jni                large/world192.txt             650,815   217.8MB/s ±  6117.0kB/s ( 2.74%) (N = 30, α = 99.9%)
  compress    zstd_jni                geo.protodata                   14,079   682.8MB/s ±    29.8MB/s ( 4.37%) (N = 30, α = 99.9%)
  compress    zstd_jni                house.jpg                      126,970   881.7MB/s ±    28.2MB/s ( 3.20%) (N = 30, α = 99.9%)
  compress    zstd_jni                html                            14,798   512.1MB/s ±    16.3MB/s ( 3.18%) (N = 30, α = 99.9%)
  compress    zstd_jni                kppkn.gtb                       40,850   258.4MB/s ±    15.3MB/s ( 5.94%) (N = 30, α = 99.9%)
  compress    zstd_jni                mapreduce-osdi-1.pdf            75,594   515.8MB/s ±    35.0MB/s ( 6.78%) (N = 30, α = 99.9%)
  compress    zstd_jni                urls.10K                       183,492   235.6MB/s ±  5904.0kB/s ( 2.45%) (N = 30, α = 99.9%)

Same for decompression:

  decompress  airlift_zstd            canterbury/alice29.txt          56,746   456.4MB/s ±    21.9MB/s ( 4.80%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/asyoulik.txt         50,753   490.3MB/s ±    21.7MB/s ( 4.43%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/cp.html               8,566   634.3MB/s ±  4931.3kB/s ( 0.76%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/fields.c              3,427   551.0MB/s ±    13.6MB/s ( 2.46%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/grammar.lsp           1,327   405.1MB/s ±    11.3MB/s ( 2.79%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/kennedy.xls         112,690   663.5MB/s ±  4931.5kB/s ( 0.73%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/lcet10.txt          140,687   559.4MB/s ±    24.5MB/s ( 4.38%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/plrabn12.txt        191,230   481.3MB/s ±    15.7MB/s ( 3.27%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/ptt5                 54,581  1135.9MB/s ±    16.2MB/s ( 1.43%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/sum                  13,408   590.6MB/s ±    21.0MB/s ( 3.56%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            canterbury/xargs.1               1,838   372.8MB/s ±    13.3MB/s ( 3.56%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/dickens              3,666,832   503.4MB/s ±    12.1MB/s ( 2.40%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/mozilla             18,577,375   562.5MB/s ±    18.5MB/s ( 3.29%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/mr                   3,560,793   525.6MB/s ±    29.3MB/s ( 5.58%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/nci                  2,889,641  1141.0MB/s ±    42.9MB/s ( 3.76%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/ooffice              3,147,867   404.2MB/s ±    22.0MB/s ( 5.45%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/osdb                 3,515,524   659.0MB/s ±    19.0MB/s ( 2.88%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/reymont              1,958,308   593.1MB/s ±    26.1MB/s ( 4.40%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/samba                5,097,907   755.9MB/s ±    25.1MB/s ( 3.32%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/sao                  5,591,044   430.6MB/s ±    17.3MB/s ( 4.02%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/webster             12,108,729   564.4MB/s ±    30.1MB/s ( 5.33%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/x-ray                6,229,560   374.6MB/s ±    15.2MB/s ( 4.06%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            silesia/xml                    641,486  1064.9MB/s ±    27.1MB/s ( 2.55%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/bib                     37,329   606.5MB/s ±    12.7MB/s ( 2.09%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/book1                  307,645   468.3MB/s ±    15.2MB/s ( 3.25%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/book2                  205,814   554.2MB/s ±    20.1MB/s ( 3.62%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/geo                     69,249   420.8MB/s ±    24.7MB/s ( 5.88%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/news                   139,728   580.7MB/s ±  4045.2kB/s ( 0.68%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/obj1                    10,786   512.7MB/s ±    15.8MB/s ( 3.08%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/obj2                    83,275   563.4MB/s ±  1953.8kB/s ( 0.34%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/paper1                  19,694   573.3MB/s ±  3471.3kB/s ( 0.59%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/paper2                  30,900   534.0MB/s ±    21.7MB/s ( 4.07%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/paper3                  18,932   515.5MB/s ±    11.8MB/s ( 2.30%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/paper4                   5,760   437.3MB/s ±  3951.0kB/s ( 0.88%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/paper5                   5,246   433.8MB/s ±  7494.2kB/s ( 1.69%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/paper6                  14,268   534.5MB/s ±    19.4MB/s ( 3.63%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/pic                     54,581  1140.7MB/s ±    14.2MB/s ( 1.24%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/progc                   14,421   584.6MB/s ±  5636.1kB/s ( 0.94%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/progl                   17,877   693.9MB/s ±    26.3MB/s ( 3.79%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/progp                   12,362   762.3MB/s ±    13.3MB/s ( 1.74%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            calgary/trans                   20,790   824.6MB/s ±  9032.4kB/s ( 1.07%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            artificial/a.txt                    14    42.3MB/s ±  1725.0kB/s ( 3.99%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            artificial/aaa.txt                  26  4627.3MB/s ±    69.2MB/s ( 1.50%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            artificial/alphabet.txt             50  4338.0MB/s ±    15.6MB/s ( 0.36%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            artificial/random.txt           75,421   494.6MB/s ±    21.7MB/s ( 4.38%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            artificial/uniform_ascii.bin        8,842   504.1MB/s ±  5961.5kB/s ( 1.15%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            large/bible.txt              1,183,732   620.7MB/s ±  6002.5kB/s ( 0.94%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            large/E.coli                 1,413,593   525.6MB/s ±    15.6MB/s ( 2.96%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            large/world192.txt             659,802   677.7MB/s ±    12.6MB/s ( 1.86%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            geo.protodata                   14,096  1342.8MB/s ±  8880.6kB/s ( 0.65%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            house.jpg                      126,974  9324.2MB/s ±   482.0MB/s ( 5.17%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            html                            14,928  1163.7MB/s ±    90.9MB/s ( 7.81%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            kppkn.gtb                       41,647   536.1MB/s ±    45.3MB/s ( 8.45%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            mapreduce-osdi-1.pdf            75,523   988.7MB/s ±   164.2MB/s (16.61%) (N = 30, α = 99.9%)
  decompress  airlift_zstd            urls.10K                       185,565   538.9MB/s ±   115.6MB/s (21.45%) (N = 30, α = 99.9%)

vs

  decompress  zstd_jni                canterbury/alice29.txt          56,271   855.6MB/s ±    48.4MB/s ( 5.66%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/asyoulik.txt         50,363  1019.2MB/s ±    32.6MB/s ( 3.20%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/cp.html               8,465  1122.5MB/s ±    46.1MB/s ( 4.11%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/fields.c              3,379   958.6MB/s ±    23.5MB/s ( 2.45%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/grammar.lsp           1,290   671.6MB/s ±    14.1MB/s ( 2.10%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/kennedy.xls         111,742  1150.7MB/s ±    46.3MB/s ( 4.02%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/lcet10.txt          139,324  1120.4MB/s ±    32.3MB/s ( 2.88%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/plrabn12.txt        190,276   930.2MB/s ±    31.1MB/s ( 3.34%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/ptt5                 54,446  2138.5MB/s ±    52.5MB/s ( 2.45%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/sum                  13,375  1273.0MB/s ±  9259.5kB/s ( 0.71%) (N = 30, α = 99.9%)
  decompress  zstd_jni                canterbury/xargs.1               1,800   682.0MB/s ± 10101.9kB/s ( 1.45%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/dickens              3,631,607   882.2MB/s ±    41.8MB/s ( 4.74%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/mozilla             18,487,478  1220.1MB/s ±    46.6MB/s ( 3.82%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/mr                   3,547,800  1101.9MB/s ±    26.3MB/s ( 2.38%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/nci                  2,843,995  1832.1MB/s ±    98.4MB/s ( 5.37%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/ooffice              3,143,604   905.3MB/s ±    44.9MB/s ( 4.96%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/osdb                 3,506,444  1351.4MB/s ±    93.7MB/s ( 6.93%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/reymont              1,942,004  1040.3MB/s ±    49.9MB/s ( 4.80%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/samba                4,976,446  1717.6MB/s ±    15.4MB/s ( 0.90%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/sao                  5,551,154  1019.3MB/s ±  3488.9kB/s ( 0.33%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/webster             11,981,183  1034.2MB/s ±    39.7MB/s ( 3.83%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/x-ray                6,085,291   904.8MB/s ±  8415.4kB/s ( 0.91%) (N = 30, α = 99.9%)
  decompress  zstd_jni                silesia/xml                    639,134  2177.3MB/s ±    11.8MB/s ( 0.54%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/bib                     37,046  1164.0MB/s ±    41.6MB/s ( 3.57%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/book1                  305,543   990.4MB/s ±  9123.6kB/s ( 0.90%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/book2                  203,941  1133.0MB/s ±  5780.3kB/s ( 0.50%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/geo                     69,219  1149.4MB/s ±    32.7MB/s ( 2.85%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/news                   138,021  1220.7MB/s ±  7415.9kB/s ( 0.59%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/obj1                    10,772  1274.9MB/s ±  4696.8kB/s ( 0.36%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/obj2                    83,359  1008.0MB/s ±    39.7MB/s ( 3.94%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/paper1                  19,511  1042.7MB/s ±    66.6MB/s ( 6.39%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/paper2                  30,618   910.4MB/s ±   124.5MB/s (13.67%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/paper3                  18,736   912.9MB/s ±   108.4MB/s (11.87%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/paper4                   5,679   728.4MB/s ±    70.5MB/s ( 9.67%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/paper5                   5,152   758.3MB/s ±    65.2MB/s ( 8.60%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/paper6                  14,029  1112.4MB/s ±  7842.4kB/s ( 0.69%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/pic                     54,446  1922.9MB/s ±   206.2MB/s (10.72%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/progc                   14,167   664.5MB/s ±    71.1MB/s (10.70%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/progl                   17,581   987.9MB/s ±   124.1MB/s (12.56%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/progp                   12,149  1029.5MB/s ±   129.4MB/s (12.57%) (N = 30, α = 99.9%)
  decompress  zstd_jni                calgary/trans                   20,595  1297.8MB/s ±   145.4MB/s (11.20%) (N = 30, α = 99.9%)
  decompress  zstd_jni                artificial/a.txt                    10  1952.4kB/s ±   228.0kB/s (11.68%) (N = 30, α = 99.9%)
  decompress  zstd_jni                artificial/aaa.txt                  22  6132.1MB/s ±   174.4MB/s ( 2.84%) (N = 30, α = 99.9%)
  decompress  zstd_jni                artificial/alphabet.txt             46  3642.8MB/s ±   113.3MB/s ( 3.11%) (N = 30, α = 99.9%)
  decompress  zstd_jni                artificial/random.txt           75,048  1281.0MB/s ±    86.5MB/s ( 6.76%) (N = 30, α = 99.9%)
  decompress  zstd_jni                artificial/uniform_ascii.bin        8,838  1206.7MB/s ±    45.6MB/s ( 3.78%) (N = 30, α = 99.9%)
  decompress  zstd_jni                large/bible.txt              1,173,315  1041.1MB/s ±    49.5MB/s ( 4.75%) (N = 30, α = 99.9%)
  decompress  zstd_jni                large/E.coli                 1,392,186   995.2MB/s ±    37.7MB/s ( 3.79%) (N = 30, α = 99.9%)
  decompress  zstd_jni                large/world192.txt             650,815  1241.6MB/s ±    56.8MB/s ( 4.57%) (N = 30, α = 99.9%)
  decompress  zstd_jni                geo.protodata                   14,079  2804.8MB/s ±   107.9MB/s ( 3.85%) (N = 30, α = 99.9%)
  decompress  zstd_jni                house.jpg                      126,970    30.9GB/s ±  1466.7MB/s ( 4.64%) (N = 30, α = 99.9%)
  decompress  zstd_jni                html                            14,798  2444.8MB/s ±    78.2MB/s ( 3.20%) (N = 30, α = 99.9%)
  decompress  zstd_jni                kppkn.gtb                       40,850   948.1MB/s ±    52.7MB/s ( 5.56%) (N = 30, α = 99.9%)
  decompress  zstd_jni                mapreduce-osdi-1.pdf            75,594  6662.2MB/s ±   108.4MB/s ( 1.63%) (N = 30, α = 99.9%)
  decompress  zstd_jni                urls.10K                       183,492  1503.3MB/s ±    53.2MB/s ( 3.54%) (N = 30, α = 99.9%)

Per these numbers, aircompressor looks significantly slower.

nknize commented 9 months ago

@reta what is the fourth column?

reta commented 9 months ago

@reta what is the fourth column?

You mean the percentage (like ( 3.54%))? That's stats.getMeanErrorAt(0.999) * 100 / stats.getMean(); the whole output is produced by this:

            // Compressed output size in bytes for this algorithm/input pair.
            int compressSize = compressSize(algorithm, name);
            // Columns: benchmark label, algorithm, input name, compressed size,
            // mean throughput ± mean error at 99.9% confidence, relative error, sample count.
            System.out.printf("  %-10s  %-22s  %-25s  %,11d  %10s ± %11s (%5.2f%%) (N = %d, \u03B1 = 99.9%%)\n",
                    result.getPrimaryResult().getLabel(),
                    algorithm,
                    name,
                    compressSize,
                    Util.toHumanReadableSpeed((long) stats.getMean()),
                    Util.toHumanReadableSpeed((long) stats.getMeanErrorAt(0.999)),
                    stats.getMeanErrorAt(0.999) * 100 / stats.getMean(),
                    stats.getN());

nknize commented 9 months ago

Ah, it's compressSize. I dropped an issue on the Aircompressor repo about these numbers. Hopefully we can reconcile this and see if there are any corners missed that might squeeze out better speed. I have to imagine SIMD could help? Maybe better parallelize the compression through ByteVector usage? I'm naively optimistic about that.

sarthakaggarwal97 commented 9 months ago

@nknize @reta @backslasht @shwetathareja

With a cluster configuration similar to the one mentioned above, alongside Nick's recommendations:

  1. the term -> date_histogram -> top_hits agg in https://github.com/opensearch-project/OpenSearch/issues/1647#issue-1069897761
  2. "store": true to the vendor_name field

These were the search numbers I got from the benchmarks with the NYC Taxis dataset:

Service time for the top_agg_hits operation, in ms:

    Codec          50th percentile       90th percentile       100th percentile
    LZ4            1256.558088120073     1278.5834005102515    1801.7356134951115
    ZSTD NO DICT   2341.6134444996715    2377.105840295553     2634.960106573999
    ZSTD           3103.5107178613544    3153.532434348017     3601.2025149539113
    ZLIB           3411.0017553903162    3443.0927176959813    3785.180849954486

In the profiles, around 4% of the CPU was consumed by decompression with LZ4 (Best Speed), 11% with ZSTD_NO_DICT, 16% with ZSTD, and 29% with ZLIB (Best Compression).

backslasht commented 9 months ago

Thanks @sarthakaggarwal97. Can you please create a separate issue to track the performance of Zstd decompress?

backslasht commented 9 months ago

@nknize - Given that we have decided to move the zstd codec to a plugin in the next version (2.10), and we already have fixes for the memory leak (opensearch-project/OpenSearch#9403) and have upgraded zstd to 1.5.5-5 (opensearch-project/OpenSearch#9431), do you still think zstd needs to be marked as experimental?

sarthakaggarwal97 commented 9 months ago

@backslasht added opensearch-project/OpenSearch#9456 to track the performance of zstd decompression

nknize commented 9 months ago

...do you still think that zstd needs to be marked as experimental?

Yes, even moving to a plugin, I think we should start with an experimental flag. It's easy to remove the experimental label; it's hard to revert (as this situation has shown). I think we should keep it experimental until either 1. there's more evidence the JNI is reliable and won't melt a node (this includes determining whether we should add an expert setting to select direct or heap memory), or 2. the AirCompressor pure Java implementation proves to be reasonably competitive.

Re 2: I think we could potentially contribute to that project and explore SIMD optimizations. This would have two major benefits: 1. we get a pure Java zstd for OpenSearch that could move back to a module (since performance looks promising), and 2. we could contribute this upstream to Lucene and not have to carry a custom ZStdCodec implementation in OpenSearch.

With all of this uncertainty, we absolutely should keep this experimental for a while.
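
For reference, the expert setting mentioned above might look like the following sketch; Setting.boolSetting is the existing core settings API, but the key name, default, and scope here are assumptions:

    // Hypothetical opt-in for direct (off-heap) buffers in the zstd codec;
    // defaults to heap buffers, which the JVM's own accounting can track.
    public static final Setting<Boolean> ZSTD_USE_DIRECT_MEMORY = Setting.boolSetting(
        "index.codec.zstd.use_direct_memory",
        false,
        Setting.Property.IndexScope,
        Setting.Property.Final
    );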

nknize commented 9 months ago

Updating this issue from some recent Slack discussion. Per the issue description, we do not plan to demote this to a plugin in a 2.9.1 patch release. We need to feature flag the zstd and zstd_no_dict codec options (default disabled) in a 2.9.1 patch, annotate the options with @ExperimentalApi, and continue the "demotion to plugin" discussion for the 2.10.0 minor release. This keeps the 2.9.1 change minimal while protecting users from inadvertently using zstd (explicit opt-in is required via the feature flag) and buys us runway to decide whether this remains a module or moves to a plugin in a followup minor release. I'm happy to open a PR to feature flag it (per our typical RTC PR approval process)... but I don't want to take that contribution opportunity from anyone else, so here's a call to the community to pick up that PR and earn some merit!

bbarani commented 9 months ago

Release issue for 2.9.1 version - https://github.com/opensearch-project/opensearch-build/issues/3820 CC: @peterzhuamazon

dblock commented 9 months ago

Some thoughts on native dependencies.

First, AFAIK the rules discussed above are not spelled out anywhere. When zstd came along, we said yes to the native dependencies in those PRs. @nknize, since you have strong opinions about the topic, maybe add them to the developer guide so we can all have some kind of agreement on what is ok and what is not ok? I don't see why a native plugin would be ok but a native module would not, for example. (I agree that there's portability risk; however, I don't know of any specific examples where it's an actual problem.)

I'd add that since Lucene is a hard "no" on native components, accepting native code in OpenSearch in general creates a lot of opportunities for including code, such as this, that has no other place to go in the ecosystem but can improve performance, which means more happy users.

Then, we've had native libraries in plugins that are included by default in the OpenSearch distribution since the beginning (k-nn is the obvious example). And while we distribute the -min version, the product that most users download is the one with those plugins. It's also the product being run through batteries of tests. Finally, there are plenty of discussions about merging some of those plugins, including k-nn, into core, so we should have guidelines on native libraries before we have those discussions.

nknize commented 9 months ago

...since you have strong opinions about the topic, maybe add them to the developer guide...

+1. I agree we should probably codify this to prevent future traps.

...I agree that there's portability risk, however I don't know any specific examples where it's an actual problem

Two examples are provided in the description where linking native code as an OpenSearch module (modules are installed by default) would've broken the min distribution. This may not always be the case, though: Project Panama's work on HotSpot's interaction with native libraries is coming along and may provide enough improvements in JDK 22+ to reconsider this stance. In the meantime we should bias toward risk aversion by using plugins.

...in general accepting native code in OpenSearch creates a lot of opportunities

Absolutely! Having them as optional plugins prevents unexpected ldd or runtime behaviors beyond just linking failures (e.g., performance regressions due to native .so differences across platforms; the same portability reasons are why Elastic still uses their own JNA).

...we've had native libraries in plugins that are included by default in the OpenSearch distribution since the beginning (k-nn is the obvious example).

These OpenSearch plugins are also still optional plugins (not installed by default), not modules.

...It's also the product being run through batteries of tests.

Progress has been made, but there are still many gaps. E.g., we were unaware of incompatibilities installing the KNN engine with the CCR engine until I saw it during an unrelated issue, and even then we only hacked a patch by creating a CodecService extension point. You still can't create more than one engine, and this doesn't seem to be "battle tested" anywhere. There's also a large handful of plugins still not using QA testing for mixed-cluster or comprehensive bwc testing. There are a lot of opportunities to improve here, which is why we need to bias toward plugins and not modules until we get broader integration and cluster test coverage.

...discussions about merging some of those plugins, including k-nn, into core,

Those discussions consist of leveraging Lucene's pure Java HNSW implementation as the default in core and keeping the FAISS and nmslib native builds in the bundled plugin as an optional "performance boost". Tradeoffs of using native code should not be taken lightly.

wbeckler commented 9 months ago

Maybe we can consider a roll-forward option that is lower risk in some ways and contains fewer changes in the patch:

  1. Bump the zstd code to address the memory leak
  2. Do nothing else

The effect of this approach is that a user who is benefitting from zstd will not be ambushed by a disable-via-feature-flag in a patch.

It is also a tie-ourselves-to-the-mast approach to introducing a superior compression algorithm. We have testing in place to ensure that it works, and while there were holes this time around, that doesn't mean we are incapable of introducing this feature, which is around a 15% throughput improvement for some workloads. It may be work to maintain native code across platforms, but it's work worth doing for anyone who benefits from that kind of speedup.

backslasht commented 9 months ago

Thanks @wbeckler for bringing up very valid points. I like the idea of not moving ZSTD behind the experimental flag, but instead releasing a patch version with the fixes in.

  1. For ZSTD as a codec, exhaustive testing was already done as part of opensearch-project/OpenSearch#7908.
  2. After the fix in opensearch-project/OpenSearch#9403, there is clear evidence that there is no further memory leak; details in https://github.com/opensearch-project/OpenSearch/pull/9403#issuecomment-1684037170

Considering the above points, I don't see a strong reason to move it to experimental as part of the patch release.

dblock commented 9 months ago

Per semver, AFAIK we cannot add or remove a feature in a patch release, as @bbarani points out; we can only make security fixes. It's important we earn users' trust by applying semver consistently. Occasionally we do ship critical bug fixes in a patch release, so the memory leak fix may qualify. I agree with @backslasht and @wbeckler and propose we "do nothing else" and close this issue.

If we strongly believe that we should have never shipped a feature that relies on a native library (I don't agree, but open to discussing and changing my mind in a separate PR that spells it out), we can make any breaking change we want in 3.0.

andrross commented 9 months ago

I agree that the reliance on a native lib is not a good reason to revert the GA status of the feature. If we get consensus that having a native lib in a module is bad, then I think we should live with it until the 3.0 release for the reasons cited above.

To revert the GA status of the feature, I think the criterion needs to be that we think the current zstd feature is unsuitable for production use even after the two fixes (#9403 and opensearch-project/OpenSearch#9431) are released in a patch. I'm not convinced that is the case, so I'm currently in favor of not moving ZSTD behind an experimental flag in the patch.

shwetathareja commented 9 months ago

+1. We shouldn't move to experimental unless we are aware of known issues. It should be good for production usage with the zstd version upgrade and the memory leak fix. Moving it to a plugin in 2.x would break existing users; it should be done in 3.0.

backslasht commented 9 months ago

From the conversations above, I see there is consensus to keep the zstd codec in the core for 2.x versions and revisit via a separate issue for 3.0. Please call out here with valid justification if you think otherwise.

nknize commented 9 months ago

...here is consensus to keep the zstd codec in the core for 2.x versions and revisit via a separate issue for 3.0.

Strong -1 (veto).

This change is codifying a Lucene codec in a minor release that Lucene themselves have rejected upstream for valid technical reasons I won't copy/paste here. OpenSearch should similarly hold a higher quality bar and do the opposite of what is being suggested: discuss codifying this native codec in 3.0 and make it optional in 2.x.

From a pure technical perspective, we have not quantified the native memory impact this codec carries, nor do we have adequate circuit breakers to trip and prevent the OS from killing the node if/when the machine's direct memory is exhausted due to off-heap buffer allocation. Additionally, this was an accidental premature promotion of a feature to GA in a minor. It does not fall within the definition of semver; in fact the opposite: it introduces an untested feature that affects on-disk representation and cannot be backed out in a rolling upgrade. It should be treated similar to a CVE: we should hot patch in 2.9.1 and reassess for 3.0.

Choosing to keep this as a top level feature is not only reckless, it sets a precedent that top level mandatory core features can introduce native library dependencies with little to no performance analysis or bar raising. We owe our OpenSearch users better than that.

I'm not sure many, if any, non-AWSers are tracking here (likely because that top of funnel is still not there, for reasons like this), but I'd like to hear from them on this issue. It seems I'm in the minority of the Amazonians when it comes to the problems introduced by shipping a core with a mandatory native library integration. Interestingly enough, I'm also one of the only ones who personally went through this on Elasticsearch, and it's not something I want to see happen to OpenSearch. Lucene doesn't prohibit native library linking because they feel like it and want to make things more difficult; there is a long-tail technical history of direct memory allocation killing nodes, as I've described here.

Update:

...that doesn't mean we are incapable of introducing this feature, which is around a 15% throughput improvement for some workloads. It may be work to maintain native code across platforms, but it's work worth doing for anyone who benefits from that kind of speedup.

The intentions here are good; however, I'll point out that moving to an optional plugin is still releasing the feature... but with the added benefit that users HAVE to opt in. What happens if a downstream takes a hard dependency on ZStd because they see it's a module installed by default, and then a libc difference breaks their plugin and they don't know why and can't roll back? Moving it to an optional plugin doesn't stop them from taking a dependency, but it will fail if they don't explicitly install the plugin. That's a good guard rail to have for something as trappy as this.
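
As a side note on the observability gap: the JVM only tracks NIO "direct" and "mapped" buffer pools, so even a standard probe like the sketch below (plain java.lang.management APIs, runnable as-is) sees nothing of the memory a JNI library mallocs natively, which is exactly why a node can look healthy right up until the OS kills it:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class DirectMemoryProbe {
        public static void main(String[] args) {
            // Reports only the JVM-tracked "direct" and "mapped" buffer pools;
            // allocations made inside native code (e.g., by zstd-jni) bypass
            // this accounting entirely, so no heap stat or breaker sees them.
            for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%s: used=%,d bytes, capacity=%,d bytes%n",
                        pool.getName(), pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }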

reta commented 9 months ago

Additionally, this was an accidental premature promotion of a feature to GA in a minor. It does not fall within the definition of semver; in fact the opposite: it introduces an untested feature that affects on-disk representation and cannot be backed out in a rolling upgrade.

As a non-AWS guy, that to me is the core of the problem. We should not have promoted this feature from sandbox to GA (with zero community feedback). I do share the guilt and don't feel good that we have to take a few steps back, but we should fix the hiccup while it is not too late.

dblock commented 9 months ago

Strong -1 (veto).

@nknize We've never had a veto-type discussion in this project since the fork. So I am not sure how we move forward.

The rollback problem is a real one; open an issue (also, how does it work for other new codecs?). On breaking semver, two wrongs don't make a right, so I don't agree it's worth doing another wrong for an issue that is not a security problem (the comparison to a CVE is baseless).

We should not have promoted this feature from sandbox to GA (with zero community feedback).

@reta Maybe, but we did. We can call it a mistake, but it's too late, because now we would have to break semver. I don't understand how we can claim this doesn't break semver when a user who enabled the feature cannot upgrade without breaking their system.

Finally, I don't agree that arguments in https://issues.apache.org/jira/browse/LUCENE-8739 apply here at all, because the majority of users download the complete distribution of OpenSearch that includes plugins, and many of those plugins include native code (e.g. k-nn). AFAIK that has had no serious consequences. I think allowing native code in core is an advantage, and the core engine is nothing special vs. these "default" plugins.

nknize commented 9 months ago

We've never had a veto-type discussion in this project since the fork. So I am not sure how we move forward.

We use them on nominations. We also use them in our Review Then Commit policy on PRs, where we require at least one approval and no requested changes / unresolved conversations (blocking vetoes). The ASF has a good definition of Review Then Commit that I think we should lean into.

The rollback problem is a real one, open an issue

This is the rollback issue. Do you mean open a PR?

..also how does it work for other new codecs?

Depends on if there is a native dependency.

On breaking semver, two wrongs don't make a right, so I don't agree it's worth doing another wrong...

Guard rails to protect a user's node from dying don't seem like a wrong? Maybe that's just me. I prefer to bias toward opt-in if there's a chance a node could be killed by unbounded native mallocs.

...it's too late, because now we would have to break semver.

Semver (which, as discussed many times before, we don't even enforce) isn't a hill worth dying on when it comes to the probability of real cluster crashes... we need to consider what our priorities are.

...that includes plugins, and many of those plugins include native code

Again... even those plugins... are not installed by default, so the guard rails are up and OpenSearch users have to explicitly opt in.
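
(If the codec did move to an optional plugin, the opt-in would look something like the following sketch; the plugin name/URL is hypothetical, since no such artifact exists yet.)

# hypothetical: install the codec as an optional plugin, then verify
bin/opensearch-plugin install https://example.com/zstd-codec-plugin.zip
bin/opensearch-plugin list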

dblock commented 9 months ago

Again... even those plugins... are not installed by default, so the guard rails are up and OpenSearch users have to explicitly opt in.

The k-nn plugin is installed by default when you download the distribution from opensearch.org or Docker; try this:

docker pull opensearchproject/opensearch:latest
docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:latest
[2023-08-28T22:03:01,705][INFO ][o.o.p.PluginsService     ] [08af92867f17] loaded module [geo]
...
[2023-08-28T22:03:01,707][INFO ][o.o.p.PluginsService     ] [08af92867f17] loaded plugin [opensearch-knn]
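
For what it's worth, you can also confirm what a running node has loaded via the _cat API (flags and credentials here assume the demo security defaults of that era):

# lists the plugins (as opposed to modules) installed on each node:
curl -sku admin:admin https://localhost:9200/_cat/plugins
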
dblock commented 9 months ago

We use them on nominations. We also use them in our Review then Commit policy on PRs where we require at least one approval and no changes / unresolved conversations (blocking vetos). The ASF has a good definition of Review then Commit I think we should lean into.

If it's not obvious yet, I have a strong (veto, -1) opinion about breaking semver. We absolutely enforce it with functionality, external APIs (REST), and more. Where did we not? So this poses a certain dilemma about how to decide, no?

nknize commented 9 months ago

We absolutely enforce it with functionality, external APIs (REST), and more. Where did we not?

We try to support it in the DSL (as long as a developer remembers to put in .onOrAfter() checks), and we get it as part of Lucene's index file metadata checks. Beyond that we have little to no support for semver other than maintainers reviewing PRs for violations. That clearly isn't working, or this wouldn't have happened. But it did, and because it slipped through we need to hot patch.
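
A minimal sketch of that gating pattern, for readers unfamiliar with it (the class, fields, and the choice of version are hypothetical, not actual OpenSearch code; the StreamOutput package also moved around in the 2.x line):

import java.io.IOException;
import org.opensearch.Version;
import org.opensearch.common.io.stream.StreamOutput;

class ExampleRequestSerializer {
    private final String existingField = "always-written";
    private final String newField = "added-in-a-minor";   // hypothetical new field

    // Forget the onOrAfter() gate and older nodes in a mixed-version
    // cluster fail to parse the stream -- the manual step described above.
    void writeTo(StreamOutput out) throws IOException {
        out.writeString(existingField);
        if (out.getVersion().onOrAfter(Version.V_2_9_0)) {
            out.writeOptionalString(newField);
        }
    }
}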

reta commented 9 months ago

@reta Maybe, but we did. We can call it a mistake, but it's too late, because now we would have to break semver. I don't understand how we can claim it doesn't when a user that enabled this feature cannot upgrade without breaking their system.

We do break semver and I feel bad about it. Why I think it is not too late: we made just one release with ZSTD, and yes, this will affect some users, but in the end it will reward more users: the feature will be promoted to GA when it is at least well tested and verified. Just my few cents, but I share your sentiment - I don't like that we have to do this, but there are possible risks we should better be ready for.

andrross commented 9 months ago

...maintainers reviewing PRs for any violations. That's clearly not working as this wouldn't have happened. But it did and because it slipped through we need to hot patch.

I don't think this was a case of accidental promotion and I don't think the promotion itself was a violation of semver, as we do have precedent for launching new opt-in features in minor versions that previous versions would not be able to handle.

However, we have absolutely identified gaps since the launch of this feature. Some have been mitigated, but to summarize the outstanding technical issues from the comments above:

  1. Unbounded off-heap allocation: there is no circuit breaker (or MaxDirectMemorySize default) to stop the OS from killing a node when direct memory is exhausted.
  2. No rollback path: an index written with the ZStd codec cannot be read after a downgrade, so the feature cannot be backed out in a rolling upgrade.
  3. Portability: the hard dependency on the natively compiled zstd-jni library is exposed to glibc and platform differences.

That brings me back to my question: is this feature safe for production usage after the patch? @nknize obviously feels strongly that the answer is no, and I think @reta is in the "no" camp as well. @backslasht / @shwetathareja do you have any counter points to mitigate these concerns?

nknize commented 9 months ago

I don't think this was a case of accidental promotion.

The promotion in and of itself is not a violation of semver. But had the decision to promote been weighed against the long-tail semver requirements of deprecation and removal, concerns absolutely would have been raised and the premature promotion prevented. Now we're in the position of hot patching to demote, which is not uncharted territory in the software world, nor is it that complicated. So I don't understand the position of using semver as an argument to carry a trappy, untested feature that never should have been promoted in the first place. It was a mistake to promote this untested feature; mistakes happen. Hot patch to fix the mistake, let users that care opt in, and move on. I'm amazed it's this controversial.

nknize commented 9 months ago

One option to consider to prevent the OS from force killing the JVM due to direct memory OOM would be to add a default -XX:MaxDirectMemorySize= at startup. The default chosen matters (e.g., at Elastic we chose 0.5 * max heap, alongside the recommendation that heap be no more than 50% of total machine memory; this caused some user bug reports when it was first introduced, however). The side effect would be increased page cache misses on MMapDirectory, which is now in the critical path for any future Lucene files not explicitly set to use NIOFSDirectory. Paper cuts add up...
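
A sketch of what that could look like in config/jvm.options (values purely illustrative, not a recommendation):

# Cap direct (off-heap) memory so exhaustion surfaces as a JVM
# OutOfMemoryError instead of the OS OOM-killing the node.
-Xms4g
-Xmx4g
-XX:MaxDirectMemorySize=2g    # e.g., 0.5 * max heap as discussed above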

backslasht commented 9 months ago

I agree with @andrross; this was not accidentally promoted. There was no guideline on native code usage within core when the original PR was approved and subsequently released in 2.9. As a community, I'm ready to accept new restrictions (no native usage in core), provided we have a discussion around that.

Taking a step back, there are at least two different things being discussed here.

  1. Moving the codec to a plugin.
  2. Calling it experimental.

On #1, I agree with @dblock that this breaks semver, and we shouldn't do it unless we strongly feel there should be no native code in core. On #2, there were valid concerns about the memory leak, which is fixed now. Beyond that, there are efforts in opensearch-project/OpenSearch#9502 to measure for leaks, which are still in progress. I think we should wait for that to complete and call the feature experimental only if we still see memory related issues. If you think we need to do more tests, let's add them to opensearch-project/OpenSearch#9502.
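
For anyone wanting to watch off-heap usage while those tests run, one option (my suggestion, not part of #9502) is the JVM's Native Memory Tracking plus jcmd:

# add to jvm.options (carries some runtime overhead):
-XX:NativeMemoryTracking=summary
# then snapshot JVM-managed native allocations for a running node:
jcmd <pid> VM.native_memory summary
# caveat: raw mallocs from JNI libraries such as zstd-jni are invisible to
# NMT; only JVM-managed memory (e.g., direct ByteBuffers) is reported.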

shwetathareja commented 9 months ago

Working backward from users: the ZSTD codec is not enabled by default and is still optional for users to enable per index. And if a user does try it, at that point it doesn't matter whether the code was distributed in core or in a plugin; it will run in the same process. What is important (as others also called out) is: how do we ensure it is production ready? Why do you think we will not have the same conversation in 3.0 if we mark it experimental now? We should run long-running tests with different workloads to catch any memory leak issues, and also brainstorm what kind of testing would give us sufficient confidence.
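
For context, opting in today is a per-index setting, along these lines (index name illustrative; zstd_no_dict is the dictionary-free variant):

# enable the ZStd codec for a single new index:
curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
  "settings": { "index.codec": "zstd" }
}'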

shamil commented 9 months ago

My 2 cents as an OpenSearch user that runs multiple clusters in production. It's surprising to hear that this feature was promoted accidentally, since there was a blog post where zstd was clearly highlighted. To me it seems like this feature will benefit many users, as it is both a storage saving and a performance improvement. We enabled the zstd codec as soon as we upgraded to 2.9, although we reverted it because it caused issues for us due to the memory leak; we are still looking forward to using it once a patch version is released.

mikemccand commented 9 months ago

Hindsight is of course 20/20, but....:

Unfortunately this code was prematurely released GA when it was migrated as a top level module a short time later in opensearch-project/OpenSearch#7908 without any feature flags.

Besides discussions about native code in core, and semver (which we already don't succeed in testing/achieving/enforcing today?) somehow weirdly being able to block fixing a clear mistake, this seems quite reckless: making such a major index-format-impacting, native-code-wrapping change so shortly after the feature was born into Sandbox, without gating it behind a feature flag and marking the Codec as experimental?

We should step back and understand why this happened and how to be more careful/explicit in the future. Things start and live and mature in Sandbox for a good reason: serious issues like memory leaks and severe bugs get ferreted out there. When things get promoted out of Sandbox, they are marked experimental and gated with opt-in feature flags. When they are adopted and loved by users, and there are not serious issues, the experimental caveat can be removed and the feature lives on.

It also sounds like we have poor testing around Zstd. Why didn't our tests find these issues? Does OpenSearch randomly rotate all Codecs when running tests, like Lucene's tests do? We shouldn't promote something as fundamental as Zstd out of Sandbox without strong test coverage; that needs to be addressed too. What other issues are lurking in this Zstd implementation? We cannot claim this is production ready if we have not tested it properly. It is still experimental and we should tell our users so.
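
(For reference, Lucene's test framework rotates codecs via the tests.codec system property; assuming OpenSearch's Gradle build passes it through the way Elasticsearch's did, randomizing or pinning the codec might look like:)

# run server tests with a randomly rotated codec, or pin one explicitly:
./gradlew ':server:test' -Dtests.codec=random
./gradlew ':server:test' -Dtests.codec=Lucene95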

What about backwards compatibility? Does OpenSearch guarantee this? The release policy description does not seem to say one way or the other about the search index, or perhaps implies that major releases can break searching of indices from the last major release? But if so, how is a rolling upgrade done? If a user builds a Zstd-compressed index today, can they continue to search it in future major OpenSearch releases? For how long? (Lucene has strict policies here: an index written by the default Codec will still be readable in the next major release series, but not beyond.)

Clearly Zstd is an important feature! Better and faster compression. Of course all users would choose this :) I hope Lucene will adopt it by default, eventually. But it's clearly not there yet, and OpenSearch users should know they are explicitly opting into an experimental feature in the meantime.

Anyway, in addition to fixing the root cause here ("why did we so recklessly promote a feature out of Sandbox w/o proper testing, gating with a feature flag, marking experimental"), +1 to @nknize's opening proposal -- gate Zstd behind an opt-in feature flag, make it clear that it's still experimental, no back compat promise, etc. in a quick bug fix release. With such severe issues (memory leaks, exceptions) users cannot possibly succeed in using this in production anyways, so the semver argument against correcting this mistake makes zero sense to me.