@peternied taking a look at this data, it looks like we have significant degradation in index & query latencies across the board. Index requests per second are down ~10%, mean index latencies are up 11% (p50 9%, p90 11%, p99 37%, p100 46%), and mean query latency is up 116% (p50 116%, p90 108%, p99 106%, p100 118%).
Looking into what went on here, but it looks like the bundle for 1.1 contains different components than 1.2. Can we get a run against the same components that were included with the 1.1 bundle to get a baseline?
Working with Marc to investigate, OpenSearch-1-2-0--982-x64-disable
Going to inspect this host configuration and see if we can pull slow logs or spot other issues
We are going to deploy min versions for testing, using the 1.1.0 release min version, https://ci.opensearch.org/ci/dbc/builds/1.1.0/405/x64/bundle/opensearch-min-1.1.0-linux-x64.tar.gz
Started a CDK deployment for the min version of 1.2.0 following the performance setup process; as soon as that is done, will repeat for 1.1.0.
Note we are using build 982 to make sure we have side-by-side comparisons with the other test results.
Working on getting Rally set up to run the min comparison locally; will report back.
New environment is up, starting on the 1.1.0
OpenSearch-1-2-0--982-ROUND2-x64-min.LogGroupName = OpenSearch-1-2-0--982-ROUND2-x64-min-11-17-21-t20-13
OpenSearch-1-2-0--982-ROUND2-x64-min.PrivateIp = 172.XXX.XXX.136
Both environments are now up
OpenSearch-1-1-0--405-ROUND2-x64-min.LogGroupName = OpenSearch-1-1-0--405-ROUND2-x64-min-11-17-21-t20-22
OpenSearch-1-1-0--405-ROUND2-x64-min.PrivateIp = 172.XXX.XXX.132
OpenSearch-1-2-0--982-x64-disable
Investigation notes: TL;DR - nothing looks out of place
Checking the logs
sh-4.2$ ls -al
total 229760
drwxr-xr-x 2 ec2-user ec2-user 4096 Nov 17 00:01 .
drwxr-xr-x 11 ec2-user ec2-user 265 Nov 11 22:56 ..
-rw-r--r-- 1 ec2-user ec2-user 4228 Nov 12 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.json.gz
-rw-r--r-- 1 ec2-user ec2-user 3972 Nov 12 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.log.gz
-rw-r--r-- 1 ec2-user ec2-user 3202 Nov 13 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz
-rw-r--r-- 1 ec2-user ec2-user 2569 Nov 13 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz
-rw-r--r-- 1 ec2-user ec2-user 2458 Nov 14 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-13-1.json.gz
-rw-r--r-- 1 ec2-user ec2-user 1855 Nov 14 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-13-1.log.gz
-rw-r--r-- 1 ec2-user ec2-user 2474 Nov 15 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-14-1.json.gz
-rw-r--r-- 1 ec2-user ec2-user 1887 Nov 15 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-14-1.log.gz
-rw-r--r-- 1 ec2-user ec2-user 2457 Nov 16 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-15-1.json.gz
-rw-r--r-- 1 ec2-user ec2-user 1848 Nov 16 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-15-1.log.gz
-rw-r--r-- 1 ec2-user ec2-user 2423 Nov 17 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-16-1.json.gz
-rw-r--r-- 1 ec2-user ec2-user 1824 Nov 17 00:01 OpenSearch-1-2-0--982-x64-disable-2021-11-16-1.log.gz
-rw-r--r-- 1 ec2-user ec2-user 39422 Nov 17 20:31 OpenSearch-1-2-0--982-x64-disable.log
-rw-r--r-- 1 ec2-user ec2-user 463 Nov 11 23:27 OpenSearch-1-2-0--982-x64-disable_deprecation.json
-rw-r--r-- 1 ec2-user ec2-user 177944 Nov 12 07:44 OpenSearch-1-2-0--982-x64-disable_deprecation.log
-rw-r--r-- 1 ec2-user ec2-user 0 Nov 11 22:56 OpenSearch-1-2-0--982-x64-disable_index_indexing_slowlog.json
-rw-r--r-- 1 ec2-user ec2-user 0 Nov 11 22:56 OpenSearch-1-2-0--982-x64-disable_index_indexing_slowlog.log
-rw-r--r-- 1 ec2-user ec2-user 0 Nov 11 22:56 OpenSearch-1-2-0--982-x64-disable_index_search_slowlog.json
-rw-r--r-- 1 ec2-user ec2-user 0 Nov 11 22:56 OpenSearch-1-2-0--982-x64-disable_index_search_slowlog.log
-rw-r--r-- 1 ec2-user ec2-user 105978 Nov 17 20:31 OpenSearch-1-2-0--982-x64-disable_server.json
-rw-r--r-- 1 ec2-user ec2-user 33541769 Nov 17 15:44 gc.log
-rw-r--r-- 1 ec2-user ec2-user 2017 Nov 11 22:56 gc.log.00
-rw-r--r-- 1 ec2-user ec2-user 67108938 Nov 12 01:47 gc.log.01
-rw-r--r-- 1 ec2-user ec2-user 67108920 Nov 12 04:08 gc.log.02
-rw-r--r-- 1 ec2-user ec2-user 67108897 Nov 12 06:30 gc.log.03
Looking for any unexpected errors, warnings, or exceptions. Nothing stands out that would cause any issues.
sh-4.2$ zgrep ERROR *
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.json.gz:{"type": "server", "timestamp": "2021-11-11T23:28:58,037Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.log.gz:[2021-11-11T23:28:58,037][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz:{"type": "server", "timestamp": "2021-11-12T01:07:23,159Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz:{"type": "server", "timestamp": "2021-11-12T02:46:18,645Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz:{"type": "server", "timestamp": "2021-11-12T04:25:58,910Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz:{"type": "server", "timestamp": "2021-11-12T06:05:24,176Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz:[2021-11-12T01:07:23,159][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz:[2021-11-12T02:46:18,645][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz:[2021-11-12T04:25:58,910][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz:[2021-11-12T06:05:24,176][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
sh-4.2$ zgrep WARN *
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.json.gz:{"type": "server", "timestamp": "2021-11-11T22:56:45,679Z", "level": "WARN", "component": "o.o.s.OpenSearchSecurityPlugin", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "OpenSearch Security plugin installed but disabled. This can expose your configuration (including passwords) to the public." }
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.json.gz:{"type": "server", "timestamp": "2021-11-11T22:56:50,499Z", "level": "WARN", "component": "o.o.g.DanglingIndicesState", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually" }
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.json.gz:{"type": "server", "timestamp": "2021-11-11T22:56:51,096Z", "level": "WARN", "component": "o.o.b.BootstrapChecks", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured" }
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.json.gz:{"type": "server", "timestamp": "2021-11-11T22:56:54,337Z", "level": "WARN", "component": "o.o.p.c.s.h.ConfigOverridesClusterSettingHandler", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "Config override setting update called with empty string. Ignoring." }
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.log.gz:[2021-11-11T22:56:45,679][WARN ][o.o.s.OpenSearchSecurityPlugin] [ip-172-31-50-81.eu-west-1.compute.internal] OpenSearch Security plugin installed but disabled. This can expose your configuration (including passwords) to the public.
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.log.gz:[2021-11-11T22:56:50,499][WARN ][o.o.g.DanglingIndicesState] [ip-172-31-50-81.eu-west-1.compute.internal] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.log.gz:[2021-11-11T22:56:51,096][WARN ][o.o.b.BootstrapChecks ] [ip-172-31-50-81.eu-west-1.compute.internal] the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.log.gz:[2021-11-11T22:56:54,337][WARN ][o.o.p.c.s.h.ConfigOverridesClusterSettingHandler] [ip-172-31-50-81.eu-west-1.compute.internal] Config override setting update called with empty string. Ignoring.
sh-4.2$ zgrep xception *
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.json.gz:{"type": "server", "timestamp": "2021-11-11T22:56:44,236Z", "level": "INFO", "component": "o.o.n.Node", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "JVM arguments [-Xshare:auto, -Dopensearch.networkaddress.cache.ttl=60, -Dopensearch.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -XX:+ShowCodeDetailsInExceptionMessages, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -Xms1g, -Xmx1g, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/tmp/opensearch-7923930772569516639, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Dclk.tck=100, -Djdk.attach.allowAttachSelf=true, -Djava.security.policy=/opensearch-1.2.0/plugins/opensearch-performance-analyzer/pa_config/opensearch_security.policy, -XX:MaxDirectMemorySize=536870912, -Dopensearch.path.home=/opensearch-1.2.0, -Dopensearch.path.conf=/opensearch-1.2.0/config, -Dopensearch.distribution.type=tar, -Dopensearch.bundled_jdk=true]" }
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.json.gz:{"type": "server", "timestamp": "2021-11-11T23:28:58,037Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.log.gz:[2021-11-11T22:56:44,236][INFO ][o.o.n.Node ] [ip-172-31-50-81.eu-west-1.compute.internal] JVM arguments [-Xshare:auto, -Dopensearch.networkaddress.cache.ttl=60, -Dopensearch.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -XX:+ShowCodeDetailsInExceptionMessages, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -Xms1g, -Xmx1g, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/tmp/opensearch-7923930772569516639, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Dclk.tck=100, -Djdk.attach.allowAttachSelf=true, -Djava.security.policy=/opensearch-1.2.0/plugins/opensearch-performance-analyzer/pa_config/opensearch_security.policy, -XX:MaxDirectMemorySize=536870912, -Dopensearch.path.home=/opensearch-1.2.0, -Dopensearch.path.conf=/opensearch-1.2.0/config, -Dopensearch.distribution.type=tar, -Dopensearch.bundled_jdk=true]
OpenSearch-1-2-0--982-x64-disable-2021-11-11-1.log.gz:[2021-11-11T23:28:58,037][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz:{"type": "server", "timestamp": "2021-11-12T01:07:23,159Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz:{"type": "server", "timestamp": "2021-11-12T02:46:18,645Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz:{"type": "server", "timestamp": "2021-11-12T04:25:58,910Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.json.gz:{"type": "server", "timestamp": "2021-11-12T06:05:24,176Z", "level": "ERROR", "component": "o.o.i.i.ManagedIndexCoordinator", "cluster.name": "OpenSearch-1-2-0--982-x64-disable", "node.name": "ip-172-31-50-81.eu-west-1.compute.internal", "message": "get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]", "cluster.uuid": "MJAA_p5XQsGUjpt3WPrfRw", "node.id": "palPtzqkROCcgYM8yYfbfA" }
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz:[2021-11-12T01:07:23,159][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz:[2021-11-12T02:46:18,645][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz:[2021-11-12T04:25:58,910][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable-2021-11-12-1.log.gz:[2021-11-12T06:05:24,176][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-50-81.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--982-x64-disable
Investigation notes continued: checking the installed plugins
sh-4.2$ sudo ./opensearch-plugin list
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-sql
Disabling observability for testing as it's new for 1.2.0
sh-4.2$ sudo ./opensearch-plugin remove opensearch-observability
-> removing [opensearch-observability]...
-> preserving plugin config files [/opensearch-1.2.0/config/opensearch-observability] in case of upgrade; use --purge if not needed
Will kick off tests against this instance as well
Additional data from our internal performance data service:
While default query / distance_amount_agg were marginally impacted, indexing took a bigger hit, and range, autohisto_agg, and date_histogram_agg were double or worse.
b61240fa-52f6-4dd9-819c-4ac432f69b6a
f76c89f3-24ce-4c30-a285-209b750f3d68
After deploying the min instances, they needed to be manually started; that has been done.
Started up the percolator test case to get data more quickly against these four configurations: 1.1.0-min, 1.2.0-min, 1.2.0 without observability, 1.2.0-plain
'testExecutionId': 'ccb1f045-ccc6-4251-a717-fcdd14fc6878'
'testExecutionId': 'f3c09830-7e2a-44f0-a678-4d4674ed804e'
'testExecutionId': '60b83bee-30c9-46ac-8368-3a1605846269'
'testExecutionId': '9b3863dc-cc50-4c22-be1c-198d88c088cc'
While waiting for results, saw a difference between the number of Queued tasks
'testExecutionId': 'ccb1f045-ccc6-4251-a717-fcdd14fc6878' - 1.2 - MIN
'testExecutionId': 'f3c09830-7e2a-44f0-a678-4d4674ed804e' - 1.2 - bundle without the observability plugin, to match the list of installed plugins in the 1.1 bundle.
'testExecutionId': '60b83bee-30c9-46ac-8368-3a1605846269' - 1.2 - full bundle.
'testExecutionId': '9b3863dc-cc50-4c22-be1c-198d88c088cc' - 1.1 - MIN
Details on how the queries that regressed are executed, using the nyc_taxis dataset run by the ESRally client.
Range query from 1.1 vs 1.2 showed >100% increase in latency (https://github.com/opensearch-project/OpenSearch/issues/1560#issuecomment-972099825)
Operation info from the test
{
  "name": "range",
  "operation-type": "search",
  "body": {
    "query": {
      "range": {
        "total_amount": {
          "gte": 5,
          "lt": 15
        }
      }
    }
  }
}
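For reference, the same range query can be issued directly against a node with the OpenSearch Java high-level REST client to spot-check latency outside of rally. This is only a sketch; the endpoint (the security-disabled node on localhost:9200) is an assumption, not the actual test harness setup.

import org.apache.http.HttpHost;
import org.opensearch.action.search.SearchRequest;
import org.opensearch.action.search.SearchResponse;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestClient;
import org.opensearch.client.RestHighLevelClient;
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.search.builder.SearchSourceBuilder;

public class RangeQueryCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point this at the cluster under test.
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Same body the rally "range" operation sends: total_amount in [5, 15).
            SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.rangeQuery("total_amount").gte(5).lt(15));
            SearchRequest request = new SearchRequest("nyc_taxis").source(source);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            // "took" is the server-side time for this single query.
            System.out.println("took=" + response.getTook() + " hits=" + response.getHits().getTotalHits());
        }
    }
}

A single ad-hoc query like this only gives a rough signal; the rally runs above remain the source of truth for the latency percentiles.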
The percolator results are inconclusive. We are going to run additional tests overnight with the nyc_taxis dataset and the same configurations.
1.1 min distro
1.2 min distro
1.2 with same bundle configuration as 1.1.
1.2 full bundle
The regression appears for both architectures (arm/x64) and with/without security enabled, so for these tests we will pick a configuration with security enabled and use x64.
We are also going to run against older builds of 1.2 between Sept 14th (around when 1.x was bumped to 1.2) and the latest on Nov 17th to try and identify when this could have been introduced.
#329 October 1st - Changes between Sep 14th (1.2 version bump) and Oct 1st.
Location: https://ci.opensearch.org/ci/dbc/bundles/1.2.0/329/x64/opensearch-1.2.0-linux-x64.tar.gz
#578 October 12th - Indexing back pressure changes introduced
Location: https://ci.opensearch.org/ci/dbc/bundles/1.2.0/578/x64/opensearch-1.2.0-linux-x64.tar.gz - contains alerting & job-scheduler
#721 October 22nd - Before Lucene upgrade to 8.10.1
Location: https://ci.opensearch.org/ci/dbc/bundles/1.2.0/721/x64/opensearch-1.2.0-linux-x64.tar.gz - alerting, job-scheduler, knn
#809 October 30th - After Lucene upgrade to 8.10.1.
Location: https://ci.opensearch.org/ci/dbc/bundle-build/1.2.0/809/linux/x64/dist/opensearch-1.2.0-linux-x64.tar.gz - alerting, job-scheduler, knn
#868 Nov 3rd - Core feature freeze
Location: https://ci.opensearch.org/ci/dbc/bundle-build/1.2.0/868/linux/x64/dist/opensearch-1.2.0-linux-x64.tar.gz - alerting, job-scheduler, knn, security
#957 Nov 10th - Changes made after component feature freeze.
Location: https://ci.opensearch.org/ci/dbc/bundle-build/1.2.0/957/linux/x64/dist/opensearch-1.2.0-linux-x64.tar.gz - full bundle
#1052 Nov 17th - Full 1.2 - includes the latest commit to 1.2 on the 16th.
Location: https://ci.opensearch.org/ci/dbc/bundle-build/1.2.0/1052/linux/x64/dist/opensearch-1.2.0-linux-x64.tar.gz - full bundle
I compared the 1.1 and 1.2 performance runs with respect to the queries being slow on 1.2: the performance test created the nyc_taxis index with 5 shards on the 1.1 runs and with 1 shard on the 1.2 runs. This index has ~165 million docs, so on 1.1 with 5 shards (4.5 GB each) the query ran faster than on 1.2 with a single 23 GB shard. We need to rerun the 1.2 performance test with 5 shards and compare again.
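To rule out the shard-count difference, the index can be created explicitly with 5 primaries before indexing starts. Below is a minimal sketch with the Java high-level REST client; the endpoint and settings are assumptions, and in the actual harness the shard count would normally come from the rally track parameters rather than a manual create.

import org.apache.http.HttpHost;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestClient;
import org.opensearch.client.RestHighLevelClient;
import org.opensearch.client.indices.CreateIndexRequest;
import org.opensearch.client.indices.CreateIndexResponse;
import org.opensearch.common.settings.Settings;

public class CreateNycTaxisIndex {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Match the 1.1 test layout: 5 primaries, no replicas on a single node.
            CreateIndexRequest request = new CreateIndexRequest("nyc_taxis")
                .settings(Settings.builder()
                    .put("index.number_of_shards", 5)
                    .put("index.number_of_replicas", 0));
            CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
            System.out.println("acknowledged=" + response.isAcknowledged());
        }
    }
}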
Excellent work @skkosuri-amzn !
I have reviewed the data from the overnight runs that I scheduled, and all of them show the same query latency characteristics, which agrees with your findings.
Since we have both the older EC2 instance around and the new ones, I have scheduled shorter-running nyc_taxis tests on our `perf-x64-1-1-0-with-security` and `OpenSearch-1-2-0--ROUND3-1052-x64-enable` instances.
I can also confirm that I see a single shard on both tests that are in flight.
sh-4.2$ curl localhost:9200/_cat/shards
security-auditlog-2021.10.01 0 p STARTED 44 108.1kb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
security-auditlog-2021.10.01 0 r UNASSIGNED
security-auditlog-2021.10.05 0 p STARTED 3 38.1kb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
security-auditlog-2021.10.05 0 r UNASSIGNED
security-auditlog-2021.11.18 0 p STARTED 7 98.1kb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
security-auditlog-2021.11.18 0 r UNASSIGNED
nyc_taxis 2 p STARTED 33069515 4.5gb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
nyc_taxis 1 p STARTED 33065507 4.5gb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
nyc_taxis 4 p STARTED 33067528 4.5gb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
nyc_taxis 3 p STARTED 33072786 4.5gb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
nyc_taxis 0 p STARTED 33071356 4.5gb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
security-auditlog-2021.09.30 0 p STARTED 79 190.3kb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
security-auditlog-2021.09.30 0 r UNASSIGNED
.opendistro_security 0 p STARTED 9 60.1kb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
cloudtrail 0 p STARTED 1 22.9kb 127.0.0.1 ip-172-31-60-105.eu-west-1.compute.internal
cloudtrail 0 r UNASSIGNED
sh-4.2$ curl -k -u admin:admin https://localhost:443/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open security-auditlog-2021.10.01 1wpuHB6XS1eGbMUwq3GlVQ 1 1 44 0 108.1kb 108.1kb
yellow open security-auditlog-2021.09.30 jy9zZAxTRgav5SvCrWpkGw 1 1 79 0 190.3kb 190.3kb
green open nyc_taxis jyd6SVw0RZ-kFEHvshV--Q 1 0 4351051 0 718mb 718mb
yellow open security-auditlog-2021.10.05 nUliHhxdQtezIgsGTa-8NQ 1 1 3 0 38.1kb 38.1kb
yellow open cloudtrail _Mmhk8IZSWuu0pRsFb_0pQ 1 1 1 0 22.9kb 22.9kb
yellow open security-auditlog-2021.11.18 pzTM2CGIRy2cit7XGW2DnQ 1 1 31 0 195.6kb 195.6kb
green open .opendistro_security XNl23i-eSbCBMHDgQA-wag 1 0 9 0 60.1kb 60.1kb
sh-4.2$ curl -k -u admin:admin https://localhost:443/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open nyc_taxis 3EkmG0s8R16xgAmqUi_e0A 1 1 4629113 0 1.5gb 1.5gb
green open .opendistro_security SO-LphAsTB29a6UucydozA 1 0 9 0 60kb 60kb
yellow open security-auditlog-2021.11.18 E1imAeDrT0iFrZeFg5h2GQ 1 1 90 0 451.9kb 451.9kb
sh-4.2$ curl -k -u admin:admin https://localhost:443/_cat/shards?v
index shard prirep state docs store ip node
.opendistro_security 0 p STARTED 9 60kb 127.0.0.1 ip-172-31-59-78.eu-west-1.compute.internal
nyc_taxis 0 p STARTED 7374873 2.3gb 127.0.0.1 ip-172-31-59-78.eu-west-1.compute.internal
nyc_taxis 0 r UNASSIGNED
security-auditlog-2021.11.18 0 p STARTED 90 257.5kb 127.0.0.1 ip-172-31-59-78.eu-west-1.compute.internal
security-auditlog-2021.11.18 0 r UNASSIGNED
Tests on the OpenSearch-1-2-0--ROUND3-1052-x64-enable/Instance
are not viable because the nyc_taxis index was not properly created at test start:
...
2021-11-18 12:42:53,414 -not-actor-/PID:6118 elasticsearch WARNING PUT https://172.31.59.78:443/nyc_taxis [status:400 request:0.240s]
...
Nothing useful in the host logs that I can see
sh-4.2$ grep ERROR *.log
OpenSearch-1-2-0--ROUND3-1052-x64-enable.log:[2021-11-18T04:07:30,160][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-59-78.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--ROUND3-1052-x64-enable.log:[2021-11-18T05:55:11,819][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-59-78.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--ROUND3-1052-x64-enable.log:[2021-11-18T07:42:12,513][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-59-78.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--ROUND3-1052-x64-enable.log:[2021-11-18T09:28:48,305][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-59-78.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--ROUND3-1052-x64-enable.log:[2021-11-18T11:16:29,195][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-59-78.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--ROUND3-1052-x64-enable.log:[2021-11-18T12:42:52,739][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-59-78.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
OpenSearch-1-2-0--ROUND3-1052-x64-enable.log:[2021-11-18T12:43:59,577][ERROR][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-59-78.eu-west-1.compute.internal] get managed-index failed: [.opendistro-ism-config] IndexNotFoundException[no such index [.opendistro-ism-config]]
Looking at the settings for the index, it was auto-generated, so fields like dropoff_datetime were not mapped with their intended types (the datetime fields came out as text), making those results unusable
sh-4.2$ curl -k -u admin:admin https://localhost:443/nyc_taxis
{"nyc_taxis":{"aliases":{},"mappings":{"properties":{"dropoff_datetime":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"dropoff_location":{"type":"float"},"extra":{"type":"float"},"fare_amount":{"type":"float"},"improvement_surcharge":{"type":"float"},"mta_tax":{"type":"float"},"passenger_count":{"type":"long"},"payment_type":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"pickup_datetime":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"pickup_location":{"type":"float"},"rate_code_id":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"store_and_fwd_flag":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"tip_amount":{"type":"float"},"tolls_amount":{"type":"float"},"total_amount":{"type":"float"},"trip_distance":{"type":"float"},"trip_type":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"vendor_id":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"settings":{"index":{"creation_date":"1637239439791","number_of_shards":"1","number_of_replicas":"1","uuid":"3EkmG0s8R16xgAmqUi_e0A","version":{"created":"135237827"},"provided_name":"nyc_taxis"}}}}
While the test methodology is slightly different, here is a comparison between 1.1 (one shard) vs the original results from 1.2 (one shard).
range, autohisto_agg, and date_histogram_agg are all significantly closer: the +100% variance has dialed down to +-2% for P50 latencies.
Indexing latency is heavily skewed towards the 1.1 results, coming in a whole second faster in both P50/P90.
In order to have a basis of comparison for 1.1 and rule out test configuration differences, I've triggered 4 new tests against OpenSearch 1.1; they should be complete in ~8 hours.
Some quick observations:
In IndexingPressureService.java#L29:
public Releasable markCoordinatingOperationStarted(long bytes, boolean forceExecution) {
    if (isShardIndexingPressureEnabled() == false) {
        return shardIndexingPressure.markCoordinatingOperationStarted(bytes, forceExecution);
    } else {
        return () -> {};
    }
}
This is called from TransportBulkAction.java#L215:
final Releasable releasable = indexingPressureService.markCoordinatingOperationStarted(indexingBytes, isOnlySystem);
Which does some seemingly innocuous atomic operations per shard (but could nevertheless amortize):
long combinedBytes = this.currentCombinedCoordinatingAndPrimaryBytes.addAndGet(bytes);
long replicaWriteBytes = this.currentReplicaBytes.get();
long totalBytes = combinedBytes + replicaWriteBytes;
...
Bulk indexing is the default indexing runner for nyc_taxi rally track:
def register_default_runners():
    register_runner(track.OperationType.Bulk, BulkIndex(), async_runner=True)
    ...
Which means if ShardIndexingPressure is actually disabled by default... those markCoordinatingOperationStarted atomic calls are most certainly happening during rally bulk indexing.
We may only catch amortized performance degradation when trying to bulk index large volumes of data, which is exactly what the above benchmark does (~165M docs with a ~1 second increase in latency).
I think it would be worthwhile to change:
if (isShardIndexingPressureEnabled() == false) {
to
if (isShardIndexingPressureEnabled() == true) {
/cc @getsaurabh02 @Bukhtawar
I ran the stackoverflow dataset against a single node for 1.2 min and 1.2 min without ShardIndexingPressure. Each of these ran 2 warmup iterations and 3 test iterations.
These are the high level results.
There is a ~20% drop in p50, p90, and p99 index latency with ShardIndexingPressure removed, and a corresponding ~23% increase in p50 throughput.
Test id: 1f7ccb31-34a7-4699-85c0-5c88e3856ea8
Description: 1.2.0-min distribution
Latency:
p50: 3152.50
p90: 4515.80
p99: 6337.80
req/s
p0: 11948
p50: 12812
p100: 16081
Test id: 9e2b7f90-c599-4947-828b-0e19eff3069f
Description: without ShardIndexingPressure
Latency:
p50: 2584.60
p90: 3735.20
p99: 5237.50
req/s
p0: 14462.10
p50: 15744.60
p100: 19340.30
I also executed a single run against each node using esrally directly - the results are consistent with the above, baseline being 1.2 and contender being 1.2 with the revert.
Comparing baseline
Race ID: 7e783b5e-cab6-4c4f-8b52-72be910c6426
Race timestamp: 2021-11-19 00:38:58
Challenge: append-no-conflicts
Car: external
with contender
Race ID: 35e9f845-cda2-44be-9a40-4b1b339dd389
Race timestamp: 2021-11-19 02:57:39
Challenge: append-no-conflicts
Car: external
------------------------------------------------------
_______ __ _____
/ ____(_)___ ____ _/ / / ___/_________ ________
/ /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \
/ __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/
/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/
------------------------------------------------------
| Metric | Task | Baseline | Contender | Diff | Unit |
|--------------------------------------------------------------:|-------------:|------------:|------------:|---------:|-------:|
| Cumulative indexing time of primary shards | | 191.977 | 145.938 | -46.0391 | min |
| Min cumulative indexing time across primary shard | | 37.5409 | 28.8728 | -8.66808 | min |
| Median cumulative indexing time across primary shard | | 38.5485 | 29.1799 | -9.3686 | min |
| Max cumulative indexing time across primary shard | | 39.0539 | 29.4679 | -9.586 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Cumulative merge time of primary shards | | 353.153 | 287.85 | -65.3033 | min |
| Cumulative merge count of primary shards | | 232 | 228 | -4 | |
| Min cumulative merge time across primary shard | | 56.3481 | 46.1437 | -10.2044 | min |
| Median cumulative merge time across primary shard | | 75.5159 | 58.632 | -16.8838 | min |
| Max cumulative merge time across primary shard | | 77.3443 | 63.9771 | -13.3672 | min |
| Cumulative merge throttle time of primary shards | | 205.53 | 177.643 | -27.8873 | min |
| Min cumulative merge throttle time across primary shard | | 29.8777 | 27.4785 | -2.39927 | min |
| Median cumulative merge throttle time across primary shard | | 45.1533 | 35.7799 | -9.37343 | min |
| Max cumulative merge throttle time across primary shard | | 46.2698 | 42.3912 | -3.8786 | min |
| Cumulative refresh time of primary shards | | 50.9837 | 37.9007 | -13.083 | min |
| Cumulative refresh count of primary shards | | 1240 | 1264 | 24 | |
| Min cumulative refresh time across primary shard | | 9.91115 | 7.36747 | -2.54368 | min |
| Median cumulative refresh time across primary shard | | 10.0904 | 7.6075 | -2.4829 | min |
| Max cumulative refresh time across primary shard | | 10.6602 | 7.82272 | -2.83748 | min |
| Cumulative flush time of primary shards | | 4.78753 | 4.1041 | -0.68343 | min |
| Cumulative flush count of primary shards | | 78 | 78 | 0 | |
| Min cumulative flush time across primary shard | | 0.863233 | 0.7556 | -0.10763 | min |
| Median cumulative flush time across primary shard | | 0.987917 | 0.804417 | -0.1835 | min |
| Max cumulative flush time across primary shard | | 1.03317 | 0.893617 | -0.13955 | min |
| Total Young Gen GC time | | 93.217 | 66.863 | -26.354 | s |
| Total Young Gen GC count | | 7072 | 5754 | -1318 | |
| Total Old Gen GC time | | 0 | 13.529 | 13.529 | s |
| Total Old Gen GC count | | 0 | 394 | 394 | |
| Store size | | 30.6503 | 36.1271 | 5.47678 | GB |
| Translog size | | 2.56114e-07 | 2.56114e-07 | 0 | GB |
| Heap used for segments | | 0.359074 | 0.39061 | 0.03154 | MB |
| Heap used for doc values | | 0.0402946 | 0.048172 | 0.00788 | MB |
| Heap used for terms | | 0.220184 | 0.238037 | 0.01785 | MB |
| Heap used for norms | | 0.0135498 | 0.0146484 | 0.0011 | MB |
| Heap used for points | | 0 | 0 | 0 | MB |
| Heap used for stored fields | | 0.0850449 | 0.0897522 | 0.00471 | MB |
| Segment count | | 111 | 120 | 9 | |
| Min Throughput | index-append | 11660.2 | 14891.9 | 3231.61 | docs/s |
| Median Throughput | index-append | 12814.5 | 16208.2 | 3393.69 | docs/s |
| Max Throughput | index-append | 14703.4 | 17601.7 | 2898.36 | docs/s |
| 50th percentile latency | index-append | 3186.7 | 2505.53 | -681.167 | ms |
| 90th percentile latency | index-append | 4741.39 | 3545.78 | -1195.6 | ms |
| 99th percentile latency | index-append | 6612.43 | 4739.79 | -1872.64 | ms |
| 99.9th percentile latency | index-append | 9216.19 | 6385.44 | -2830.75 | ms |
| 100th percentile latency | index-append | 9760.23 | 7530.21 | -2230.01 | ms |
| 50th percentile service time | index-append | 3186.7 | 2505.53 | -681.167 | ms |
| 90th percentile service time | index-append | 4741.39 | 3545.78 | -1195.6 | ms |
| 99th percentile service time | index-append | 6612.43 | 4739.79 | -1872.64 | ms |
| 99.9th percentile service time | index-append | 9216.19 | 6385.44 | -2830.75 | ms |
| 100th percentile service time | index-append | 9760.23 | 7530.21 | -2230.01 | ms |
| error rate | index-append | 0 | 0 | 0 | % |
Hi @nknize, thank you for sharing this detail. However, the atomic reference updates being discussed above exist even in version 1.1 and are executed by default. They came in along with the Indexing Pressure changes for all bulk indexing operations. These were actually introduced with the ES 7.9 release (and are executed by default); this is not a change specific to OS 1.2.
The reason it is restructured here and kept behind a dynamic setting (default false) now is to ensure that until Shard Indexing Pressure is specifically enabled, the Indexing Pressure feature continues to work as is.
It also ensures Shard Indexing Pressure does not add any form of execution overhead here as long as this setting is disabled, which I see being called out here. Given the code execution logic has not changed, I don't see this as the root cause here. We will need to identify other changes which could have impacted it, or re-run the tests a few times to avoid any discrepancy such as cold starts.
These were actually introduced with the ES 7.9 release (and are executed by default); this is not a change specific to OS 1.2.
Ah! Thanks for clarifying that @getsaurabh02 and that logic consistency makes sense (javadocs would help make this more clear).
Based on the latest benchmarking above we need to dig further into the source of the added indexing overhead. Those benchmarks w/ the PR revert are consistent across both data sets (nyc_taxi and stackOverflow), giving more plausibility to a regression introduced by the PR.
There is a ~20% drop in p50, p90, and p99 index latency with ShardIndexingPressure removed, and a corresponding ~23% increase in p50 throughput.
Hindsight being 20/20 the backports should've occurred incrementally w/ benchmarks for each change instead of the single sweeping 7K merge benchmarked at the zero hour. This is certainly making it more difficult to root cause.
The other area of concern I have is the additional "shard level accounting" logic added to TransportBulkAction, in particular the potentially expensive call to ramBytesUsed, which is accumulated regardless of whether Shard Indexing Pressure is enabled.
// Add the shard level accounting for coordinating and supply the listener
final boolean isOnlySystem = isOnlySystem(bulkRequest, clusterService.state().metadata().getIndicesLookup(), systemIndices);
final Releasable releasable = indexingPressureService.markCoordinatingOperationStarted(
    shardId,
    bulkShardRequest.ramBytesUsed(),
    isOnlySystem
);
In the case of the nyc_taxi dataset this is going to stream a window of 10000 DocWriteRequest objects per bulk call and accumulate the total RamUsage.
return SHALLOW_SIZE + Stream.of(items).mapToLong(Accountable::ramBytesUsed).sum();
If Shard Indexing Pressure is disabled this computation is wasted on a noop.
Rather than revert the whole PR for 1.2, we could consider wrapping this critical path in something like `if (indexingPressureService.isShardIndexingPressureEnabled()) {` and apply a more elegant solution for 1.3.
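To make the shape of that suggestion concrete, here is a standalone sketch of guarding the accounting behind the enabled check and deferring the size computation with a supplier. All names are hypothetical; this is not the OpenSearch source, just an illustration of the idea.

import java.util.function.LongSupplier;

// Hypothetical illustration: (a) skip the accounting entirely when shard
// indexing pressure is disabled, and (b) defer the expensive per-request
// size computation behind a supplier so it only runs on the enabled path.
public class AccountingGuardSketch {

    interface Releasable extends AutoCloseable {
        @Override void close();
    }

    static final Releasable NOOP = () -> {};

    static boolean shardIndexingPressureEnabled = false; // default in 1.2

    static Releasable markCoordinatingOperationStarted(LongSupplier requestBytes) {
        if (!shardIndexingPressureEnabled) {
            return NOOP;                        // no size computation at all
        }
        long bytes = requestBytes.getAsLong();  // computed only when the feature is on
        // ... account `bytes` against per-shard limits here ...
        return NOOP;
    }

    public static void main(String[] args) {
        // The supplier stands in for bulkShardRequest::ramBytesUsed, which walks
        // every DocWriteRequest in the batch (~10k items per bulk call for nyc_taxis).
        Releasable r = markCoordinatingOperationStarted(() -> {
            System.out.println("computing request size...");
            return 42L;
        });
        r.close(); // with the setting disabled, the message above never prints
    }
}

Whether a feature-flag check or a deferred computation is the right call for 1.2 vs. 1.3 is exactly the tradeoff discussed in the comments below.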
Hindsight being 20/20 the backports should've occurred incrementally
Hi @nknize, I think having a nightly CI that benchmarks and compares the results with the previous build would be a way to detect any regressions in general. Since we had incrementally pushed out the changes in the feature/1.x branch, we would have detected any such issue much earlier, even before the backport to the 1.2 branch.
If Shard Indexing Pressure is disabled this computation is wasted on a noop.
For bulkShardRequest.ramBytesUsed(): while I understand the computed request size is being passed to markCoordinatingOperationStarted of ShardIndexingPressure once for every request, it would definitely not lead to such a big overhead as the 20% stated in the runs above. The ramBytesUsed computations are done repeatedly at multiple places (replica, primary, and coordinator) in the code flow and such implications are not expected. These are present even in the older version of IndexingPressure and were never called out as an overhead. Here in OS 1.2, for every request the overhead would be the request size computation based on the number of documents in a batch. Given that this call has neither shown up in any of the CPU profiling nor been called out as a regression when it was first added to the indexing path during the ES 7.9 release, I don't see this as a smoking gun.
However, in order to maintain sanity between these operation calls and optimise this request size computation further, I have raised a PR to delay the computation until the point where it is absolutely necessary. This keeps the fix ready (to be merged if needed) and is functionally in line with your suggestion, without requiring any additional feature flag check: https://github.com/opensearch-project/OpenSearch/pull/1589
Can we run a JMH benchmark test for https://github.com/opensearch-project/OpenSearch/blob/c459282fd67ddb17dcc545ec9bcdc805880bcbec/server/src/main/java/org/opensearch/action/bulk/TransportBulkAction.java#L212 with and without the SIP setting on?
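An isolated JMH microbenchmark cannot exercise TransportBulkAction directly without a running node, but the per-bulk accounting cost in question can be sketched in isolation: summing per-item sizes (a stand-in for bulkShardRequest.ramBytesUsed()) versus skipping the work, as it would be when wrapped in the enabled check. The class below is only a sketch under those assumptions, not a benchmark of the real code path.

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class BulkAccountingBenchmark {

    @Param({"10000"})          // bulk size used by the nyc_taxis rally track
    public int items;

    public long[] itemSizes;   // stand-in for Accountable::ramBytesUsed of each DocWriteRequest

    @Setup
    public void setup() {
        itemSizes = new long[items];
        for (int i = 0; i < items; i++) {
            itemSizes[i] = ThreadLocalRandom.current().nextLong(200, 2000);
        }
    }

    @Benchmark
    public long accountingEnabled() {
        long total = 0;
        for (long size : itemSizes) {   // mirrors the mapToLong(...).sum() walk over the batch
            total += size;
        }
        return total;
    }

    @Benchmark
    public long accountingSkipped() {
        return 0L;                      // the path where the accounting is short-circuited
    }
}

Run with the standard JMH runner (e.g. org.openjdk.jmh.Main); the interesting question is whether the enabled path is measurable at all next to the actual indexing work.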
We (myself and @mch2) ran the Rally Tests again today, to compare the 1.2 changes with and without the shard indexing pressure commit. This was done using the same branch and just isolating one ShardIndexingPressure commit.
Since the previous tests were done against different distributions, it was not an equivalent setup. Here the numbers are pretty similar, and definitely not the alarming 20% called out in the earlier comment. The worst-case increase is around 2% for p100; however, since p99 is the same, it mitigates any risk.
Indexing Operation Latencies (ms) Comparison
Metric | Full 1.2 | With Revert Commit in 1.2 | Difference (%) |
---|---|---|---|
P50 | 2,587.4 | 2,569.2 | -0.70839 |
P90 | 3,773.4 | 3,686 | -2.37113 |
P99 | 5,083.5 | 5,077 | -0.12803 |
P100 | 8,736.7 | 8,564 | -2.01658 |
Throughput (req/s) Comparison
Metric | Full 1.2 | With Revert Commit in 1.2 | Difference (%) |
---|---|---|---|
P0 | 14,516.1 | 14,644.4 | 0.8761 |
P50 | 15,647.6 | 15,893.2 | 1.54531 |
P100 | 18,931.2 | 19,152.6 | 1.15598 |
The server side metrics for these tests such as CPU, Memory, Threadpool utilization, Garbage collections are all identical in trend and similar in numbers. Some of the differences in the tests can be attributed to client side metric collections. Please let me know if there are any questions related to these runs.
On the other hand, for the one possible optimization found (as covered in the comment above about #1589), I have already raised a PR. It will further reduce one CPU computation in the indexing flow. Although it doesn't look like a blocker, we can get that change in as well if needed.
Since the previous tests were done against different distributions, it was not an equivalent setup.
Specifically, what distributions did we run this against the first time? Sounds like there was a distribution that was 20% better?
@dblock, what I understood about the previous setup from @mch2, which reported the 20% degradation, is that one test was run by checking out the branch locally and building the distribution after reverting the commit, while the other distribution was downloaded directly from the archive.
So this time we used the same setup branch, which was checked out locally for both the tests, while one had the commit reverted.
So this time we used the same setup branch, which was checked out locally for both the tests, while one had the commit reverted.
So that we can verify using the official build process (matching JDKs, etc.), we're going to run the benchmarks again but with the CI-built tarballs instead of local checkouts.
Here are measurements back for the entire test matrix between 1.1 and 1.2 with the distribution, with a single shard.
Query Latency
I can confidently say that the previous comparisons around query latency were overblown; while there is drift in performance, single shard vs. multi-shard had a dramatic impact and the numbers are within +-20%. For the clearest confidence these tests should be run with a 5-shard setup, which we will schedule.
Indexing Latency
For 3 of 4 tests, OpenSearch 1.2 performance dropped, but for x64 with the security plugin enabled we saw an improvement. This is not conclusive, and I would recommend more measurement focused around the OpenSearch core scenarios where there could be more risk and a smaller delta.
 | P50 | P90 | P99 |
---|---|---|---|
x64 Disabled | 1% | 1% | 4% |
arm64 Disabled | 2% | 2% | 6% |
arm64 Enabled | 5% | 5% | 10% |
x64 Enabled | -9% | -8% | -6% |
Following up on where we are at with the overall 1.1 vs 1.2 comparisons, we have the additional run data with 5 shards instead of 1 shard from the performance test. This netted a large reduction in ALL latency numbers; how we shard can have dramatic impacts on performance.
Shard Count | 1 | 5 |
---|---|---|
x64 Disabled | 3221 | 734.9 |
arm64 Disabled | 2494.5 | 539.4 |
arm64 Enabled | 2846.6 | 675 |
x64 Enabled | 3250.3 | 776.2 |
Reviewing the indexing performance numbers showed the same pattern as seen previously, with more latency seen on arm64 in 1.2, an improvement in x64 enabled, and results within the error rate for x64 disabled.
When we ran the build-over-build tests, we had a lot of data and no 'bright line' distinction for a bad change getting introduced. We will be conducting another run, but instead focusing on arm64 enabled, where there is the larger distinction between 1.1 and 1.2. Only if there is clear data will I post findings.
OpenSearch 1.2 Indexing Backpressure Removed vs Normal Distribution
Looking over these results, they are within the 5% error percentage, with no 20% outliers. This aligns with the earlier findings indicating that the presence of Indexing Backpressure does not have a significant impact on overall performance. 🍾 @getsaurabh02 @nknize
P50 | P90 | P99 | |
---|---|---|---|
x64-disable | 8.5 | 14.3 | -5.1 |
x64-enabled | 4.5 | 23.1 | 17.7 |
arm64-enable | 20.7 | 29.1 | 45 |
arm64-disable | -3.3 | -13 | 72.8 |
P50 | P90 | P99 | |
---|---|---|---|
x64-disable | 2% | 2% | 0% |
x64-enabled | 1% | 3% | 1% |
arm64-enable | 4% | 4% | 3% |
arm64-disable | -1% | -2% | 6% |
Excellent news @peternied!!!
So to summarize, the initially reported 20% indexing degradation was the result of a difference in the benchmarking configuration between the 1.1 and 1.2 builds? The initial issue above indicates the 1.2.0 distro was built from CI. Were we comparing that distro build against a 1.1.0 built from a local branch checkout, thereby leading to different JVM and machine configs?
Many thanks to all that put a huge effort into verifying this performance! @peternied @mch2 @getsaurabh02 @Bukhtawar Looking forward to getting ongoing benchmarking stood up to catch any potential issues early!
So to summarize, the initially reported 20% indexing degradation was the result of a difference in the benchmarking configuration between the 1.1 and 1.2 builds?
Yes, with a slight correction: it was between a 1.2 min distribution built by our CI vs. a local build of the performance/1.2-regression branch.
This issue was ultimately determined to be a configuration mismatch between the 1.1.0 and 1.2.0 performance testing setups. Closing this as we have shipped 1.2.0.
We are seeing 10%+ regression across the board compared to the OpenSearch 1.1.0 release.
Thanks to @mch2 for these numbers
Performance test data is available on https://github.com/opensearch-project/opensearch-build/issues/963; please review and create any issues if follow-up is needed.
Builds under test: