nasa / opera-sds-pcm

Observational Products for End-Users from Remote Sensing Analysis (OPERA)
Apache License 2.0

[Bug]: Failing jobs and Mozart CPU max-out observed during ORT #187

Closed. philipjyoon closed this issue 1 year ago

philipjyoon commented 2 years ago

Checked for duplicates

Yes - I've already checked

Describe the bug

While running ORT we noticed that hundreds of data ingest and state-config jobs were failing after the system caught up on downloads. This seemed to be highly correlated with the download workers for some reason. We lowered the hls_download worker count from 10 to 5, then to 2, and eventually to 1. Once it was brought down to 1, CPU usage on the Mozart EC2 instance dropped below 80% and we stopped getting failed jobs.
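For reference, scaling a worker fleet like hls_download up or down is done against its auto-scaling group. A minimal boto3 sketch, with a hypothetical ASG name (the actual change during ORT may have been made through the AWS console or HySDS configuration instead):

# Hypothetical sketch: scale the hls_download auto-scaling group down with boto3.
# The ASG name below is an assumption, not the verified resource name.
import boto3

asg_name = "opera-ops-hls_download"   # assumed name; substitute the real ASG
desired = 1                           # the value we eventually settled on during ORT

client = boto3.client("autoscaling")
client.set_desired_capacity(
    AutoScalingGroupName=asg_name,
    DesiredCapacity=desired,
    HonorCooldown=False,              # apply immediately rather than waiting out cooldown
)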

Looking at top on the Mozart box, ES was using 800-1000% CPU, 34 GB of resident memory, and 54 GB of total memory. In addition, a dozen or so Celery processes were together using another 150-200% CPU.

We did not observe this high ES load in the 7-day 1x load test, which used rc2. A couple of differences between the ORT and the load test:

  1. During ORT, on one day (I believe August 5, 2022), we received about a day's worth of data from CMR within one hour, which piled up the downloads.
  2. During ORT, one of the ES hls_download documents had an empty s3_url value, which caused all download jobs to fail continuously. This went on for a day or so before it was fixed, and it created a very large backlog of download jobs, I believe in the low 1000s.
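A defensive check on s3_url before submitting download jobs would have caught the second condition earlier. A minimal sketch; the document fields and helper name are illustrative, not the actual opera-pcm code:

# Hypothetical guard against hls_download documents with a missing s3_url.
def filter_downloadable(docs):
    """Split ES hls_download docs into (downloadable, bad) based on s3_url."""
    downloadable, bad = [], []
    for doc in docs:
        if doc.get("s3_url"):          # skip empty string or missing field
            downloadable.append(doc)
        else:
            bad.append(doc)
    return downloadable, bad

docs = [{"granule_id": "HLS.L30.EXAMPLE-A", "s3_url": ""},
        {"granule_id": "HLS.L30.EXAMPLE-B", "s3_url": "s3://bucket/key.tif"}]
good, bad = filter_downloadable(docs)
# "bad" docs could be flagged for operator attention instead of failing every download job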

What did you expect?

I expected the system to process all jobs without failure.

Reproducible steps

It's hard to say whether this bug is reproducible, or how. But the way we got the system into such a state is documented in PRS#1-3 here:
https://wiki.jpl.nasa.gov/pages/viewpage.action?pageId=596958066

Environment

1.0.0-rc.3.0
philipjyoon commented 2 years ago

[screenshots attached]

philipjyoon commented 2 years ago

I looked at the INT FWD cluster, which runs 1.0.0-rc.3.0 and was used for INT testing. The CPU usage on Mozart there seems normal after the system has run through all the data. One difference is that this cluster doesn't have nearly as many granules (and therefore jobs) as POP1.

[screenshots attached]

philipjyoon commented 2 years ago

Deployed rc3 on a personal dev cluster and ran download query jobs covering Aug 4, 2022 to Aug 9, 2022, one window right after another, like so:

/export/home/hysdsops/mozart/bin/python /export/home/hysdsops/mozart/ops/opera-pcm/data_subscriber/daac_data_subscriber.py query -s 2022-04-08T00:00:00Z -e 2022-08-05T00:00:00Z -c HLSL30 -p LPCLOUD -i opera-dev-isl-fwd-pyoon --release-version=1.0.0-rc.3.0 --job-queue=opera-job_worker-hls_data_download --chunk-size=80

/export/home/hysdsops/mozart/bin/python /export/home/hysdsops/mozart/ops/opera-pcm/data_subscriber/daac_data_subscriber.py query -s 2022-08-05T00:00:00Z -e 2022-08-06T00:00:00Z -c HLSL30 -p LPCLOUD -i opera-dev-isl-fwd-pyoon --release-version=1.0.0-rc.3.0 --job-queue=opera-job_worker-hls_data_download --chunk-size=80

/export/home/hysdsops/mozart/bin/python /export/home/hysdsops/mozart/ops/opera-pcm/data_subscriber/daac_data_subscriber.py query -s 2022-08-06T00:00:00Z -e 2022-08-07T00:00:00Z -c HLSL30 -p LPCLOUD -i opera-dev-isl-fwd-pyoon --release-version=1.0.0-rc.3.0 --job-queue=opera-job_worker-hls_data_download --chunk-size=80

/export/home/hysdsops/mozart/bin/python /export/home/hysdsops/mozart/ops/opera-pcm/data_subscriber/daac_data_subscriber.py query -s 2022-08-07T00:00:00Z -e 2022-08-08T00:00:00Z -c HLSL30 -p LPCLOUD -i opera-dev-isl-fwd-pyoon --release-version=1.0.0-rc.3.0 --job-queue=opera-job_worker-hls_data_download --chunk-size=80

/export/home/hysdsops/mozart/bin/python /export/home/hysdsops/mozart/ops/opera-pcm/data_subscriber/daac_data_subscriber.py query -s 2022-08-08T00:00:00Z -e 2022-08-09T00:00:00Z -c HLSL30 -p LPCLOUD -i opera-dev-isl-fwd-pyoon --release-version=1.0.0-rc.3.0 --job-queue=opera-job_worker-hls_data_download --chunk-size=80
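A hypothetical sketch of issuing the same back-to-back daily query jobs from a loop; the paths and arguments simply mirror the commands above:

# Sketch: run the query job for each one-day window back to back.
import subprocess
from datetime import datetime, timedelta

python = "/export/home/hysdsops/mozart/bin/python"
script = "/export/home/hysdsops/mozart/ops/opera-pcm/data_subscriber/daac_data_subscriber.py"

start = datetime(2022, 8, 4)
for day in range(5):                          # windows covering 2022-08-04 through 2022-08-09
    s = start + timedelta(days=day)
    e = s + timedelta(days=1)
    subprocess.run([
        python, script, "query",
        "-s", s.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "-e", e.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "-c", "HLSL30", "-p", "LPCLOUD",
        "-i", "opera-dev-isl-fwd-pyoon",
        "--release-version=1.0.0-rc.3.0",
        "--job-queue=opera-job_worker-hls_data_download",
        "--chunk-size=80",
    ], check=True)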

Initially I was getting 200-300% CPU on the ES process on Mozart, with ASG settings of 10 download, 70 ingest, 90 state-config, and 500 sciflo workers.

An hour or so later I increased ingest to 400 and state-config to 500. Very soon after that I started to see 800-1000% CPU usage for the Mozart ES process, while the Celery workers stayed under 1% CPU.

[screenshots attached]

philipjyoon commented 2 years ago

At the time of the screenshots above, not all of the job workers had come online yet. Another 20-30 minutes later, when all the job workers had come online, I started to see the Celery processes using 1-4% CPU, instead of the less-than-1% seen about an hour earlier. Then I finally started to see some ingest jobs failing with the exception "AttributeError: 'ChannelPromise' object has no attribute 'value'". This is exactly what we observed during ORT.
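The ChannelPromise AttributeError appears to come from kombu's connection handling when the broker connection is lost or cannot be re-established under load, rather than from our job code. Purely as an illustration of where that behavior is tuned (standard Celery configuration keys, not a verified fix for this issue):

# Illustrative Celery settings (e.g. in a celeryconfig module) that govern
# broker connection pooling and retries; values shown are only examples.
broker_pool_limit = 10                 # max broker connections kept in the pool
broker_connection_retry = True         # retry the broker connection on failure
broker_connection_max_retries = 100    # give up after this many attempts
broker_heartbeat = 60                  # detect dead broker connections sooner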

Figaro Overview [screenshot]

Start of job failures [screenshot]

Mozart EC2 CPU utilization [screenshot]

Mozart top [screenshot]

philipjyoon commented 2 years ago

The job failures seem to be correlated with the Celery workers using more CPU. I also observed a frantic pace of garbage collection events on the Mozart ES:

(mozart) hysdsops@ip-100-104-40-207:/var/log/elasticsearch$ cat resource_cluster.log [2022-08-16T00:13:05,090][INFO ][o.e.n.Node ] [master] version[7.9.3], pid[1931], build[default/rpm/c4138e51121ef06a6404866cddc601906fe5c868/2020-10-16T10:36:16.141335Z], OS[Linux/4.18.0-372.9.1.el8.x86_64/amd64], JVM[Red Hat, Inc./OpenJDK 64-Bit Server VM/11.0.15/11.0.15+10-LTS] [2022-08-16T00:13:05,102][INFO ][o.e.n.Node ] [master] JVM home [/usr/lib/jvm/java-11-openjdk-11.0.15.0.10-2.el8_6.x86_64] [2022-08-16T00:13:05,103][INFO ][o.e.n.Node ] [master] JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=/tmp/elasticsearch-7617265112103408943, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Dlog4j2.formatMsgNoLookups=true, -Xms31g, -Xmx31g, -XX:MaxDirectMemorySize=16642998272, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=rpm, -Des.bundled_jdk=true] [2022-08-16T00:13:20,958][INFO ][o.e.p.PluginsService ] [master] loaded module [aggs-matrix-stats] [2022-08-16T00:13:20,959][INFO ][o.e.p.PluginsService ] [master] loaded module [analysis-common] [2022-08-16T00:13:20,959][INFO ][o.e.p.PluginsService ] [master] loaded module [constant-keyword] [2022-08-16T00:13:20,959][INFO ][o.e.p.PluginsService ] [master] loaded module [flattened] [2022-08-16T00:13:20,959][INFO ][o.e.p.PluginsService ] [master] loaded module [frozen-indices] [2022-08-16T00:13:20,959][INFO ][o.e.p.PluginsService ] [master] loaded module [ingest-common] [2022-08-16T00:13:20,959][INFO ][o.e.p.PluginsService ] [master] loaded module [ingest-geoip] [2022-08-16T00:13:20,960][INFO ][o.e.p.PluginsService ] [master] loaded module [ingest-user-agent] [2022-08-16T00:13:20,960][INFO ][o.e.p.PluginsService ] [master] loaded module [kibana] [2022-08-16T00:13:20,960][INFO ][o.e.p.PluginsService ] [master] loaded module [lang-expression] [2022-08-16T00:13:20,960][INFO ][o.e.p.PluginsService ] [master] loaded module [lang-mustache] [2022-08-16T00:13:20,960][INFO ][o.e.p.PluginsService ] [master] loaded module [lang-painless] [2022-08-16T00:13:20,960][INFO ][o.e.p.PluginsService ] [master] loaded module [mapper-extras] [2022-08-16T00:13:20,961][INFO ][o.e.p.PluginsService ] [master] loaded module [parent-join] [2022-08-16T00:13:20,961][INFO ][o.e.p.PluginsService ] [master] loaded module [percolator] [2022-08-16T00:13:20,961][INFO ][o.e.p.PluginsService ] [master] loaded module [rank-eval] [2022-08-16T00:13:20,961][INFO ][o.e.p.PluginsService ] [master] loaded module [reindex] [2022-08-16T00:13:20,961][INFO ][o.e.p.PluginsService ] [master] loaded module [repository-url] [2022-08-16T00:13:20,961][INFO ][o.e.p.PluginsService ] [master] loaded module [search-business-rules] [2022-08-16T00:13:20,961][INFO ][o.e.p.PluginsService ] [master] loaded module 
[searchable-snapshots] [2022-08-16T00:13:20,961][INFO ][o.e.p.PluginsService ] [master] loaded module [spatial] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [systemd] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [tasks] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [transform] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [transport-netty4] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [vectors] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [wildcard] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-analytics] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-async] [2022-08-16T00:13:20,962][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-async-search] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-autoscaling] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-ccr] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-core] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-data-streams] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-deprecation] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-enrich] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-eql] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-graph] [2022-08-16T00:13:20,963][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-identity-provider] [2022-08-16T00:13:20,964][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-ilm] [2022-08-16T00:13:20,964][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-logstash] [2022-08-16T00:13:20,964][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-ml] [2022-08-16T00:13:20,964][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-monitoring] [2022-08-16T00:13:20,964][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-ql] [2022-08-16T00:13:20,964][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-rollup] [2022-08-16T00:13:20,964][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-security] [2022-08-16T00:13:20,964][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-sql] [2022-08-16T00:13:20,965][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-stack] [2022-08-16T00:13:20,965][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-voting-only-node] [2022-08-16T00:13:20,965][INFO ][o.e.p.PluginsService ] [master] loaded module [x-pack-watcher] [2022-08-16T00:13:20,965][INFO ][o.e.p.PluginsService ] [master] loaded plugin [repository-s3] [2022-08-16T00:13:21,115][INFO ][o.e.e.NodeEnvironment ] [master] using [1] data paths, mounts [[/ (/dev/nvme0n1p1)]], net usable_space [83.5gb], net total_space [99.9gb], types [xfs] [2022-08-16T00:13:21,115][INFO ][o.e.e.NodeEnvironment ] [master] heap size [30.8gb], compressed ordinary object pointers [true] [2022-08-16T00:13:21,348][INFO ][o.e.n.Node ] [master] node name [master], node ID [mABvbhdtTZOIwH1H5Ld1lQ], cluster name [resource_cluster] [2022-08-16T00:13:24,437][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [master] [controller/6286] [Main.cc@114] controller (64 bit): 
Version 7.9.3 (Build 6c27059cb8397a) Copyright (c) 2020 Elasticsearch BV [2022-08-16T00:13:25,016][INFO ][o.e.x.s.a.s.FileRolesStore] [master] parsed [0] roles from file [/etc/elasticsearch/roles.yml] [2022-08-16T00:13:26,718][INFO ][o.e.t.NettyAllocator ] [master] creating NettyAllocator with the following configs: [name=elasticsearch_configured, chunk_size=1mb, factors={es.unsafe.use_netty_default_chunk_and_page_size=false, g1gc_enabled=false, g1gc_region_size=0b}] [2022-08-16T00:13:26,807][INFO ][o.e.d.DiscoveryModule ] [master] using discovery type [zen] and seed hosts providers [settings] [2022-08-16T00:13:27,255][WARN ][o.e.g.DanglingIndicesState] [master] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually [2022-08-16T00:13:27,672][INFO ][o.e.n.Node ] [master] initialized [2022-08-16T00:13:27,672][INFO ][o.e.n.Node ] [master] starting ...

[2022-08-16T00:13:28,016][INFO ][o.e.b.BootstrapChecks ] [master] bound or publishing to a non-loopback address, enforcing bootstrap checks [2022-08-16T00:13:28,036][INFO ][o.e.c.c.Coordinator ] [master] cluster UUID [g_ZIhrwdQmuuvXxOuF7KIg] [2022-08-16T00:13:28,192][INFO ][o.e.c.s.MasterService ] [master] elected-as-master ([1] nodes joined)[{master}{mABvbhdtTZOIwH1H5Ld1lQ}{DyB6G-4kTROoJE7rvWNJgw}{100.104.40.207}{100.104.40.207:9300}{dilmrt}{ml.machine_memory=133362839552, xpack.installed=true, transform.node=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTERTASK, _FINISHELECTION], term: 2, version: 36, delta: master node changed {previous [], current [{master}{mABvbhdtTZOIwH1H5Ld1lQ}{DyB6G-4kTROoJE7rvWNJgw}{100.104.40.207}{100.104.40.207:9300}{dilmrt}{ml.machine_memory=133362839552, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]} [2022-08-16T00:13:28,239][INFO ][o.e.c.s.ClusterApplierService] [master] master node changed {previous [], current [{master}{mABvbhdtTZOIwH1H5Ld1lQ}{DyB6G-4kTROoJE7rvWNJgw}{100.104.40.207}{100.104.40.207:9300}{dilmrt}{ml.machine_memory=133362839552, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]}, term: 2, version: 36, reason: Publication{term=2, version=36}

[2022-08-16T00:13:28,281][INFO ][o.e.n.Node ] [master] started [2022-08-16T00:13:28,558][INFO ][o.e.l.LicenseService ] [master] license [f14ea223-1881-42c4-9f23-9fdd10caca18] mode [basic] - valid [2022-08-16T00:13:28,561][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [master] Active license is now [BASIC]; Security is disabled [2022-08-16T00:13:28,571][INFO ][o.e.g.GatewayService ] [master] recovered [0] indices into cluster_state [2022-08-16T00:18:41,936][WARN ][o.e.c.m.MetadataIndexTemplateService] [master] legacy template [index_defaults] has index patterns [] matching patterns from existing composable templates [metrics,logs] with patterns (metrics => [metrics--],logs => [logs--]); this template [index_defaults] may be ignored in favor of a composable template at index creation time [2022-08-16T00:18:41,937][INFO ][o.e.c.m.MetadataIndexTemplateService] [master] adding template [index_defaults] for index patterns [] [2022-08-16T00:19:13,560][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [user_rules-mozart] creating index, cause [api], templates [index_defaults], shards [8]/[1] [2022-08-16T00:19:15,341][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [hysds_ios-mozart] creating index, cause [api], templates [index_defaults], shards [8]/[1] [2022-08-16T00:19:20,310][INFO ][o.e.c.m.MetadataIndexTemplateService] [master] adding template [containers] for index patterns [containers] [2022-08-16T00:19:20,350][INFO ][o.e.c.m.MetadataIndexTemplateService] [master] adding template [job_specs] for index patterns [job_specs] [2022-08-16T00:19:20,399][INFO ][o.e.c.m.MetadataIndexTemplateService] [master] adding template [hysds_ios] for index patterns [hysds_ios] [2022-08-16T00:20:03,526][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [user_rules-grq] creating index, cause [api], templates [index_defaults], shards [8]/[1] [2022-08-16T00:20:04,903][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [hysds_ios-grq] creating index, cause [api], templates [index_defaults], shards [8]/[1] [2022-08-16T00:23:08,944][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [containers] creating index, cause [auto(bulk api)], templates [index_defaults, containers], shards [8]/[1] [2022-08-16T00:23:09,207][INFO ][o.e.c.m.MetadataMappingService] [master] [containers/OVlIBqhESt6GRXpVDU1K9g] update_mapping [_doc] [2022-08-16T00:23:09,261][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [job_specs] creating index, cause [auto(bulk api)], templates [index_defaults, job_specs], shards [8]/[1] [2022-08-16T00:23:09,522][INFO ][o.e.c.m.MetadataMappingService] [master] [job_specs/QlhzMV3HQDSaFIEJ06i_ow] update_mapping [_doc] [2022-08-16T00:23:09,573][INFO ][o.e.c.m.MetadataMappingService] [master] [job_specs/QlhzMV3HQDSaFIEJ06i_ow] update_mapping [_doc] [2022-08-16T00:23:09,698][INFO ][o.e.c.m.MetadataMappingService] [master] [job_specs/QlhzMV3HQDSaFIEJ06i_ow] update_mapping [_doc] [2022-08-16T00:24:23,998][INFO ][o.e.c.m.MetadataIndexTemplateService] [master] adding template [task_status] for index patterns [task_status] [2022-08-16T00:24:24,135][INFO ][o.e.c.m.MetadataIndexTemplateService] [master] adding template [job_status] for index patterns [job_status] [2022-08-16T00:24:24,178][INFO ][o.e.c.m.MetadataIndexTemplateService] [master] adding template [event_status] for index patterns [event_status] [2022-08-16T00:24:24,235][INFO ][o.e.c.m.MetadataIndexTemplateService] [master] adding template [worker_status] for index patterns [worker_status] [2022-08-16T00:24:26,573][INFO 
][o.e.c.m.MetadataCreateIndexService] [master] [worker_status-current] creating index, cause [auto(bulk api)], templates [worker_status, index_defaults], shards [8]/[1] [2022-08-16T00:24:26,723][INFO ][o.e.c.m.MetadataMappingService] [master] [worker_status-current/eCfOTVIlRh6_so2B4rNJVQ] update_mapping [_doc] [2022-08-16T00:28:06,708][INFO ][o.e.r.RepositoriesService] [master] put repository [snapshot-repository] [2022-08-16T00:28:07,077][WARN ][c.a.s.s.i.UseArnRegionResolver] [master] Unable to load config file null java.security.AccessControlException: access denied ("java.io.FilePermission" "/nonexistent/.aws/config" "read") at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:?] at java.security.AccessController.checkPermission(AccessController.java:897) ~[?:?] at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?] at java.lang.SecurityManager.checkRead(SecurityManager.java:661) ~[?:?] at java.io.File.exists(File.java:826) ~[?:?] at com.amazonaws.profile.path.config.SharedConfigDefaultLocationProvider.getLocation(SharedConfigDefaultLocationProvider.java:36) ~[aws-java-sdk-core-1.11.749.jar:?] at com.amazonaws.profile.path.AwsProfileFileLocationProviderChain.getLocation(AwsProfileFileLocationProviderChain.java:41) ~[aws-java-sdk-core-1.11.749.jar:?] at com.amazonaws.services.s3.internal.UseArnRegionResolver.getProfilesConfigFile(UseArnRegionResolver.java:110) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.internal.UseArnRegionResolver.getProfile(UseArnRegionResolver.java:96) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.internal.UseArnRegionResolver.profile(UseArnRegionResolver.java:76) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.internal.UseArnRegionResolver.resolveUseArnRegion(UseArnRegionResolver.java:64) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.internal.UseArnRegionResolver.(UseArnRegionResolver.java:53) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.internal.UseArnRegionResolver.(UseArnRegionResolver.java:48) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.AmazonS3Client.(AmazonS3Client.java:435) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.AmazonS3Builder$1.apply(AmazonS3Builder.java:35) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.AmazonS3Builder$1.apply(AmazonS3Builder.java:32) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.AmazonS3ClientBuilder.build(AmazonS3ClientBuilder.java:64) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.services.s3.AmazonS3ClientBuilder.build(AmazonS3ClientBuilder.java:28) [aws-java-sdk-s3-1.11.749.jar:?] at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46) [aws-java-sdk-core-1.11.749.jar:?] at java.security.AccessController.doPrivileged(Native Method) [?:?] 
at org.elasticsearch.repositories.s3.SocketAccess.doPrivileged(SocketAccess.java:42) [repository-s3-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.s3.S3Service.buildClient(S3Service.java:164) [repository-s3-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.s3.S3Service.client(S3Service.java:96) [repository-s3-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.s3.S3BlobStore.clientReference(S3BlobStore.java:125) [repository-s3-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.s3.S3BlobContainer.executeSingleUpload(S3BlobContainer.java:355) [repository-s3-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$writeBlob$1(S3BlobContainer.java:131) [repository-s3-7.9.3.jar:7.9.3] at java.security.AccessController.doPrivileged(Native Method) [?:?] at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedIOException(SocketAccess.java:48) [repository-s3-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:129) [repository-s3-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlobAtomic(S3BlobContainer.java:146) [repository-s3-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.blobstore.BlobStoreRepository.startVerification(BlobStoreRepository.java:1202) [elasticsearch-7.9.3.jar:7.9.3] at org.elasticsearch.repositories.RepositoriesService$3.doRun(RepositoriesService.java:252) [elasticsearch-7.9.3.jar:7.9.3] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.9.3.jar:7.9.3] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.9.3.jar:7.9.3] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 
[2022-08-16T00:28:10,254][INFO ][o.e.x.s.a.TransportPutSnapshotLifecycleAction] [master] adding new snapshot lifecycle [hourly-snapshot] [2022-08-16T00:28:10,285][INFO ][o.e.x.s.SnapshotLifecycleService] [master] scheduling snapshot lifecycle job [hourly-snapshot-1] [2022-08-16T00:38:01,922][INFO ][o.e.c.m.MetadataMappingService] [master] [job_specs/QlhzMV3HQDSaFIEJ06i_ow] update_mapping [_doc] [2022-08-16T00:38:02,180][INFO ][o.e.c.m.MetadataMappingService] [master] [job_specs/QlhzMV3HQDSaFIEJ06i_ow] update_mapping [_doc] [2022-08-16T00:38:02,438][INFO ][o.e.c.m.MetadataMappingService] [master] [job_specs/QlhzMV3HQDSaFIEJ06i_ow] update_mapping [_doc] [2022-08-16T00:38:04,362][INFO ][o.e.c.m.MetadataMappingService] [master] [job_specs/QlhzMV3HQDSaFIEJ06i_ow] update_mapping [_doc] [2022-08-16T00:42:10,075][INFO ][o.e.c.m.MetadataDeleteIndexService] [master] [user_rules-grq/kwH2I36lQXeyFaYvNcn52g] deleting index [2022-08-16T00:42:10,154][INFO ][o.e.c.m.MetadataDeleteIndexService] [master] [user_rules-mozart/rmimwxKNSFytKpc_rE-coQ] deleting index [2022-08-16T00:42:12,672][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [user_rules-grq] creating index, cause [api], templates [index_defaults], shards [8]/[1] [2022-08-16T00:42:12,746][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [user_rules-mozart] creating index, cause [api], templates [index_defaults], shards [8]/[1] [2022-08-16T00:42:18,287][INFO ][o.e.c.m.MetadataMappingService] [master] [user_rules-mozart/EVZLIdNRTHm4Yj60d_rCfA] update_mapping [_doc] [2022-08-16T00:42:32,459][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [task_status-current] creating index, cause [auto(bulk api)], templates [task_status, index_defaults], shards [8]/[1] [2022-08-16T00:42:32,700][INFO ][o.e.c.m.MetadataMappingService] [master] [task_status-current/Wpi7bH5gST2A_PgwyaRICg] update_mapping [_doc] [2022-08-16T00:42:32,731][INFO ][o.e.c.m.MetadataMappingService] [master] [task_status-current/Wpi7bH5gST2A_PgwyaRICg] update_mapping [_doc] [2022-08-16T00:42:48,515][INFO ][o.e.c.m.MetadataMappingService] [master] [task_status-current/Wpi7bH5gST2A_PgwyaRICg] update_mapping [_doc] [2022-08-16T01:00:00,002][INFO ][o.e.x.s.SnapshotLifecycleTask] [master] snapshot lifecycle policy [hourly-snapshot] issuing create snapshot [mozart-backup-2022.08.16-mjyhj67cqls3fca8a-w3qw] [2022-08-16T01:00:00,006][INFO ][o.e.x.s.SnapshotLifecycleTask] [master] snapshot lifecycle policy job [hourly-snapshot-1] issued new snapshot creation for [mozart-backup-2022.08.16-mjyhj67cqls3fca8a-w3qw] successfully [2022-08-16T01:00:00,123][INFO ][o.e.s.SnapshotsService ] [master] snapshot [snapshot-repository:mozart-backup-2022.08.16-mjyhj67cqls3fca8a-w3qw/hnvUq7X3Sk67Oyb1QCNd5w] started [2022-08-16T01:00:10,521][INFO ][o.e.s.SnapshotsService ] [master] snapshot [snapshot-repository:mozart-backup-2022.08.16-mjyhj67cqls3fca8a-w3qw/hnvUq7X3Sk67Oyb1QCNd5w] completed with state [SUCCESS] [2022-08-16T01:00:10,552][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [.slm-history-2-000001] creating index, cause [api], templates [.slm-history], shards [1]/[0] [2022-08-16T01:00:10,588][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [.slm-history-2-000001] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [slm-history-ilm-policy] [2022-08-16T01:00:10,652][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [.slm-history-2-000001] from [{"phase":"new","action":"complete","name":"complete"}] to 
[{"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"}] in policy [slm-history-ilm-policy] [2022-08-16T01:00:10,677][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [.slm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"}] to [{"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}] in policy [slm-history-ilm-policy] [2022-08-16T01:00:11,047][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [ilm-history-2-000001] creating index, cause [api], templates [ilm-history], shards [1]/[0] [2022-08-16T01:00:11,099][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:00:11,124][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [{"phase":"new","action":"complete","name":"complete"}] to [{"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:00:11,175][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"}] to [{"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:03:28,264][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}] to [{"phase":"hot","action":"unfollow","name":"pause-follower-index"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:03:28,286][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [.slm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}] to [{"phase":"hot","action":"unfollow","name":"pause-follower-index"}] in policy [slm-history-ilm-policy] [2022-08-16T01:03:28,308][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"pause-follower-index"}] to [{"phase":"hot","action":"unfollow","name":"close-follower-index"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:03:28,328][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [.slm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"pause-follower-index"}] to [{"phase":"hot","action":"unfollow","name":"close-follower-index"}] in policy [slm-history-ilm-policy] [2022-08-16T01:03:28,347][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"close-follower-index"}] to [{"phase":"hot","action":"unfollow","name":"unfollow-follower-index"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:03:28,370][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [.slm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"close-follower-index"}] to [{"phase":"hot","action":"unfollow","name":"unfollow-follower-index"}] in policy [slm-history-ilm-policy] [2022-08-16T01:03:28,389][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"unfollow-follower-index"}] to [{"phase":"hot","action":"unfollow","name":"open-follower-index"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:03:28,408][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index 
[.slm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"unfollow-follower-index"}] to [{"phase":"hot","action":"unfollow","name":"open-follower-index"}] in policy [slm-history-ilm-policy] [2022-08-16T01:03:28,427][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"open-follower-index"}] to [{"phase":"hot","action":"unfollow","name":"wait-for-yellow-step"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:03:28,446][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [.slm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"open-follower-index"}] to [{"phase":"hot","action":"unfollow","name":"wait-for-yellow-step"}] in policy [slm-history-ilm-policy] [2022-08-16T01:03:28,466][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [ilm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"wait-for-yellow-step"}] to [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}] in policy [ilm-history-ilm-policy] [2022-08-16T01:03:28,486][INFO ][o.e.x.i.IndexLifecycleTransition] [master] moving index [.slm-history-2-000001] from [{"phase":"hot","action":"unfollow","name":"wait-for-yellow-step"}] to [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}] in policy [slm-history-ilm-policy] [2022-08-16T01:30:00,002][INFO ][o.e.x.s.SnapshotRetentionTask] [master] starting SLM retention snapshot cleanup task [2022-08-16T01:30:00,139][INFO ][o.e.x.s.SnapshotRetentionTask] [master] SLM retention snapshot cleanup task complete [2022-08-16T02:00:00,007][INFO ][o.e.x.s.SnapshotLifecycleTask] [master] snapshot lifecycle policy [hourly-snapshot] issuing create snapshot [mozart-backup-2022.08.16-pyum7eeaqcopwsxexe6ftw] [2022-08-16T02:00:00,008][INFO ][o.e.x.s.SnapshotLifecycleTask] [master] snapshot lifecycle policy job [hourly-snapshot-1] issued new snapshot creation for [mozart-backup-2022.08.16-pyum7eeaqcopwsxexe6ftw] successfully [2022-08-16T02:00:00,017][INFO ][o.e.s.SnapshotsService ] [master] snapshot [snapshot-repository:mozart-backup-2022.08.16-pyum7eeaqcopwsxexe6ftw/PdOpwNRTRp646XTkt2sR_w] started [2022-08-16T02:00:13,461][INFO ][o.e.s.SnapshotsService ] [master] snapshot [snapshot-repository:mozart-backup-2022.08.16-pyum7eeaqcopwsxexe6ftw/PdOpwNRTRp646XTkt2sR_w] completed with state [SUCCESS] [2022-08-16T02:28:10,209][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [job_status-current] creating index, cause [auto(bulk api)], templates [job_status, index_defaults], shards [8]/[1] [2022-08-16T02:28:10,457][INFO ][o.e.c.m.MetadataMappingService] [master] [job_status-current/zyzaMEQZS6K9QnFKXe3TLg] update_mapping [_doc] [2022-08-16T02:32:43,409][INFO ][o.e.c.m.MetadataMappingService] [master] [job_status-current/zyzaMEQZS6K9QnFKXe3TLg] update_mapping [_doc] [2022-08-16T02:41:47,437][INFO ][o.e.c.m.MetadataMappingService] [master] [job_status-current/zyzaMEQZS6K9QnFKXe3TLg] update_mapping [_doc] [2022-08-16T02:46:39,162][INFO ][o.e.c.m.MetadataMappingService] [master] [task_status-current/Wpi7bH5gST2A_PgwyaRICg] update_mapping [_doc] [2022-08-16T02:47:55,374][INFO ][o.e.c.m.MetadataMappingService] [master] [job_status-current/zyzaMEQZS6K9QnFKXe3TLg] update_mapping [_doc] [2022-08-16T02:47:56,137][INFO ][o.e.c.m.MetadataMappingService] [master] [job_status-current/zyzaMEQZS6K9QnFKXe3TLg] update_mapping [_doc] [2022-08-16T02:48:07,549][INFO ][o.e.c.m.MetadataCreateIndexService] [master] [event_status-current] creating index, cause 
[auto(bulk api)], templates [event_status, index_defaults], shards [8]/[1] [2022-08-16T02:48:07,795][INFO ][o.e.c.m.MetadataMappingService] [master] [event_status-current/is3n5-mXTdSgcQ3ws5owSw] update_mapping [_doc] [2022-08-16T02:55:50,361][INFO ][o.e.c.m.MetadataMappingService] [master] [job_status-current/zyzaMEQZS6K9QnFKXe3TLg] update_mapping [_doc] [2022-08-16T02:57:31,409][INFO ][o.e.c.m.MetadataMappingService] [master] [job_status-current/zyzaMEQZS6K9QnFKXe3TLg] update_mapping [_doc] [2022-08-16T03:00:00,002][INFO ][o.e.x.s.SnapshotLifecycleTask] [master] snapshot lifecycle policy [hourly-snapshot] issuing create snapshot [mozart-backup-2022.08.16-xfombasks9y6n3q59tvjog] [2022-08-16T03:00:00,004][INFO ][o.e.x.s.SnapshotLifecycleTask] [master] snapshot lifecycle policy job [hourly-snapshot-1] issued new snapshot creation for [mozart-backup-2022.08.16-xfombasks9y6n3q59tvjog] successfully [2022-08-16T03:00:00,020][INFO ][o.e.s.SnapshotsService ] [master] snapshot [snapshot-repository:mozart-backup-2022.08.16-xfombasks9y6n3q59tvjog/TzwjqnASQ7WuV9RN3-51BA] started [2022-08-16T03:00:28,844][INFO ][o.e.s.SnapshotsService ] [master] snapshot [snapshot-repository:mozart-backup-2022.08.16-xfombasks9y6n3q59tvjog/TzwjqnASQ7WuV9RN3-51BA] completed with state [SUCCESS] [2022-08-16T03:09:17,965][INFO ][o.e.c.m.MetadataMappingService] [master] [job_status-current/zyzaMEQZS6K9QnFKXe3TLg] update_mapping [_doc] [2022-08-16T03:45:14,874][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12697] overhead, spent [254ms] collecting in the last [1s] [2022-08-16T03:45:23,935][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12706] overhead, spent [288ms] collecting in the last [1s] [2022-08-16T03:45:25,943][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12708] overhead, spent [261ms] collecting in the last [1s] [2022-08-16T03:45:28,944][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12711] overhead, spent [311ms] collecting in the last [1s] [2022-08-16T03:45:29,944][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12712] overhead, spent [263ms] collecting in the last [1s] [2022-08-16T03:45:30,953][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12713] overhead, spent [282ms] collecting in the last [1s] [2022-08-16T03:45:31,953][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12714] overhead, spent [255ms] collecting in the last [1s] [2022-08-16T03:45:46,019][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12728] overhead, spent [297ms] collecting in the last [1s] [2022-08-16T03:45:49,055][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12731] overhead, spent [270ms] collecting in the last [1s] [2022-08-16T03:45:51,056][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12733] overhead, spent [281ms] collecting in the last [1s] [2022-08-16T03:45:52,065][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12734] overhead, spent [264ms] collecting in the last [1s] [2022-08-16T03:45:57,146][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12739] overhead, spent [272ms] collecting in the last [1s] [2022-08-16T03:45:58,147][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12740] overhead, spent [285ms] collecting in the last [1s] [2022-08-16T03:46:03,248][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12745] overhead, spent [330ms] collecting in the last [1s] [2022-08-16T03:46:10,360][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12752] overhead, spent [281ms] collecting in the last [1s] [2022-08-16T03:46:14,437][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12756] overhead, spent 
[337ms] collecting in the last [1s] [2022-08-16T03:46:15,437][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12757] overhead, spent [273ms] collecting in the last [1s] [2022-08-16T03:46:16,443][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12758] overhead, spent [261ms] collecting in the last [1s] [2022-08-16T03:46:20,452][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12762] overhead, spent [288ms] collecting in the last [1s] [2022-08-16T03:46:21,453][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12763] overhead, spent [278ms] collecting in the last [1s] [2022-08-16T03:46:23,460][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12765] overhead, spent [284ms] collecting in the last [1s] [2022-08-16T03:46:25,474][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12767] overhead, spent [298ms] collecting in the last [1s] [2022-08-16T03:46:26,474][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12768] overhead, spent [292ms] collecting in the last [1s] [2022-08-16T03:46:30,732][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12772] overhead, spent [395ms] collecting in the last [1.2s] [2022-08-16T03:46:31,733][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12773] overhead, spent [320ms] collecting in the last [1s] [2022-08-16T03:46:33,737][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12775] overhead, spent [254ms] collecting in the last [1s] [2022-08-16T03:46:35,744][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12777] overhead, spent [315ms] collecting in the last [1s] [2022-08-16T03:46:41,857][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12783] overhead, spent [330ms] collecting in the last [1s] [2022-08-16T03:46:43,953][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12785] overhead, spent [310ms] collecting in the last [1s] [2022-08-16T03:46:44,954][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12786] overhead, spent [283ms] collecting in the last [1s] [2022-08-16T03:46:45,955][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12787] overhead, spent [297ms] collecting in the last [1s] [2022-08-16T03:46:46,956][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12788] overhead, spent [360ms] collecting in the last [1s] [2022-08-16T03:46:47,981][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12789] overhead, spent [276ms] collecting in the last [1s] [2022-08-16T03:46:49,987][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12791] overhead, spent [328ms] collecting in the last [1s] [2022-08-16T03:46:51,008][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12792] overhead, spent [328ms] collecting in the last [1s] [2022-08-16T03:46:52,073][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12793] overhead, spent [300ms] collecting in the last [1s] [2022-08-16T03:46:53,079][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12794] overhead, spent [389ms] collecting in the last [1s] [2022-08-16T03:46:55,083][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12796] overhead, spent [283ms] collecting in the last [1s] [2022-08-16T03:46:56,086][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12797] overhead, spent [353ms] collecting in the last [1s] [2022-08-16T03:46:59,141][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12800] overhead, spent [271ms] collecting in the last [1s] [2022-08-16T03:47:00,142][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12801] overhead, spent [281ms] collecting in the last [1s] [2022-08-16T03:47:02,143][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12803] overhead, spent [301ms] collecting in the last [1s] 
[2022-08-16T03:47:04,146][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12805] overhead, spent [315ms] collecting in the last [1s] [2022-08-16T03:47:05,153][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12806] overhead, spent [313ms] collecting in the last [1s] [2022-08-16T03:47:06,198][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12807] overhead, spent [285ms] collecting in the last [1s] [2022-08-16T03:47:07,216][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12808] overhead, spent [330ms] collecting in the last [1s] [2022-08-16T03:47:13,283][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12814] overhead, spent [398ms] collecting in the last [1s] [2022-08-16T03:47:16,310][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12817] overhead, spent [357ms] collecting in the last [1s] [2022-08-16T03:47:18,379][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12819] overhead, spent [317ms] collecting in the last [1s] [2022-08-16T03:47:19,415][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12820] overhead, spent [267ms] collecting in the last [1s] [2022-08-16T03:47:20,416][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12821] overhead, spent [336ms] collecting in the last [1s] [2022-08-16T03:47:22,418][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12823] overhead, spent [355ms] collecting in the last [1s] [2022-08-16T03:47:26,540][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12827] overhead, spent [291ms] collecting in the last [1s] [2022-08-16T03:47:27,541][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12828] overhead, spent [320ms] collecting in the last [1s] [2022-08-16T03:47:29,594][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12830] overhead, spent [363ms] collecting in the last [1s] [2022-08-16T03:47:31,614][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12832] overhead, spent [291ms] collecting in the last [1s] [2022-08-16T03:47:33,624][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12834] overhead, spent [280ms] collecting in the last [1s] [2022-08-16T03:47:35,631][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12836] overhead, spent [294ms] collecting in the last [1s] [2022-08-16T03:47:38,663][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12839] overhead, spent [285ms] collecting in the last [1s] [2022-08-16T03:47:39,678][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12840] overhead, spent [395ms] collecting in the last [1s] [2022-08-16T03:47:41,680][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12842] overhead, spent [269ms] collecting in the last [1s] [2022-08-16T03:47:42,681][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12843] overhead, spent [263ms] collecting in the last [1s] [2022-08-16T03:47:44,682][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12845] overhead, spent [263ms] collecting in the last [1s] [2022-08-16T03:47:47,777][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12848] overhead, spent [318ms] collecting in the last [1s] [2022-08-16T03:47:49,780][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12850] overhead, spent [265ms] collecting in the last [1s] [2022-08-16T03:47:50,787][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12851] overhead, spent [271ms] collecting in the last [1s] [2022-08-16T03:47:52,795][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12853] overhead, spent [325ms] collecting in the last [1s] [2022-08-16T03:47:53,845][INFO ][o.e.m.j.JvmGcMonitorService] [master] [gc][12854] overhead, spent [281ms] collecting in the last [1s] [2022-08-16T03:47:54,855][INFO 
][o.e.m.j.JvmGcMonitorService] [master] [gc][12855] overhead, spent [317ms] collecting in the last [1s]
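To quantify the GC pressure beyond these log messages, heap usage and collector counts can be pulled from the ES nodes stats API. A minimal sketch, assuming ES is listening on the default http://localhost:9200:

# Sketch: pull JVM heap and GC statistics from the local ES node.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9200/_nodes/stats/jvm") as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    jvm = node["jvm"]
    heap_used = jvm["mem"]["heap_used_percent"]
    young_gc = jvm["gc"]["collectors"]["young"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print(f"{node['name']}: heap {heap_used}% used, "
          f"young GC count {young_gc['collection_count']}, "
          f"old GC count {old_gc['collection_count']}")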

philipjyoon commented 2 years ago

Garbage collection is happening even though both the resident and total memory usage of ES haven't changed much. These values are also lower than what we saw during ORT, which used up to 52 GB of total memory. So the ES process could clearly allocate roughly 10 GB more memory, but instead it has chosen to GC frantically. Note that the JVM arguments in the log above pin the heap at -Xms31g/-Xmx31g, so the heap itself cannot grow beyond 31 GB regardless of how much memory is free on the box.

philipjyoon commented 2 years ago

On the ORT machine I'm seeing 30%+ CPU for the Celery workers with very little load.

[screenshot attached]

philipjyoon commented 1 year ago

This issue is now well understood and has been mitigated to the point where it is no longer a problem. We've upgraded the Mozart EC2 instance size, tuned ES so that it uses fewer resources, and reduced the number of jobs processed per granule.