Open moooofly opened 5 years ago
Conclusion: the root cause is that td-agent's log-processing throughput during peak hours is far below Kong's QPS, so peak-hour logs were rotated away before td-agent could read them.
Underlying detail:
For example, suppose the log file currently being written is kong.log; rotated copies are renamed kong.log.1 and so on. td-agent (presumably) only tails kong.log, so once log volume spikes, rotation necessarily happens more often. At the same time, processing the logs costs td-agent a lot of CPU, and at high volume it falls behind. The likely failure mode: before td-agent finishes the current kong.log, rotation is triggered; td-agent switches to the new kong.log, and part of the old kong.log is never processed.
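If this rotation-race hypothesis is correct, one common mitigation is to raise in_tail's rotate_wait so td-agent keeps the rotated file handle open long enough to drain it. A minimal config sketch; the path, pos_file, tag, and log format are assumptions, not the actual deployment config:

```
<source>
  @type tail
  path /var/log/kong/kong.log            # assumed log path
  pos_file /var/log/td-agent/kong.log.pos
  tag kong.access
  rotate_wait 60                         # keep reading the rotated file for 60s (default 5s)
  <parse>
    @type json
  </parse>
</source>
```

This only widens the window; if td-agent's sustained throughput is below peak QPS, the backlog still grows and the CPU bottleneck has to be addressed separately.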
On http://hk.llsops.com/app/kibana#/home?_g=() you can see that peak-hour log volume is being "flattened";
To spot this "flattening", the time granularity must be small; at coarse granularity it is easy to miss;
Currently, querying 7 days of data is already a struggle and will very likely time out;
Queries running longer than 30s are cut off:
Discover: Request Timeout after 30000ms
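The 30000 ms cutoff matches Kibana's default elasticsearch.requestTimeout. If longer queries are genuinely needed, it can be raised in kibana.yml; the value below is illustrative:

```
# kibana.yml
elasticsearch.requestTimeout: 120000   # ms; default is 30000
```

Raising the timeout only masks the cost of the query, though; narrowing the query with filters (as below) is the better fix.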
A more reliable approach is to specify a filter in the query; see here: &_a=(columns:!(kubernetes.container_name,log),filters:!((%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%277921b670-1846-11e8-b5df-f570976454ee%27,key:kubernetes.container_name,negate:!f,params:(query:algorithm-rls-kprecommender-grader,type:phrase),type:phrase,value:algorithm-rls-kprecommender-grader),query:(match:(kubernetes.container_name:(query:algorithm-rls-kprecommender-grader,type:phrase))))),index:%277921b670-1846-11e8-b5df-f570976454ee%27,interval:auto,query:(language:lucene,query:(query_string:(analyze_wildcard:!t,default_field:%27%27,query:%27%27))),sort:!(%27@timestamp%27,desc),uiState:(spy:(mode:(fill:!f,name:!n)))))
liang: if the log volume is large, or the user's query is inefficient (e.g. it uses a wildcard query), then because ES has a query queue, the queue may fill up. The front end shows no error but returns only partial results, which can be mistaken for indexing lag; in fact indexing is fine and the query merely partially failed. You can check the metrics in Kibana monitoring to confirm (http://k.llsops.com/app/monitoring#/overview)
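One way to confirm the partial-failure theory is to look for search thread-pool rejections directly; the _cat/thread_pool API exposes queue depth and rejected counts per node (the host below is an assumption):

```
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected'
```

A non-zero, growing rejected count on search nodes is the signature of queries being dropped rather than documents being indexed late.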
About ELK
What the application team reported:
Problem:
The application team reported that no logs had been visible on stag-k since March 24;
The March 28 logs only appeared after adjustments were made;
While investigating on March 28, we found that logstash's Events Received Rate and Events Emitted Rate were both 0, and its CPU Usage and Load average were also 0. So the first suspicion was that something was wrong with Kafka and messages were no longer being forwarded to logstash;
The Kafka operations colleague investigated, reported "I see the x-logstash consumer group is gone" (why it disappeared is unclear), and confirmed that Kafka itself was fine;
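The consumer-group check can be reproduced with the stock Kafka CLI; the bootstrap address below is an assumption:

```
# list all consumer groups; x-logstash should appear if it is alive
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --list

# show per-partition offsets and lag for the group, if it exists
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group x-logstash
```

A missing group here is consistent with the logstash consumer having stopped committing offsets (e.g. because its output was blocked), rather than with a Kafka-side failure.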
Re-examining the ELK side, we found:
[2019-03-28T08:08:00,385][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2019-03-28T08:08:00,385][INFO ][logstash.outputs.elasticsearch] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>125}
[2019-03-25T05:31:45,487][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-6-2019.03.25][0]] containing [133] requests]]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:140) ~[?:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_151]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_151]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_151]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_151]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_151]
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_151]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_151]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_151]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_151]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:141) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:123) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:571) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:380) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:375) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:908) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:878) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:943) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:780) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:733) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:897) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-6-2019.03.25][0]] containing [133] requests]
... 12 more
[2019-03-25T05:31:45,488][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-6-2019.03.25][0]] containing [133] requests]]
[2019-03-25T05:31:45,489][WARN ][o.e.x.m.MonitoringService] [elk-master-stag] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:106) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:104) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:217) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:211) ~[?:?]
at org.elasticsearch.xpack.common.IteratingActionListener.onResponse(IteratingActionListener.java:108) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:175) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:147) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:123) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:571) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:380) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:375) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:908) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:878) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:943) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:780) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:733) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:897) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:167) ~[?:?]
... 26 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:136) ~[?:?]
... 24 more
[2019-03-25T05:31:55,176][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-kibana-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-kibana-6-2019.03.25][0]] containing [index {[.monitoring-kibana-6-2019.03.25][doc][To9Us2kB1ahbx08ChwtI], source[{"cluster_uuid":"sQpdXWdLRQq8FnlGMO3zfw","timestamp":"2019-03-25T05:30:55.174Z","interval_ms":10000,"type":"kibana_stats","source_node":{"uuid":"kFkRHqR1QzuFzNiEnp2tXw","host":"10.128.45.224","transport_address":"10.128.45.224:9300","ip":"10.128.45.224","name":"elk-master-stag","timestamp":"2019-03-25T05:30:55.174Z"},"kibana_stats":{"cloud":{"name":"aws","id":"i-099f794ff96165cb8","vm_type":"m4.xlarge","region":"cn-north-1","zone":"cn-north-1a","metadata":{"marketplaceProductCodes":null,"version":"2017-09-30","pendingTime":"2018-02-01T06:53:03Z","imageId":"ami-5906d034","architecture":"x86_64","kernelId":null,"ramdiskId":null}},"concurrent_connections":29718,"os":{"load":{"1m":0,"5m":0,"15m":0},"memory":{"total_in_bytes":16825540608,"free_in_bytes":2769485824,"used_in_bytes":14056054784},"uptime_in_millis":36023852000},"process":{"event_loop_delay":236254.61135864258,"memory":{"heap":{"total_in_bytes":147058688,"used_in_bytes":130770040,"size_limit":1501560832},"resident_set_size_in_bytes":224997376},"uptime_in_millis":1461197665},"requests":{"disconnects":0,"total":5448,"status_codes":{"200":3521,"304":1926,"404":1}},"response_times":{"average":18985,"max":18985},"timestamp":"2019-03-25T05:30:50.166Z","kibana":{"uuid":"44bd6cf6-402c-4e72-8790-540f3272bbd6","name":"kibana","index":".kibana","host":"elk-master-stag","transport_address":"0:8080","version":"6.1.2","snapshot":false,"status":"green"},"usage":{"dashboard":{"total":0},"visualization":{"total":0},"search":{"total":0},"index_pattern":{"total":1},"index":".kibana","graph_workspace":{"total":0},"timelion_sheet":{"total":0},"xpack":{"reporting":{"available":true,"enabled":true,"_all":0,"csv":{"a
vailable":true,"total":0},"printable_pdf":{"available":false,"total":0}}}}}}]}]]]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:140) ~[?:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_151]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_151]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_151]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_151]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_151]
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_151]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_151]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_151]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_151]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:141) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:123) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:571) ~[elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:380) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:375) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:908) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:878) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:943) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:780) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:733) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:897) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-kibana-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-kibana-6-2019.03.25][0]] containing [index {[.monitoring-kibana-6-2019.03.25][doc][To9Us2kB1ahbx08ChwtI], source[{"cluster_uuid":"sQpdXWdLRQq8FnlGMO3zfw","timestamp":"2019-03-25T05:30:55.174Z","interval_ms":10000,"type":"kibana_stats","source_node":{"uuid":"kFkRHqR1QzuFzNiEnp2tXw","host":"10.128.45.224","transport_address":"10.128.45.224:9300","ip":"10.128.45.224","name":"elk-master-stag","timestamp":"2019-03-25T05:30:55.174Z"},"kibana_stats":{"cloud":{"name":"aws","id":"i-099f794ff96165cb8","vm_type":"m4.xlarge","region":"cn-north-1","zone":"cn-north-1a","metadata":{"marketplaceProductCodes":null,"version":"2017-09-30","pendingTime":"2018-02-01T06:53:03Z","imageId":"ami-5906d034","architecture":"x86_64","kernelId":null,"ramdiskId":null}},"concurrent_connections":29718,"os":{"load":{"1m":0,"5m":0,"15m":0},"memory":{"total_in_bytes":16825540608,"free_in_bytes":2769485824,"used_in_bytes":14056054784},"uptime_in_millis":36023852000},"process":{"event_loop_delay":236254.61135864258,"memory":{"heap":{"total_in_bytes":147058688,"used_in_bytes":130770040,"size_limit":1501560832},"resident_set_size_in_bytes":224997376},"uptime_in_millis":1461197665},"requests":{"disconnects":0,"total":5448,"status_codes":{"200":3521,"304":1926,"404":1}},"response_times":{"average":18985,"max":18985},"timestamp":"2019-03-25T05:30:50.166Z","kibana":{"uuid":"44bd6cf6-402c-4e72-8790-540f3272bbd6","name":"kibana","index":".kibana","host":"elk-master-stag","transport_address":"0:8080","version":"6.1.2","snapshot":false,"status":"green"},"usage":{"dashboard":{"total":0},"visualization":{"total":0},"search":{"total":0},"index_pattern":{"total":1},"index":".kibana","graph_workspace":{"total":0},"timelion_sheet":{"total":0},"xpack":{"reporting":{"available":true,"enabled":true,"_all":0,"csv":{"available":true,"total":0
},"printable_pdf":{"available":false,"total":0}}}}}}]}]]
... 12 more
[2019-03-25T05:31:56,482][INFO ][o.e.c.r.a.DiskThresholdMonitor] [elk-master-stag] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2019-03-25T05:31:56,531][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.monitoring-kibana-6-2019.03.25][0], [.monitoring-es-6-2019.03.25][0], [.monitoring-logstash-6-2019.03.25][0]] ...]).
[2019-03-25T05:31:57,070][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2019.03.25][0]] ...]).
[2019-03-25T05:32:56,507][INFO ][o.e.c.r.a.DiskThresholdMonitor] [elk-master-stag] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2019-03-26T00:00:01,225][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.26] creating index, cause [auto(bulk api)], templates [.monitoring-es], shards [1]/[1], mappings [doc]
[2019-03-26T00:00:01,285][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-6-2019.03.26][0]] ...]).
[2019-03-26T00:00:04,612][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.26] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[1], mappings [doc]
[2019-03-26T00:00:04,667][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2019.03.26][0]] ...]).
[2019-03-26T00:00:06,920][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.26] creating index, cause [auto(bulk api)], templates [.monitoring-logstash], shards [1]/[1], mappings [doc]
[2019-03-26T00:00:06,982][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-logstash-6-2019.03.26][0]] ...]).
[2019-03-26T01:00:00,000][INFO ][o.e.x.m.e.l.LocalExporter] cleaning up [3] old indices
[2019-03-26T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.19/trbpca69QIKP1wvtpVacfQ] deleting index
[2019-03-26T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.19/a-ok7dW1SnO5s-X-W21fsg] deleting index
[2019-03-26T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.19/DBe2Jc7uRAClciPTB6YQIg] deleting index
[2019-03-26T02:00:01,594][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [logstash-2019.03.19/hUdivvX-T7CqMZ6_uVed-A] deleting index
[2019-03-26T02:22:00,000][INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks
[2019-03-26T02:22:00,000][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Deleting expired data
[2019-03-26T02:22:00,001][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Completed deletion of expired data
[2019-03-26T02:22:00,001][INFO ][o.e.x.m.MlDailyMaintenanceService] Successfully completed [ML] maintenance tasks
[2019-03-27T00:00:01,508][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.27] creating index, cause [auto(bulk api)], templates [.monitoring-es], shards [1]/[1], mappings [doc]
[2019-03-27T00:00:01,570][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-6-2019.03.27][0]] ...]).
[2019-03-27T00:00:02,977][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.27] creating index, cause [auto(bulk api)], templates [.monitoring-logstash], shards [1]/[1], mappings [doc]
[2019-03-27T00:00:03,038][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-logstash-6-2019.03.27][0]] ...]).
[2019-03-27T00:00:09,733][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.27] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[1], mappings [doc]
[2019-03-27T00:00:09,789][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2019.03.27][0]] ...]).
[2019-03-27T01:00:00,000][INFO ][o.e.x.m.e.l.LocalExporter] cleaning up [3] old indices
[2019-03-27T01:00:00,000][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.20/W8oomeMWTcC6X1SbmHkl2A] deleting index
[2019-03-27T01:00:00,000][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.20/f3-Ri53eSNuWjP7axbl6wQ] deleting index
[2019-03-27T01:00:00,000][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.20/ZzlLDVU1Q5CcOVenbEhQUA] deleting index
[2019-03-27T02:00:02,284][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [logstash-2019.03.20/fi7ByetIQGqYqtnGKqRPSw] deleting index
[2019-03-27T02:22:00,000][INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks
[2019-03-27T02:22:00,000][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Deleting expired data
[2019-03-27T02:22:00,000][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Completed deletion of expired data
[2019-03-27T02:22:00,000][INFO ][o.e.x.m.MlDailyMaintenanceService] Successfully completed [ML] maintenance tasks
[2019-03-27T07:13:42,882][WARN ][o.e.d.a.a.i.t.p.PutIndexTemplateRequest] Deprecated field [template] used, replaced by [index_patterns]
[2019-03-28T00:00:03,324][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.28] creating index, cause [auto(bulk api)], templates [.monitoring-es], shards [1]/[1], mappings [doc]
[2019-03-28T00:00:03,385][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-6-2019.03.28][0]] ...]).
[2019-03-28T00:00:05,379][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.28] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[1], mappings [doc]
[2019-03-28T00:00:05,453][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2019.03.28][0]] ...]).
[2019-03-28T00:00:09,308][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.28] creating index, cause [auto(bulk api)], templates [.monitoring-logstash], shards [1]/[1], mappings [doc]
[2019-03-28T00:00:09,369][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-logstash-6-2019.03.28][0]] ...]).
[2019-03-28T01:00:00,000][INFO ][o.e.x.m.e.l.LocalExporter] cleaning up [3] old indices
[2019-03-28T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.21/GzSbw7LVT9qSfZNT2jn6-Q] deleting index
[2019-03-28T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.21/cjmEdz8oToai_guyoNOLxA] deleting index
[2019-03-28T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.21/KRqQ2Pt6RS65lmH8ynyVwQ] deleting index
[2019-03-28T02:00:01,541][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [logstash-2019.03.21/dPjrTlVmQz6B4vsZKuMfyA] deleting index
[2019-03-28T02:22:00,000][INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks
[2019-03-28T02:22:00,000][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Deleting expired data
[2019-03-28T02:22:00,001][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Completed deletion of expired data
[2019-03-28T02:22:00,001][INFO ][o.e.x.m.MlDailyMaintenanceService] Successfully completed [ML] maintenance tasks
Related references (1, 2) suggest the root cause: the disk watermark was hit, which switched the indices in es to read-only;
Once you hit the flood stage watermark (95% by default), you must manually release the index lock — see https://www.elastic.co/guide/en/elasticsearch/reference/6.2/disk-allocator.html . The following will do the trick, which you need to run for every affected index:
PUT /<your-index>/_settings
{ "index.blocks.read_only_allow_delete": null }
See the official documentation for the meaning of the index.blocks.read_only_allow_delete parameter:
cluster.routing.allocation.disk.watermark.flood_stage
Controls the flood stage watermark. It defaults to 95%, meaning that Elasticsearch enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on the node that has at least one disk exceeding the flood stage. This is a last resort to prevent nodes from running out of disk space. The index block must be released manually once there is enough disk space available to allow indexing operations to continue.
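Before releasing the block, it is worth confirming which node actually crossed the watermark. A quick way, using the standard `_cat/allocation` API in Dev Tools (the column list here is just one useful selection), is:

```
GET _cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total
```

Any node with disk.percent at or above 95 is the one that triggered the flood-stage block.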
Running the following in Dev Tools
GET _all/_settings
shows that some indices carry the block while others do not:
...
".monitoring-es-6-2019.03.24": {
"settings": {
"index": {
"codec": "best_compression",
"number_of_shards": "1",
"blocks": {
"read_only_allow_delete": "true" -- true
},
"provided_name": ".monitoring-es-6-2019.03.24",
"format": "6",
"creation_date": "1553385602490",
"number_of_replicas": "1",
"uuid": "4ndbiB7PTQadbmylKABHcA",
"version": {
"created": "6010299"
}
}
}
},
...
".monitoring-es-6-2019.03.25": {
"settings": {
"index": {
"codec": "best_compression",
"number_of_shards": "1",
"provided_name": ".monitoring-es-6-2019.03.25",
"format": "6",
"creation_date": "1553472005023",
"number_of_replicas": "1",
"uuid": "FB8yF1qZQmCAddDlyk0lYQ",
"version": {
"created": "6010299"
}
}
}
}
Run the following command to reset the setting:
PUT /_all/_settings
{
"index.blocks.read_only_allow_delete": null
}
Querying again with
GET _all/_settings
shows that the index.blocks.read_only_allow_delete settings are all gone;
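If disk pressure is expected to recur, the thresholds themselves can also be tuned cluster-wide through the same disk-allocator settings from the docs above (the 97% value below is only an illustration, not a recommendation):

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}
```

The real fix, of course, is freeing disk space or tightening retention, since the flood stage exists precisely to stop nodes from filling up.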
Moments later, es logged the following, logstash started working again, and data became queryable in kibana;
[2019-03-28T11:18:57,494][WARN ][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Synchronous auto-commit of offsets {logs.appos.pipeline-2=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-67=OffsetAndMetadata{offset=7056685, metadata=''}, logs.k8s-34=OffsetAndMetadata{offset=7056970, metadata=''}, logs.k8s-1=OffsetAndMetadata{offset=782089692, metadata=''}, logs.k8s-92=OffsetAndMetadata{offset=6984906, metadata=''}, logs.k8s-59=OffsetAndMetadata{offset=6984937, metadata=''}, logs.k8s-26=OffsetAndMetadata{offset=6983137, metadata=''}, logs.k8s-84=OffsetAndMetadata{offset=6982360, metadata=''}, logs.k8s-51=OffsetAndMetadata{offset=6983778, metadata=''}, logs.k8s-18=OffsetAndMetadata{offset=6982519, metadata=''}, logs.http-7=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-76=OffsetAndMetadata{offset=7056057, metadata=''}, logs.k8s-43=OffsetAndMetadata{offset=7055205, metadata=''}, logs.k8s-10=OffsetAndMetadata{offset=7055728, metadata=''}, logs.k8s-68=OffsetAndMetadata{offset=6983793, metadata=''}, logs.k8s-35=OffsetAndMetadata{offset=6985012, metadata=''}, logs.http-9=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-2=OffsetAndMetadata{offset=6984847, metadata=''}, logs.k8s-93=OffsetAndMetadata{offset=6978598, metadata=''}, logs.k8s-60=OffsetAndMetadata{offset=6980707, metadata=''}, logs.k8s-27=OffsetAndMetadata{offset=6981931, metadata=''}, logs.http-0=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-85=OffsetAndMetadata{offset=7057733, metadata=''}, logs.k8s-52=OffsetAndMetadata{offset=7056351, metadata=''}, logs.k8s-19=OffsetAndMetadata{offset=7058517, metadata=''}, logs.http-8=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-77=OffsetAndMetadata{offset=6984256, metadata=''}, logs.k8s-44=OffsetAndMetadata{offset=6982969, metadata=''}, logs.k8s-11=OffsetAndMetadata{offset=6983483, metadata=''}, logs.appos.pipeline-0=OffsetAndMetadata{offset=0, metadata=''}, 
logs.k8s-69=OffsetAndMetadata{offset=6981050, metadata=''}, logs.k8s-36=OffsetAndMetadata{offset=6980375, metadata=''}, logs.k8s-3=OffsetAndMetadata{offset=6981172, metadata=''}, logs.k8s-94=OffsetAndMetadata{offset=7057371, metadata=''}, logs.k8s-61=OffsetAndMetadata{offset=7057219, metadata=''}, logs.k8s-28=OffsetAndMetadata{offset=7055964, metadata=''}, logs.k8s-86=OffsetAndMetadata{offset=6984709, metadata=''}, logs.k8s-53=OffsetAndMetadata{offset=6983314, metadata=''}, logs.k8s-20=OffsetAndMetadata{offset=6984123, metadata=''}, logs.k8s-78=OffsetAndMetadata{offset=6981693, metadata=''}, logs.k8s-45=OffsetAndMetadata{offset=6981792, metadata=''}, logs.k8s-12=OffsetAndMetadata{offset=6981421, metadata=''}, logs.http-1=OffsetAndMetadata{offset=0, metadata=''}, logs.appos.pipeline-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-70=OffsetAndMetadata{offset=7054904, metadata=''}, logs.k8s-37=OffsetAndMetadata{offset=7057406, metadata=''}, logs.k8s-4=OffsetAndMetadata{offset=7056521, metadata=''}, logs.k8s-95=OffsetAndMetadata{offset=6981881, metadata=''}, logs.k8s-62=OffsetAndMetadata{offset=6981646, metadata=''}, logs.k8s-29=OffsetAndMetadata{offset=6983158, metadata=''}, logs.k8s-87=OffsetAndMetadata{offset=6982329, metadata=''}, logs.k8s-54=OffsetAndMetadata{offset=6982191, metadata=''}, logs.k8s-21=OffsetAndMetadata{offset=6984038, metadata=''}, logs.k8s-79=OffsetAndMetadata{offset=7056344, metadata=''}, logs.k8s-46=OffsetAndMetadata{offset=7059182, metadata=''}, logs.k8s-13=OffsetAndMetadata{offset=7054272, metadata=''}, logs.http-2=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-71=OffsetAndMetadata{offset=6985299, metadata=''}, logs.k8s-38=OffsetAndMetadata{offset=6986246, metadata=''}, logs.k8s-5=OffsetAndMetadata{offset=6984215, metadata=''}, logs.k8s-96=OffsetAndMetadata{offset=6981458, metadata=''}, logs.k8s-63=OffsetAndMetadata{offset=6982017, metadata=''}, logs.k8s-30=OffsetAndMetadata{offset=6983126, metadata=''}, 
logs.k8s-88=OffsetAndMetadata{offset=7055617, metadata=''}, logs.k8s-55=OffsetAndMetadata{offset=7057937, metadata=''}, logs.k8s-22=OffsetAndMetadata{offset=7054966, metadata=''}, logs.misc-0=OffsetAndMetadata{offset=0, metadata=''}, logs.systemd-2=OffsetAndMetadata{offset=859987, metadata=''}, logs.k8s-80=OffsetAndMetadata{offset=6985906, metadata=''}, logs.k8s-47=OffsetAndMetadata{offset=6983938, metadata=''}, logs.neo-0=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-14=OffsetAndMetadata{offset=6983119, metadata=''}, logs.http-3=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-72=OffsetAndMetadata{offset=6982493, metadata=''}, logs.k8s-39=OffsetAndMetadata{offset=6982291, metadata=''}, logs.k8s-6=OffsetAndMetadata{offset=6982752, metadata=''}, logs.k8s-97=OffsetAndMetadata{offset=7055999, metadata=''}, logs.k8s-64=OffsetAndMetadata{offset=7055740, metadata=''}, logs.k8s-31=OffsetAndMetadata{offset=7058299, metadata=''}, logs.cc-0=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-89=OffsetAndMetadata{offset=6983828, metadata=''}, logs.misc-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-56=OffsetAndMetadata{offset=6984963, metadata=''}, logs.k8s-23=OffsetAndMetadata{offset=6984679, metadata=''}, logs.cc-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-81=OffsetAndMetadata{offset=6981834, metadata=''}, logs.k8s-48=OffsetAndMetadata{offset=6983078, metadata=''}, logs.neo-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-15=OffsetAndMetadata{offset=6981419, metadata=''}, logs.http-4=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-73=OffsetAndMetadata{offset=7059005, metadata=''}, logs.k8s-40=OffsetAndMetadata{offset=7057329, metadata=''}, logs.k8s-7=OffsetAndMetadata{offset=7056235, metadata=''}, logs.leader_board-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-98=OffsetAndMetadata{offset=6982925, metadata=''}, logs.k8s-65=OffsetAndMetadata{offset=6985358, metadata=''}, logs.k8s-32=OffsetAndMetadata{offset=6983886, metadata=''}, 
logs.k8s-90=OffsetAndMetadata{offset=6982573, metadata=''}, logs.k8s-57=OffsetAndMetadata{offset=6982881, metadata=''}, logs.k8s-24=OffsetAndMetadata{offset=6981934, metadata=''}, logs.k8s-82=OffsetAndMetadata{offset=7057278, metadata=''}, logs.k8s-49=OffsetAndMetadata{offset=7056137, metadata=''}, logs.k8s-16=OffsetAndMetadata{offset=7057753, metadata=''}, logs.http-5=OffsetAndMetadata{offset=0, metadata=''}, logs.systemd-0=OffsetAndMetadata{offset=858587, metadata=''}, logs.k8s-74=OffsetAndMetadata{offset=6985753, metadata=''}, logs.k8s-41=OffsetAndMetadata{offset=6983987, metadata=''}, logs.k8s-8=OffsetAndMetadata{offset=6983851, metadata=''}, logs.k8s-99=OffsetAndMetadata{offset=6982844, metadata=''}, logs.k8s-66=OffsetAndMetadata{offset=6981875, metadata=''}, logs.k8s-33=OffsetAndMetadata{offset=6980478, metadata=''}, logs.k8s-0=OffsetAndMetadata{offset=4150994620, metadata=''}, logs.k8s-91=OffsetAndMetadata{offset=7057557, metadata=''}, logs.k8s-58=OffsetAndMetadata{offset=7057478, metadata=''}, logs.k8s-25=OffsetAndMetadata{offset=7054267, metadata=''}, logs.k8s-83=OffsetAndMetadata{offset=6981121, metadata=''}, logs.k8s-50=OffsetAndMetadata{offset=6983899, metadata=''}, logs.k8s-17=OffsetAndMetadata{offset=6984657, metadata=''}, logs.http-6=OffsetAndMetadata{offset=0, metadata=''}, logs.leader_board-0=OffsetAndMetadata{offset=0, metadata=''}, logs.systemd-1=OffsetAndMetadata{offset=360642, metadata=''}, logs.k8s-75=OffsetAndMetadata{offset=6983399, metadata=''}, logs.k8s-42=OffsetAndMetadata{offset=6981837, metadata=''}, logs.k8s-9=OffsetAndMetadata{offset=6983050, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. 
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
[2019-03-28T11:18:57,496][INFO ][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Revoking previously assigned partitions [logs.appos.pipeline-2, logs.k8s-67, logs.k8s-34, logs.k8s-1, logs.k8s-92, logs.k8s-59, logs.k8s-26, logs.k8s-84, logs.k8s-51, logs.k8s-18, logs.http-7, logs.k8s-76, logs.k8s-43, logs.k8s-10, logs.k8s-68, logs.k8s-35, logs.http-9, logs.k8s-2, logs.k8s-93, logs.k8s-60, logs.k8s-27, logs.http-0, logs.k8s-85, logs.k8s-52, logs.k8s-19, logs.http-8, logs.k8s-77, logs.k8s-44, logs.k8s-11, logs.appos.pipeline-0, logs.k8s-69, logs.k8s-36, logs.k8s-3, logs.k8s-94, logs.k8s-61, logs.k8s-28, logs.k8s-86, logs.k8s-53, logs.k8s-20, logs.k8s-78, logs.k8s-45, logs.k8s-12, logs.http-1, logs.appos.pipeline-1, logs.k8s-70, logs.k8s-37, logs.k8s-4, logs.k8s-95, logs.k8s-62, logs.k8s-29, logs.k8s-87, logs.k8s-54, logs.k8s-21, logs.k8s-79, logs.k8s-46, logs.k8s-13, logs.http-2, logs.k8s-71, logs.k8s-38, logs.k8s-5, logs.k8s-96, logs.k8s-63, logs.k8s-30, logs.k8s-88, logs.k8s-55, logs.k8s-22, logs.misc-0, logs.systemd-2, logs.k8s-80, logs.k8s-47, logs.neo-0, logs.k8s-14, logs.http-3, logs.k8s-72, logs.k8s-39, logs.k8s-6, logs.k8s-97, logs.k8s-64, logs.k8s-31, logs.cc-0, logs.k8s-89, logs.misc-1, logs.k8s-56, logs.k8s-23, logs.cc-1, logs.k8s-81, logs.k8s-48, logs.neo-1, logs.k8s-15, logs.http-4, logs.k8s-73, logs.k8s-40, logs.k8s-7, logs.leader_board-1, logs.k8s-98, logs.k8s-65, logs.k8s-32, logs.k8s-90, logs.k8s-57, logs.k8s-24, logs.k8s-82, logs.k8s-49, logs.k8s-16, logs.http-5, logs.systemd-0, logs.k8s-74, logs.k8s-41, logs.k8s-8, logs.k8s-99, logs.k8s-66, logs.k8s-33, logs.k8s-0, logs.k8s-91, logs.k8s-58, logs.k8s-25, logs.k8s-83, logs.k8s-50, logs.k8s-17, logs.http-6, logs.leader_board-0, logs.systemd-1, logs.k8s-75, logs.k8s-42, logs.k8s-9]
[2019-03-28T11:18:57,504][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] (Re-)joining group
[2019-03-28T11:18:57,508][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Marking the coordinator 10.128.40.77:9092 (id: 2147483647 rack: null) dead
[2019-03-28T11:18:57,613][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Discovered coordinator 10.128.40.77:9092 (id: 2147483647 rack: null)
[2019-03-28T11:18:57,615][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] (Re-)joining group
[2019-03-28T11:19:00,686][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Successfully joined group with generation 1
[2019-03-28T11:19:00,687][INFO ][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Setting newly assigned partitions [logs.appos.pipeline-2, logs.k8s-67, logs.k8s-34, logs.k8s-1, logs.k8s-92, logs.k8s-59, logs.k8s-26, logs.k8s-84, logs.k8s-51, logs.k8s-18, logs.http-7, logs.k8s-76, logs.k8s-43, logs.k8s-10, logs.k8s-68, logs.k8s-35, logs.k8s-2, logs.http-9, logs.k8s-93, logs.k8s-60, logs.k8s-27, logs.http-0, logs.k8s-85, logs.k8s-52, logs.k8s-19, logs.http-8, logs.k8s-77, logs.k8s-44, logs.k8s-11, logs.appos.pipeline-0, logs.k8s-69, logs.k8s-36, logs.k8s-3, logs.k8s-94, logs.k8s-61, logs.k8s-28, logs.k8s-86, logs.k8s-53, logs.k8s-20, logs.k8s-78, logs.k8s-45, logs.k8s-12, logs.http-1, logs.appos.pipeline-1, logs.k8s-70, logs.k8s-37, logs.k8s-4, logs.k8s-95, logs.k8s-62, logs.k8s-29, logs.k8s-87, logs.k8s-54, logs.k8s-21, logs.k8s-79, logs.k8s-46, logs.k8s-13, logs.http-2, logs.k8s-71, logs.k8s-38, logs.k8s-5, logs.k8s-96, logs.k8s-63, logs.k8s-30, logs.k8s-88, logs.k8s-55, logs.k8s-22, logs.misc-0, logs.k8s-80, logs.systemd-2, logs.k8s-47, logs.k8s-14, logs.neo-0, logs.http-3, logs.k8s-72, logs.k8s-39, logs.k8s-6, logs.k8s-97, logs.k8s-64, logs.k8s-31, logs.cc-0, logs.k8s-89, logs.k8s-56, logs.misc-1, logs.k8s-23, logs.cc-1, logs.k8s-81, logs.k8s-48, logs.k8s-15, logs.neo-1, logs.http-4, logs.k8s-73, logs.k8s-40, logs.k8s-7, logs.leader_board-1, logs.k8s-98, logs.k8s-65, logs.k8s-32, logs.k8s-90, logs.k8s-57, logs.k8s-24, logs.k8s-82, logs.k8s-49, logs.k8s-16, logs.http-5, logs.systemd-0, logs.k8s-74, logs.k8s-41, logs.k8s-8, logs.k8s-99, logs.k8s-66, logs.k8s-33, logs.k8s-0, logs.k8s-91, logs.k8s-58, logs.k8s-25, logs.k8s-83, logs.k8s-50, logs.k8s-17, logs.http-6, logs.systemd-1, logs.leader_board-0, logs.k8s-75, logs.k8s-42, logs.k8s-9]
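The max.poll.interval.ms error above can usually be mitigated from the logstash side, as the message itself suggests. A sketch of the relevant kafka input options (option names are from the logstash-input-kafka plugin; the broker address and topic pattern are assumptions based on the logs above):

```
input {
  kafka {
    bootstrap_servers  => "10.128.40.77:9092"
    group_id           => "x-logstash"
    topics_pattern     => "logs\..*"
    # fewer records per poll() so each batch is processed within the poll interval
    max_poll_records   => "100"
    # more headroom before the broker considers the consumer dead and rebalances
    session_timeout_ms => "30000"
  }
}
```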
Business team:
liang's answer:
Question:
Business:
applog-inner-web-v009-scdbj
, this exact log entry could not be found in kibana; it is from today, 2018-07-02 09:28:11, and it does appear in the kube UI logs: {"error":"fail to get s3 file: Download failed, fail to get s3 token, key: log/darwin/20180702/MzY0MWQwMDAwMzUyOThkOQ==/3msvJ6FX5b5, error: rpc error: code = Unavailable desc = transport is closing","message":"fail to s3 url, commandId: 3msvJ6FX5b5","severity":"error","time":"2018-07-02T01:28:11.850705169Z"}
liang's answer:
The logstash log line indicating that the queue filled up and messages were dropped:
[INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@2544ab29 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@65c54795[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 5711128047]]"})
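The 429 above means the es bulk thread-pool queue (capacity 200) was full. Current queue depth and cumulative rejections can be checked per node in Dev Tools with the standard `_cat/thread_pool` API:

```
GET _cat/thread_pool/bulk?v&h=node_name,active,queue,queue_size,rejected
```

A steadily growing rejected count confirms that es, not logstash, is the bottleneck.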
What sqlbuffet returned
The query that was run:
SELECT * FROM raw_data.logs_k8s WHERE json_record LIKE '%Download failed%' AND data_date = '2018-07-02' AND cluster_id = 'logs.k8s' LIMIT 10;
The result:
json_record | data_date | cluster_id | job_id |
---|---|---|---|
{"docker":{"container_id":"aac83cc0dad6d640091f9478bd61ee32787dfeaa1011317a90b04fccdc9f9725"},"error":"fail to get s3 file: Download failed, fail to get s3 token, key: log/darwin/20180702/MzY0MWQwMDAwMzUyOThkOQ==/3msvJ6FX5b5, error: rpc error: code = Unavailable desc = transport is closing","kubernetes":{"container_name":"backend-rls-applog","host":"ip-172-16-46-8.cn-north-1.compute.internal","labels":{"app":"applog","applog-inner-web":"true","cluster":"applog-inner-web","detail":"web","load-balancer-applog-prod-inner":"true","pod-template-hash":"481460618","replication-controller":"applog-inner-web-v009","stack":"inner","version":"9"},"master_url":"https://100.64.0.1:443/api","namespace_id":"7102570b-490a-11e8-b717-0278a9e12c2e","namespace_name":"backend","pod_id":"e7530f61-781b-11e8-b717-0278a9e12c2e","pod_name":"applog-inner-web-v009-scdbj"},"log":"{\"error\":\"fail to get s3 file: Download failed, fail to get s3 token, key: log/darwin/20180702/MzY0MWQwMDAwMzUyOThkOQ==/3msvJ6FX5b5, error: rpc error: code = Unavailable desc = transport is closing\",\"message\":\"fail to s3 url, commandId: 3msvJ6FX5b5\",\"severity\":\"error\",\"time\":\"2018-07-02T01:28:11.850705169Z\"}\n","message":"fail to s3 url, commandId: 3msvJ6FX5b5","severity":"error","stream":"stderr","time":"2018-07-02T01:28:11.850911451Z"} | 2018-07-02 | logs.k8s | overhear-sink-logs.k8s-2018-07-02.100011 |
Solution: for now, you can build a complementary fallback on your side based on sqlbuffet (with a 1-hour sync delay); elk currently cannot guarantee zero message loss (configuring an oversized queue would cause queue congestion, making the most recent data unqueryable).
liang: once the disk protection mechanism is triggered, new data is blocked from entering the ES cluster, and the block persists even after disk capacity recovers; it must be released manually.
Question:
Business: what are all these identical log entries on kibana? Per the business logic, each user should produce only one such entry per day.
liang: check in sqlbuffet whether the entries are duplicated there as well. If not, the duplicates were created by the elk retry mechanism; if they are duplicated there too, then the data source itself is probably producing duplicate data.
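Following liang's suggestion, a rough duplicate check against sqlbuffet could look like this (reusing the table and columns from the query earlier in this thread; grouping on the full json_record is a blunt instrument, since retried copies may differ in metadata fields):

```sql
SELECT json_record, COUNT(*) AS cnt
FROM raw_data.logs_k8s
WHERE data_date = '2018-07-02'
  AND cluster_id = 'logs.k8s'
GROUP BY json_record
HAVING COUNT(*) > 1
LIMIT 10;
```

Rows returned here mean the data source itself is duplicating; an empty result points at the elk retry mechanism.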
Question:
Recently, while looking at logs in kibana, yiwei.zheng found that some requests have multiple log entries, for example:
http://k.llsops.com/goto/fa0e655057dc4a217713f6f4e1bbc3cb
PS: since kibana only retains logs for the last few days, the link above may have expired.
For example, for 2539916820 in the screenshot, the response_time values are all 0.051 and the timestamps are identical, so it is impossible to tell whether this is the same request logged repeatedly or different requests that arrived at the same instant;
Also, since logs are an important source of information when debugging, this needs to be confirmed.
nginx timestamps are at most ms-precise; see http://nginx.org/en/docs/http/ngx_http_log_module.html
Tried it on nprod; the current elk configuration already seems to handle the ms format, so no config changes should be needed.
ms precision still cannot fully distinguish requests; https://docs.konghq.com/plugins/correlation-id/ can distinguish different requests, though it cannot fully determine their ordering; if we need this plugin, we can discuss it further.
liang: we previously found a case where the root cause was seconds-level timestamps; raising the precision to ms or us eliminated the duplicates.
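For nginx specifically, sub-second timestamps are available out of the box: $msec (time with millisecond resolution at log-write time) and $request_time (ms resolution) are both documented variables of the log module linked above. A custom log_format (the name with_ms is made up for this example) might look like:

```
log_format with_ms '$remote_addr [$time_iso8601] msec=$msec '
                   '"$request" $status rt=$request_time';
```

Even with ms timestamps, concurrent requests can still collide; only a per-request id such as Kong's correlation-id distinguishes them reliably.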
2018-01-15T10:36:32.795+0800 INFO zap/server_interceptors.go:39 finished unary call {"grpc.start_time": "2018-01-15T10:36:32+08:00", "system": "grpc", "span.kind": "server", "grpc.service": "hexley.CourseCenter", "grpc.method": "GetCourse", "grpc.code": "OK", "grpc.duration": 0}
liang: could we simply stop indexing every service rpc request log through ELK? Latency is already visible through metrics, isn't it? I don't see what value these logs add...
If messages start piling up later, I plan to filter out the rpc request logs entirely.
Business: the latency of the neo nginx logs in ELK feels much higher than it used to be. For example, right now I can only query up to noon yesterday, more than half a day behind. I remember it used to be almost real-time; I often used it to check whether users showed up for live classes.
tony: not sure whether it's a fluentd issue, but judging from the tail pos file, it has been updating continuously and is fairly recent, so we probably need to look at kafka next;
liang: I remember that originally the fluentd kafka plugin did not have buffered uploads enabled, which made it very inefficient and kept a very high number of connections to kafka; after we tuned kafka and enabled buffering, these large delays went away.
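For reference, with fluent-plugin-kafka the buffered behavior liang mentions is configured through a <buffer> section (older td-agent versions used the kafka_buffered output type instead; the broker address, topic, and paths below are illustrative only):

```
<match logs.**>
  @type kafka2
  brokers 10.128.40.77:9092
  default_topic logs.k8s
  <format>
    @type json
  </format>
  <buffer>
    @type file
    path /var/log/td-agent/buffer/kafka
    flush_interval 3s
    chunk_limit_size 8m
  </buffer>
</match>
```

File buffering batches records into chunks before flushing to kafka, which cuts both CPU usage and connection count compared with per-record sends.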
Question:
These two should be the same thing; fluentd keeps a dedicated file for recording read positions;
zee: the staging elk disk is full; es is using it all... is there anywhere we can delete some data?
liang: the cause is that the scheduled cleanup script is an old version that is incompatible with es; I'll update it.
Js->LogServer->ELK
Possible problem points: