
ELK Troubleshooting Collection #64

Open moooofly opened 5 years ago

moooofly commented 5 years ago

Possible problem areas:



Related: https://wiki.liulishuo.work/pages/viewpage.action?pageId=42139346

moooofly commented 5 years ago

Case 1 - Log volume "flattening" issue

Conclusion: during peak hours, td-agent's log-processing throughput is far below Kong's QPS, so peak-hour logs were rotated away before being processed.

Underlying detail:

For example, suppose the log file currently being written is kong.log; rotated files are renamed to kong.log.1 and so on. td-agent presumably only reads the content of kong.log, so once the log volume grows too large, the rotation frequency inevitably increases. At the same time, processing the logs consumes a lot of CPU in td-agent, and at high volume it cannot keep up. The likely outcome is that rotation is triggered before the current kong.log has been fully processed; td-agent then switches to the new kong.log, leaving part of the old file unprocessed.
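
For reference, a minimal sketch of what a td-agent tail source for kong.log might look like (the path, tag, and pos_file are assumptions, not the actual production config; fluentd v1 / td-agent 3 syntax). The key point is that only the current file is tailed, and rotate_wait bounds how long a rotated kong.log.1 keeps being read, which is the window that gets squeezed when rotation happens too often:

<source>
  @type tail
  path /var/log/kong/kong.log            # only the file currently being written is tailed
  pos_file /var/log/td-agent/kong.log.pos
  tag kong.access
  rotate_wait 5                          # seconds to keep reading the rotated file after rotation
  <parse>
    @type json
  </parse>
</source>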

On http://hk.llsops.com/app/kibana#/home?_g=() you can see the log volume being "flattened" during peak hours:

image

image

image

To spot this kind of "flattening", the time granularity must be small; at coarser granularity it is easy to miss:

image

image

image

moooofly commented 5 years ago

Case 2 - Query failures caused by excessive log volume

The current situation: querying 7 days of data is already a struggle and very likely times out.

image

Queries running longer than 30 s are cut off:

Discover: Request Timeout after 30000ms
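
The 30 s cutoff is Kibana's default elasticsearch.requestTimeout. It can be raised in kibana.yml (a sketch with an illustrative value; raising it only masks slow queries rather than fixing them):

# kibana.yml -- the default of 30000 ms is what produces "Request Timeout after 30000ms"
elasticsearch.requestTimeout: 120000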

A more reliable approach is to specify a filter in the query; see here: &_a=(columns:!(kubernetes.container_name,log),filters:!((%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%277921b670-1846-11e8-b5df-f570976454ee%27,key:kubernetes.container_name,negate:!f,params:(query:algorithm-rls-kprecommender-grader,type:phrase),type:phrase,value:algorithm-rls-kprecommender-grader),query:(match:(kubernetes.container_name:(query:algorithm-rls-kprecommender-grader,type:phrase))))),index:%277921b670-1846-11e8-b5df-f570976454ee%27,interval:auto,query:(language:lucene,query:(query_string:(analyze_wildcard:!t,default_field:%27%27,query:%27%27))),sort:!(%27@timestamp%27,desc),uiState:(spy:(mode:(fill:!f,name:!n)))))

image

liang: If the log volume is large, or the query is inefficient (e.g. it uses wildcard queries), then because ES has a query queue, the queue may fill up. The frontend shows no error but only returns a small subset of results. This can be mistaken for indexing lag, when in fact indexing is fine and the query merely partially failed. You can check the metrics in Kibana Monitoring to confirm (http://k.llsops.com/app/monitoring#/overview).
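
A quick way to confirm the "partial failure" theory is to check, in Dev Tools, whether the search thread pool is queueing or rejecting work (a sketch; the column list is illustrative):

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected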

moooofly commented 5 years ago

Case 3 - Some applications logging far too much

About ELK

image

What the application team said:

Questions:

moooofly commented 5 years ago

Case 4 - No data visible in Kibana

0x03 Abnormal peak traffic triggered the disk watermark, turning the ES indices read-only

Application team report: on stag-k, no logs have been visible since March 24.

image

The March 28 logs only appeared after the adjustment described below was made:

image

While investigating on March 28, we found that Logstash's Events Received Rate and Events Emitted Rate were both 0, and its CPU Usage and Load Average were also 0. So the first suspicion was that something was wrong with Kafka and messages were no longer being forwarded to Logstash.

After the Kafka ops colleague investigated, the finding was "I see that the x-logstash consumer group is gone" (why it disappeared is unclear), and Kafka itself was confirmed to be fine.

Re-checking the ELK side, we found:

[2019-03-28T08:08:00,385][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2019-03-28T08:08:00,385][INFO ][logstash.outputs.elasticsearch] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>125}
[2019-03-25T05:31:45,487][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-6-2019.03.25][0]] containing [133] requests]]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:140) ~[?:?]
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_151]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_151]
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_151]
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_151]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_151]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_151]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:141) ~[?:?]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:123) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:571) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:380) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:375) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:908) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:878) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:943) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:780) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:733) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:897) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-6-2019.03.25][0]] containing [133] requests]
    ... 12 more
[2019-03-25T05:31:45,488][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-6-2019.03.25][0]] containing [133] requests]]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:140) ~[?:?]
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_151]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_151]
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_151]
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_151]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_151]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_151]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:141) ~[?:?]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:123) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:571) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:380) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:375) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:908) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:878) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:943) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:780) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:733) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:897) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-6-2019.03.25][0]] containing [133] requests]
    ... 12 more
[2019-03-25T05:31:45,489][WARN ][o.e.x.m.MonitoringService] [elk-master-stag] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
    at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:106) ~[?:?]
    at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:104) ~[?:?]
    at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:217) ~[?:?]
    at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:211) ~[?:?]
    at org.elasticsearch.xpack.common.IteratingActionListener.onResponse(IteratingActionListener.java:108) ~[?:?]
    at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:175) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:147) ~[?:?]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:123) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:571) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:380) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:375) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:908) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:878) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:943) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:780) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:733) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:897) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
    at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:167) ~[?:?]
    ... 26 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:136) ~[?:?]
    ... 24 more
[2019-03-25T05:31:55,176][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-kibana-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-kibana-6-2019.03.25][0]] containing [index {[.monitoring-kibana-6-2019.03.25][doc][To9Us2kB1ahbx08ChwtI], source[{"cluster_uuid":"sQpdXWdLRQq8FnlGMO3zfw","timestamp":"2019-03-25T05:30:55.174Z","interval_ms":10000,"type":"kibana_stats","source_node":{"uuid":"kFkRHqR1QzuFzNiEnp2tXw","host":"10.128.45.224","transport_address":"10.128.45.224:9300","ip":"10.128.45.224","name":"elk-master-stag","timestamp":"2019-03-25T05:30:55.174Z"},"kibana_stats":{"cloud":{"name":"aws","id":"i-099f794ff96165cb8","vm_type":"m4.xlarge","region":"cn-north-1","zone":"cn-north-1a","metadata":{"marketplaceProductCodes":null,"version":"2017-09-30","pendingTime":"2018-02-01T06:53:03Z","imageId":"ami-5906d034","architecture":"x86_64","kernelId":null,"ramdiskId":null}},"concurrent_connections":29718,"os":{"load":{"1m":0,"5m":0,"15m":0},"memory":{"total_in_bytes":16825540608,"free_in_bytes":2769485824,"used_in_bytes":14056054784},"uptime_in_millis":36023852000},"process":{"event_loop_delay":236254.61135864258,"memory":{"heap":{"total_in_bytes":147058688,"used_in_bytes":130770040,"size_limit":1501560832},"resident_set_size_in_bytes":224997376},"uptime_in_millis":1461197665},"requests":{"disconnects":0,"total":5448,"status_codes":{"200":3521,"304":1926,"404":1}},"response_times":{"average":18985,"max":18985},"timestamp":"2019-03-25T05:30:50.166Z","kibana":{"uuid":"44bd6cf6-402c-4e72-8790-540f3272bbd6","name":"kibana","index":".kibana","host":"elk-master-stag","transport_address":"0:8080","version":"6.1.2","snapshot":false,"status":"green"},"usage":{"dashboard":{"total":0},"visualization":{"total":0},"search":{"total":0},"index_pattern":{"total":1},"index":".kibana","graph_workspace":{"total":0},"timelion_sheet":{"total":0},"xpack":{"reporting":{"available":true,"enabled":true,"_all":0,"csv":{"available":true,"total":0},"printable_pdf":{"available":false,"total":0}}}}}}]}]]]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:140) ~[?:?]
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_151]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_151]
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_151]
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_151]
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_151]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_151]
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_151]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:141) ~[?:?]
    at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:123) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:571) ~[elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:380) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:375) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:908) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:878) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:943) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:780) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:733) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:897) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-kibana-6-2019.03.25][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-kibana-6-2019.03.25][0]] containing [index {[.monitoring-kibana-6-2019.03.25][doc][To9Us2kB1ahbx08ChwtI], source[{"cluster_uuid":"sQpdXWdLRQq8FnlGMO3zfw","timestamp":"2019-03-25T05:30:55.174Z","interval_ms":10000,"type":"kibana_stats","source_node":{"uuid":"kFkRHqR1QzuFzNiEnp2tXw","host":"10.128.45.224","transport_address":"10.128.45.224:9300","ip":"10.128.45.224","name":"elk-master-stag","timestamp":"2019-03-25T05:30:55.174Z"},"kibana_stats":{"cloud":{"name":"aws","id":"i-099f794ff96165cb8","vm_type":"m4.xlarge","region":"cn-north-1","zone":"cn-north-1a","metadata":{"marketplaceProductCodes":null,"version":"2017-09-30","pendingTime":"2018-02-01T06:53:03Z","imageId":"ami-5906d034","architecture":"x86_64","kernelId":null,"ramdiskId":null}},"concurrent_connections":29718,"os":{"load":{"1m":0,"5m":0,"15m":0},"memory":{"total_in_bytes":16825540608,"free_in_bytes":2769485824,"used_in_bytes":14056054784},"uptime_in_millis":36023852000},"process":{"event_loop_delay":236254.61135864258,"memory":{"heap":{"total_in_bytes":147058688,"used_in_bytes":130770040,"size_limit":1501560832},"resident_set_size_in_bytes":224997376},"uptime_in_millis":1461197665},"requests":{"disconnects":0,"total":5448,"status_codes":{"200":3521,"304":1926,"404":1}},"response_times":{"average":18985,"max":18985},"timestamp":"2019-03-25T05:30:50.166Z","kibana":{"uuid":"44bd6cf6-402c-4e72-8790-540f3272bbd6","name":"kibana","index":".kibana","host":"elk-master-stag","transport_address":"0:8080","version":"6.1.2","snapshot":false,"status":"green"},"usage":{"dashboard":{"total":0},"visualization":{"total":0},"search":{"total":0},"index_pattern":{"total":1},"index":".kibana","graph_workspace":{"total":0},"timelion_sheet":{"total":0},"xpack":{"reporting":{"available":true,"enabled":true,"_all":0,"csv":{"available":true,"total":0},"printable_pdf":{"available":false,"total":0}}}}}}]}]]
    ... 12 more
[2019-03-25T05:31:56,482][INFO ][o.e.c.r.a.DiskThresholdMonitor] [elk-master-stag] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2019-03-25T05:31:56,531][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.monitoring-kibana-6-2019.03.25][0], [.monitoring-es-6-2019.03.25][0], [.monitoring-logstash-6-2019.03.25][0]] ...]).
[2019-03-25T05:31:57,070][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2019.03.25][0]] ...]).
[2019-03-25T05:32:56,507][INFO ][o.e.c.r.a.DiskThresholdMonitor] [elk-master-stag] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2019-03-26T00:00:01,225][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.26] creating index, cause [auto(bulk api)], templates [.monitoring-es], shards [1]/[1], mappings [doc]
[2019-03-26T00:00:01,285][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-6-2019.03.26][0]] ...]).
[2019-03-26T00:00:04,612][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.26] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[1], mappings [doc]
[2019-03-26T00:00:04,667][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2019.03.26][0]] ...]).
[2019-03-26T00:00:06,920][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.26] creating index, cause [auto(bulk api)], templates [.monitoring-logstash], shards [1]/[1], mappings [doc]
[2019-03-26T00:00:06,982][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-logstash-6-2019.03.26][0]] ...]).
[2019-03-26T01:00:00,000][INFO ][o.e.x.m.e.l.LocalExporter] cleaning up [3] old indices
[2019-03-26T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.19/trbpca69QIKP1wvtpVacfQ] deleting index
[2019-03-26T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.19/a-ok7dW1SnO5s-X-W21fsg] deleting index
[2019-03-26T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.19/DBe2Jc7uRAClciPTB6YQIg] deleting index
[2019-03-26T02:00:01,594][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [logstash-2019.03.19/hUdivvX-T7CqMZ6_uVed-A] deleting index
[2019-03-26T02:22:00,000][INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks
[2019-03-26T02:22:00,000][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Deleting expired data
[2019-03-26T02:22:00,001][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Completed deletion of expired data
[2019-03-26T02:22:00,001][INFO ][o.e.x.m.MlDailyMaintenanceService] Successfully completed [ML] maintenance tasks
[2019-03-27T00:00:01,508][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.27] creating index, cause [auto(bulk api)], templates [.monitoring-es], shards [1]/[1], mappings [doc]
[2019-03-27T00:00:01,570][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-6-2019.03.27][0]] ...]).
[2019-03-27T00:00:02,977][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.27] creating index, cause [auto(bulk api)], templates [.monitoring-logstash], shards [1]/[1], mappings [doc]
[2019-03-27T00:00:03,038][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-logstash-6-2019.03.27][0]] ...]).
[2019-03-27T00:00:09,733][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.27] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[1], mappings [doc]
[2019-03-27T00:00:09,789][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2019.03.27][0]] ...]).
[2019-03-27T01:00:00,000][INFO ][o.e.x.m.e.l.LocalExporter] cleaning up [3] old indices
[2019-03-27T01:00:00,000][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.20/W8oomeMWTcC6X1SbmHkl2A] deleting index
[2019-03-27T01:00:00,000][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.20/f3-Ri53eSNuWjP7axbl6wQ] deleting index
[2019-03-27T01:00:00,000][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.20/ZzlLDVU1Q5CcOVenbEhQUA] deleting index
[2019-03-27T02:00:02,284][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [logstash-2019.03.20/fi7ByetIQGqYqtnGKqRPSw] deleting index
[2019-03-27T02:22:00,000][INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks
[2019-03-27T02:22:00,000][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Deleting expired data
[2019-03-27T02:22:00,000][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Completed deletion of expired data
[2019-03-27T02:22:00,000][INFO ][o.e.x.m.MlDailyMaintenanceService] Successfully completed [ML] maintenance tasks
[2019-03-27T07:13:42,882][WARN ][o.e.d.a.a.i.t.p.PutIndexTemplateRequest] Deprecated field [template] used, replaced by [index_patterns]
[2019-03-28T00:00:03,324][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.28] creating index, cause [auto(bulk api)], templates [.monitoring-es], shards [1]/[1], mappings [doc]
[2019-03-28T00:00:03,385][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-6-2019.03.28][0]] ...]).
[2019-03-28T00:00:05,379][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.28] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[1], mappings [doc]
[2019-03-28T00:00:05,453][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2019.03.28][0]] ...]).
[2019-03-28T00:00:09,308][INFO ][o.e.c.m.MetaDataCreateIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.28] creating index, cause [auto(bulk api)], templates [.monitoring-logstash], shards [1]/[1], mappings [doc]
[2019-03-28T00:00:09,369][INFO ][o.e.c.r.a.AllocationService] [elk-master-stag] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-logstash-6-2019.03.28][0]] ...]).
[2019-03-28T01:00:00,000][INFO ][o.e.x.m.e.l.LocalExporter] cleaning up [3] old indices
[2019-03-28T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-es-6-2019.03.21/GzSbw7LVT9qSfZNT2jn6-Q] deleting index
[2019-03-28T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-logstash-6-2019.03.21/cjmEdz8oToai_guyoNOLxA] deleting index
[2019-03-28T01:00:00,001][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [.monitoring-kibana-6-2019.03.21/KRqQ2Pt6RS65lmH8ynyVwQ] deleting index
[2019-03-28T02:00:01,541][INFO ][o.e.c.m.MetaDataDeleteIndexService] [elk-master-stag] [logstash-2019.03.21/dPjrTlVmQz6B4vsZKuMfyA] deleting index
[2019-03-28T02:22:00,000][INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks
[2019-03-28T02:22:00,000][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Deleting expired data
[2019-03-28T02:22:00,001][INFO ][o.e.x.m.a.DeleteExpiredDataAction$TransportAction] [elk-master-stag] Completed deletion of expired data
[2019-03-28T02:22:00,001][INFO ][o.e.x.m.MlDailyMaintenanceService] Successfully completed [ML] maintenance tasks

After checking related references, the cause appears to be that the disk flood-stage watermark was triggered, which switched the ES indices to read-only.

Once you hit the flood stage watermark (95% by default), you must manually release the index lock — see https://www.elastic.co/guide/en/elasticsearch/reference/6.2/disk-allocator.html . The following will do the trick, which you need to run for every affected index:

PUT /<your-index>/_settings
{
  "index.blocks.read_only_allow_delete": null
}

See the official documentation for the meaning of the index.blocks.read_only_allow_delete setting:

cluster.routing.allocation.disk.watermark.flood_stage

Controls the flood stage watermark. It defaults to 95%, meaning that Elasticsearch enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on the node that has at least one disk exceeding the flood stage. This is a last resort to prevent nodes from running out of disk space. The index block must be released manually once there is enough disk space available to allow indexing operations to continue.
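
For context, a Dev Tools sketch for checking per-node disk usage and, if absolutely necessary, temporarily raising the flood-stage watermark (the 97% value is illustrative; the real fix is freeing disk space or expanding the volume, after which the read-only block still has to be removed manually as above):

GET _cat/allocation?v

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}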

Actions taken

Run the following in Dev Tools:

GET _all/_settings

The output shows:

...
  ".monitoring-es-6-2019.03.24": {
    "settings": {
      "index": {
        "codec": "best_compression",
        "number_of_shards": "1",
        "blocks": {
          "read_only_allow_delete": "true"    -- true
        },
        "provided_name": ".monitoring-es-6-2019.03.24",
        "format": "6",
        "creation_date": "1553385602490",
        "number_of_replicas": "1",
        "uuid": "4ndbiB7PTQadbmylKABHcA",
        "version": {
          "created": "6010299"
        }
      }
    }
  },

...

  ".monitoring-es-6-2019.03.25": {
    "settings": {
      "index": {
        "codec": "best_compression",
        "number_of_shards": "1",
        "provided_name": ".monitoring-es-6-2019.03.25",
        "format": "6",
        "creation_date": "1553472005023",
        "number_of_replicas": "1",
        "uuid": "FB8yF1qZQmCAddDlyk0lYQ",
        "version": {
          "created": "6010299"
        }
      }
    }
  }

Run the following to reset the block:

PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}

Check again:

GET _all/_settings

The index.blocks.read_only_allow_delete setting is now gone from every index.

A moment later, ES logged the following, Logstash started working again, and data became queryable in Kibana:

[2019-03-28T11:18:57,494][WARN ][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Synchronous auto-commit of offsets {logs.appos.pipeline-2=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-67=OffsetAndMetadata{offset=7056685, metadata=''}, logs.k8s-34=OffsetAndMetadata{offset=7056970, metadata=''}, logs.k8s-1=OffsetAndMetadata{offset=782089692, metadata=''}, logs.k8s-92=OffsetAndMetadata{offset=6984906, metadata=''}, logs.k8s-59=OffsetAndMetadata{offset=6984937, metadata=''}, logs.k8s-26=OffsetAndMetadata{offset=6983137, metadata=''}, logs.k8s-84=OffsetAndMetadata{offset=6982360, metadata=''}, logs.k8s-51=OffsetAndMetadata{offset=6983778, metadata=''}, logs.k8s-18=OffsetAndMetadata{offset=6982519, metadata=''}, logs.http-7=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-76=OffsetAndMetadata{offset=7056057, metadata=''}, logs.k8s-43=OffsetAndMetadata{offset=7055205, metadata=''}, logs.k8s-10=OffsetAndMetadata{offset=7055728, metadata=''}, logs.k8s-68=OffsetAndMetadata{offset=6983793, metadata=''}, logs.k8s-35=OffsetAndMetadata{offset=6985012, metadata=''}, logs.http-9=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-2=OffsetAndMetadata{offset=6984847, metadata=''}, logs.k8s-93=OffsetAndMetadata{offset=6978598, metadata=''}, logs.k8s-60=OffsetAndMetadata{offset=6980707, metadata=''}, logs.k8s-27=OffsetAndMetadata{offset=6981931, metadata=''}, logs.http-0=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-85=OffsetAndMetadata{offset=7057733, metadata=''}, logs.k8s-52=OffsetAndMetadata{offset=7056351, metadata=''}, logs.k8s-19=OffsetAndMetadata{offset=7058517, metadata=''}, logs.http-8=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-77=OffsetAndMetadata{offset=6984256, metadata=''}, logs.k8s-44=OffsetAndMetadata{offset=6982969, metadata=''}, logs.k8s-11=OffsetAndMetadata{offset=6983483, metadata=''}, logs.appos.pipeline-0=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-69=OffsetAndMetadata{offset=6981050, metadata=''}, logs.k8s-36=OffsetAndMetadata{offset=6980375, metadata=''}, logs.k8s-3=OffsetAndMetadata{offset=6981172, metadata=''}, logs.k8s-94=OffsetAndMetadata{offset=7057371, metadata=''}, logs.k8s-61=OffsetAndMetadata{offset=7057219, metadata=''}, logs.k8s-28=OffsetAndMetadata{offset=7055964, metadata=''}, logs.k8s-86=OffsetAndMetadata{offset=6984709, metadata=''}, logs.k8s-53=OffsetAndMetadata{offset=6983314, metadata=''}, logs.k8s-20=OffsetAndMetadata{offset=6984123, metadata=''}, logs.k8s-78=OffsetAndMetadata{offset=6981693, metadata=''}, logs.k8s-45=OffsetAndMetadata{offset=6981792, metadata=''}, logs.k8s-12=OffsetAndMetadata{offset=6981421, metadata=''}, logs.http-1=OffsetAndMetadata{offset=0, metadata=''}, logs.appos.pipeline-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-70=OffsetAndMetadata{offset=7054904, metadata=''}, logs.k8s-37=OffsetAndMetadata{offset=7057406, metadata=''}, logs.k8s-4=OffsetAndMetadata{offset=7056521, metadata=''}, logs.k8s-95=OffsetAndMetadata{offset=6981881, metadata=''}, logs.k8s-62=OffsetAndMetadata{offset=6981646, metadata=''}, logs.k8s-29=OffsetAndMetadata{offset=6983158, metadata=''}, logs.k8s-87=OffsetAndMetadata{offset=6982329, metadata=''}, logs.k8s-54=OffsetAndMetadata{offset=6982191, metadata=''}, logs.k8s-21=OffsetAndMetadata{offset=6984038, metadata=''}, logs.k8s-79=OffsetAndMetadata{offset=7056344, metadata=''}, logs.k8s-46=OffsetAndMetadata{offset=7059182, metadata=''}, logs.k8s-13=OffsetAndMetadata{offset=7054272, metadata=''}, 
logs.http-2=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-71=OffsetAndMetadata{offset=6985299, metadata=''}, logs.k8s-38=OffsetAndMetadata{offset=6986246, metadata=''}, logs.k8s-5=OffsetAndMetadata{offset=6984215, metadata=''}, logs.k8s-96=OffsetAndMetadata{offset=6981458, metadata=''}, logs.k8s-63=OffsetAndMetadata{offset=6982017, metadata=''}, logs.k8s-30=OffsetAndMetadata{offset=6983126, metadata=''}, logs.k8s-88=OffsetAndMetadata{offset=7055617, metadata=''}, logs.k8s-55=OffsetAndMetadata{offset=7057937, metadata=''}, logs.k8s-22=OffsetAndMetadata{offset=7054966, metadata=''}, logs.misc-0=OffsetAndMetadata{offset=0, metadata=''}, logs.systemd-2=OffsetAndMetadata{offset=859987, metadata=''}, logs.k8s-80=OffsetAndMetadata{offset=6985906, metadata=''}, logs.k8s-47=OffsetAndMetadata{offset=6983938, metadata=''}, logs.neo-0=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-14=OffsetAndMetadata{offset=6983119, metadata=''}, logs.http-3=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-72=OffsetAndMetadata{offset=6982493, metadata=''}, logs.k8s-39=OffsetAndMetadata{offset=6982291, metadata=''}, logs.k8s-6=OffsetAndMetadata{offset=6982752, metadata=''}, logs.k8s-97=OffsetAndMetadata{offset=7055999, metadata=''}, logs.k8s-64=OffsetAndMetadata{offset=7055740, metadata=''}, logs.k8s-31=OffsetAndMetadata{offset=7058299, metadata=''}, logs.cc-0=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-89=OffsetAndMetadata{offset=6983828, metadata=''}, logs.misc-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-56=OffsetAndMetadata{offset=6984963, metadata=''}, logs.k8s-23=OffsetAndMetadata{offset=6984679, metadata=''}, logs.cc-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-81=OffsetAndMetadata{offset=6981834, metadata=''}, logs.k8s-48=OffsetAndMetadata{offset=6983078, metadata=''}, logs.neo-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-15=OffsetAndMetadata{offset=6981419, metadata=''}, logs.http-4=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-73=OffsetAndMetadata{offset=7059005, metadata=''}, logs.k8s-40=OffsetAndMetadata{offset=7057329, metadata=''}, logs.k8s-7=OffsetAndMetadata{offset=7056235, metadata=''}, logs.leader_board-1=OffsetAndMetadata{offset=0, metadata=''}, logs.k8s-98=OffsetAndMetadata{offset=6982925, metadata=''}, logs.k8s-65=OffsetAndMetadata{offset=6985358, metadata=''}, logs.k8s-32=OffsetAndMetadata{offset=6983886, metadata=''}, logs.k8s-90=OffsetAndMetadata{offset=6982573, metadata=''}, logs.k8s-57=OffsetAndMetadata{offset=6982881, metadata=''}, logs.k8s-24=OffsetAndMetadata{offset=6981934, metadata=''}, logs.k8s-82=OffsetAndMetadata{offset=7057278, metadata=''}, logs.k8s-49=OffsetAndMetadata{offset=7056137, metadata=''}, logs.k8s-16=OffsetAndMetadata{offset=7057753, metadata=''}, logs.http-5=OffsetAndMetadata{offset=0, metadata=''}, logs.systemd-0=OffsetAndMetadata{offset=858587, metadata=''}, logs.k8s-74=OffsetAndMetadata{offset=6985753, metadata=''}, logs.k8s-41=OffsetAndMetadata{offset=6983987, metadata=''}, logs.k8s-8=OffsetAndMetadata{offset=6983851, metadata=''}, logs.k8s-99=OffsetAndMetadata{offset=6982844, metadata=''}, logs.k8s-66=OffsetAndMetadata{offset=6981875, metadata=''}, logs.k8s-33=OffsetAndMetadata{offset=6980478, metadata=''}, logs.k8s-0=OffsetAndMetadata{offset=4150994620, metadata=''}, logs.k8s-91=OffsetAndMetadata{offset=7057557, metadata=''}, logs.k8s-58=OffsetAndMetadata{offset=7057478, metadata=''}, logs.k8s-25=OffsetAndMetadata{offset=7054267, metadata=''}, logs.k8s-83=OffsetAndMetadata{offset=6981121, metadata=''}, 
logs.k8s-50=OffsetAndMetadata{offset=6983899, metadata=''}, logs.k8s-17=OffsetAndMetadata{offset=6984657, metadata=''}, logs.http-6=OffsetAndMetadata{offset=0, metadata=''}, logs.leader_board-0=OffsetAndMetadata{offset=0, metadata=''}, logs.systemd-1=OffsetAndMetadata{offset=360642, metadata=''}, logs.k8s-75=OffsetAndMetadata{offset=6983399, metadata=''}, logs.k8s-42=OffsetAndMetadata{offset=6981837, metadata=''}, logs.k8s-9=OffsetAndMetadata{offset=6983050, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
[2019-03-28T11:18:57,496][INFO ][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Revoking previously assigned partitions [logs.appos.pipeline-2, logs.k8s-67, logs.k8s-34, logs.k8s-1, logs.k8s-92, logs.k8s-59, logs.k8s-26, logs.k8s-84, logs.k8s-51, logs.k8s-18, logs.http-7, logs.k8s-76, logs.k8s-43, logs.k8s-10, logs.k8s-68, logs.k8s-35, logs.http-9, logs.k8s-2, logs.k8s-93, logs.k8s-60, logs.k8s-27, logs.http-0, logs.k8s-85, logs.k8s-52, logs.k8s-19, logs.http-8, logs.k8s-77, logs.k8s-44, logs.k8s-11, logs.appos.pipeline-0, logs.k8s-69, logs.k8s-36, logs.k8s-3, logs.k8s-94, logs.k8s-61, logs.k8s-28, logs.k8s-86, logs.k8s-53, logs.k8s-20, logs.k8s-78, logs.k8s-45, logs.k8s-12, logs.http-1, logs.appos.pipeline-1, logs.k8s-70, logs.k8s-37, logs.k8s-4, logs.k8s-95, logs.k8s-62, logs.k8s-29, logs.k8s-87, logs.k8s-54, logs.k8s-21, logs.k8s-79, logs.k8s-46, logs.k8s-13, logs.http-2, logs.k8s-71, logs.k8s-38, logs.k8s-5, logs.k8s-96, logs.k8s-63, logs.k8s-30, logs.k8s-88, logs.k8s-55, logs.k8s-22, logs.misc-0, logs.systemd-2, logs.k8s-80, logs.k8s-47, logs.neo-0, logs.k8s-14, logs.http-3, logs.k8s-72, logs.k8s-39, logs.k8s-6, logs.k8s-97, logs.k8s-64, logs.k8s-31, logs.cc-0, logs.k8s-89, logs.misc-1, logs.k8s-56, logs.k8s-23, logs.cc-1, logs.k8s-81, logs.k8s-48, logs.neo-1, logs.k8s-15, logs.http-4, logs.k8s-73, logs.k8s-40, logs.k8s-7, logs.leader_board-1, logs.k8s-98, logs.k8s-65, logs.k8s-32, logs.k8s-90, logs.k8s-57, logs.k8s-24, logs.k8s-82, logs.k8s-49, logs.k8s-16, logs.http-5, logs.systemd-0, logs.k8s-74, logs.k8s-41, logs.k8s-8, logs.k8s-99, logs.k8s-66, logs.k8s-33, logs.k8s-0, logs.k8s-91, logs.k8s-58, logs.k8s-25, logs.k8s-83, logs.k8s-50, logs.k8s-17, logs.http-6, logs.leader_board-0, logs.systemd-1, logs.k8s-75, logs.k8s-42, logs.k8s-9]
[2019-03-28T11:18:57,504][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] (Re-)joining group
[2019-03-28T11:18:57,508][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Marking the coordinator 10.128.40.77:9092 (id: 2147483647 rack: null) dead
[2019-03-28T11:18:57,613][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Discovered coordinator 10.128.40.77:9092 (id: 2147483647 rack: null)
[2019-03-28T11:18:57,615][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] (Re-)joining group
[2019-03-28T11:19:00,686][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Successfully joined group with generation 1
[2019-03-28T11:19:00,687][INFO ][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [Consumer clientId=logstash-0, groupId=x-logstash] Setting newly assigned partitions [logs.appos.pipeline-2, logs.k8s-67, logs.k8s-34, logs.k8s-1, logs.k8s-92, logs.k8s-59, logs.k8s-26, logs.k8s-84, logs.k8s-51, logs.k8s-18, logs.http-7, logs.k8s-76, logs.k8s-43, logs.k8s-10, logs.k8s-68, logs.k8s-35, logs.k8s-2, logs.http-9, logs.k8s-93, logs.k8s-60, logs.k8s-27, logs.http-0, logs.k8s-85, logs.k8s-52, logs.k8s-19, logs.http-8, logs.k8s-77, logs.k8s-44, logs.k8s-11, logs.appos.pipeline-0, logs.k8s-69, logs.k8s-36, logs.k8s-3, logs.k8s-94, logs.k8s-61, logs.k8s-28, logs.k8s-86, logs.k8s-53, logs.k8s-20, logs.k8s-78, logs.k8s-45, logs.k8s-12, logs.http-1, logs.appos.pipeline-1, logs.k8s-70, logs.k8s-37, logs.k8s-4, logs.k8s-95, logs.k8s-62, logs.k8s-29, logs.k8s-87, logs.k8s-54, logs.k8s-21, logs.k8s-79, logs.k8s-46, logs.k8s-13, logs.http-2, logs.k8s-71, logs.k8s-38, logs.k8s-5, logs.k8s-96, logs.k8s-63, logs.k8s-30, logs.k8s-88, logs.k8s-55, logs.k8s-22, logs.misc-0, logs.k8s-80, logs.systemd-2, logs.k8s-47, logs.k8s-14, logs.neo-0, logs.http-3, logs.k8s-72, logs.k8s-39, logs.k8s-6, logs.k8s-97, logs.k8s-64, logs.k8s-31, logs.cc-0, logs.k8s-89, logs.k8s-56, logs.misc-1, logs.k8s-23, logs.cc-1, logs.k8s-81, logs.k8s-48, logs.k8s-15, logs.neo-1, logs.http-4, logs.k8s-73, logs.k8s-40, logs.k8s-7, logs.leader_board-1, logs.k8s-98, logs.k8s-65, logs.k8s-32, logs.k8s-90, logs.k8s-57, logs.k8s-24, logs.k8s-82, logs.k8s-49, logs.k8s-16, logs.http-5, logs.systemd-0, logs.k8s-74, logs.k8s-41, logs.k8s-8, logs.k8s-99, logs.k8s-66, logs.k8s-33, logs.k8s-0, logs.k8s-91, logs.k8s-58, logs.k8s-25, logs.k8s-83, logs.k8s-50, logs.k8s-17, logs.http-6, logs.systemd-1, logs.leader_board-0, logs.k8s-75, logs.k8s-42, logs.k8s-9]
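
The rebalance warning above indicates the consumer spent longer than max.poll.interval.ms between poll() calls. A hedged sketch of the relevant Logstash kafka input options (the topic pattern and values are illustrative, not the production pipeline config):

input {
  kafka {
    bootstrap_servers  => "10.128.40.77:9092"
    group_id           => "x-logstash"
    client_id          => "logstash-0"
    topics_pattern     => "logs\..*"
    # Smaller batches per poll() shorten the time between polls,
    # which is what the max.poll.interval.ms warning is about.
    max_poll_records   => "250"
    session_timeout_ms => "30000"
  }
}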

0x01 - Disk capacity reached its limit and triggered the protection mechanism

Application team:

liang's answer:

image

Questions:

0x02 - Logs not found in Kibana but available elsewhere

Application team:

{"error":"fail to get s3 file: Download failed, fail to get s3 token, key: log/darwin/20180702/MzY0MWQwMDAwMzUyOThkOQ==/3msvJ6FX5b5, error: rpc error: code = Unavailable desc = transport is closing","message":"fail to s3 url, commandId: 3msvJ6FX5b5","severity":"error","time":"2018-07-02T01:28:11.850705169Z"}

liang's answer:

The Logstash log entry showing messages being rejected because the ES queue is full:

[INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@2544ab29 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@65c54795[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 5711128047]]"})
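
The 429 above means the ES bulk thread-pool queue (capacity 200) is full and writes are being rejected. Whether that is currently happening can be checked in Dev Tools with something like:

GET _cat/thread_pool/bulk?v&h=node_name,name,active,queue,queue_size,rejected,completed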

What sqlbuffet shows:

image

Query executed:

SELECT * FROM raw_data.logs_k8s WHERE json_record LIKE '%Download failed%' AND data_date = '2018-07-02' AND cluster_id = 'logs.k8s' LIMIT 10;

Result:

json_record data_date cluster_id job_id
{"docker":{"container_id":"aac83cc0dad6d640091f9478bd61ee32787dfeaa1011317a90b04fccdc9f9725"},"error":"fail to get s3 file: Download failed, fail to get s3 token, key: log/darwin/20180702/MzY0MWQwMDAwMzUyOThkOQ==/3msvJ6FX5b5, error: rpc error: code = Unavailable desc = transport is closing","kubernetes":{"container_name":"backend-rls-applog","host":"ip-172-16-46-8.cn-north-1.compute.internal","labels":{"app":"applog","applog-inner-web":"true","cluster":"applog-inner-web","detail":"web","load-balancer-applog-prod-inner":"true","pod-template-hash":"481460618","replication-controller":"applog-inner-web-v009","stack":"inner","version":"9"},"master_url":"https://100.64.0.1:443/api","namespace_id":"7102570b-490a-11e8-b717-0278a9e12c2e","namespace_name":"backend","pod_id":"e7530f61-781b-11e8-b717-0278a9e12c2e","pod_name":"applog-inner-web-v009-scdbj"},"log":"{\"error\":\"fail to get s3 file: Download failed, fail to get s3 token, key: log/darwin/20180702/MzY0MWQwMDAwMzUyOThkOQ==/3msvJ6FX5b5, error: rpc error: code = Unavailable desc = transport is closing\",\"message\":\"fail to s3 url, commandId: 3msvJ6FX5b5\",\"severity\":\"error\",\"time\":\"2018-07-02T01:28:11.850705169Z\"}\n","message":"fail to s3 url, commandId: 3msvJ6FX5b5","severity":"error","stream":"stderr","time":"2018-07-02T01:28:11.850911451Z"} 2018-07-02 logs.k8s overhear-sink-logs.k8s-2018-07-02.100011

Workaround: for now, build a complementary solution on your side based on sqlbuffet (it has a 1-hour sync delay). ELK currently cannot guarantee zero message loss (configuring an extremely large queue would cause backlog, so the most recent data could not be queried).

liang: once the disk protection mechanism is triggered, new data is blocked from entering ES; even after disk capacity recovers, the problem remains and must be cleared manually.

Questions:

moooofly commented 5 years ago

Case 5 - Duplicate log entries in Kibana

Application team: why does Kibana show large numbers of log entries with identical content? According to the business logic, there should be only one such entry per user per day.

image

liang: check in sqlbuffet whether the duplicates exist there too. If not, they are caused by the ELK retry mechanism; if they do, the data source itself is probably producing duplicate data.

Questions:


ref: https://phab.llsapp.com/T63862

Recently, while looking at logs in Kibana, yiwei.zheng noticed that some requests have multiple log entries, for example:

http://k.llsops.com/goto/fa0e655057dc4a217713f6f4e1bbc3cb

PS: since Kibana only keeps the last few days of logs, the link above may no longer work.

image

For example, for 2539916820 in the screenshot, the response_time values are all 0.051 and the timestamps are identical, so we cannot tell whether this is the same request logged multiple times or different requests at the same instant.

Also, since logs are an important source of information when debugging, we need to confirm:


nginx timestamps are at most millisecond precision; see http://nginx.org/en/docs/http/ngx_http_log_module.html
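
For reference, nginx exposes $msec (epoch seconds with millisecond resolution) for access logs; a hypothetical log_format, not the actual Kong/nginx configuration in use:

log_format with_ms '$remote_addr [$msec] "$request" $status $request_time';
access_log /var/log/nginx/access.log with_ms;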

Tried it on nprod; the current ELK configuration already seems to accept the ms format, so no configuration change should be needed.

image

Millisecond precision still cannot fully distinguish requests. The correlation-id plugin (https://docs.konghq.com/plugins/correlation-id/) can distinguish different requests, though it cannot fully determine their ordering; if we need this plugin, let's discuss it further.
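
If we did adopt it, enabling the plugin globally through the Kong Admin API would look roughly like this (a sketch; the Admin API address and header name are assumptions):

curl -X POST http://localhost:8001/plugins \
  --data "name=correlation-id" \
  --data "config.header_name=Kong-Request-ID" \
  --data "config.generator=uuid#counter" \
  --data "config.echo_downstream=true"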

liang: one cause found previously was that the timestamps were at second granularity; adjusting them to ms or us removes the duplication.

moooofly commented 5 years ago

Case 6 - Low-value logs being shipped into ELK

2018-01-15T10:36:32.795+0800    INFO    zap/server_interceptors.go:39    finished unary call    {"grpc.start_time": "2018-01-15T10:36:32+08:00", "system": "grpc", "span.kind": "server", "grpc.service": "hexley.CourseCenter", "grpc.method": "GetCourse", "grpc.code": "OK", "grpc.duration": 0}

liang: could we stop indexing every service RPC request log through ELK? Latency is already visible via metrics, isn't it? I don't see any value in logs like these...

If messages start piling up later, I plan to simply filter out the RPC request logs.
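
A minimal sketch of such a filter (the field name log and the match string follow the sample entry above; not the actual production pipeline):

filter {
  # Drop per-request gRPC access logs before they are sent to ES.
  if [log] =~ /finished unary call/ {
    drop { }
  }
}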

moooofly commented 5 years ago

Case 7 - Excessive data lag in ELK

0x01 Kafka throughput cannot keep up

Application team: the neo nginx logs in ELK now seem to lag far more than before. For example, right now only data up to 12:00 noon yesterday is queryable, more than half a day behind. I recall it used to be nearly real-time, because I used to rely on it to check whether users attended live classes.

image

tony: not sure whether it is a fluentd problem, but judging by the tail pos it keeps being updated; the fluentd pos looks fairly recent, so we probably need to look at Kafka.

liang: as I recall, the original cause was that the fluentd Kafka plugin did not have buffered uploads enabled, so it was very inefficient and held a large number of connections to Kafka; after we enabled buffering, the lag never got this bad again.
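
For reference, a buffered fluentd Kafka output would look roughly like this (fluent-plugin-kafka kafka2 / td-agent 3 syntax; the broker address, topic, and buffer sizes are illustrative assumptions), batching events instead of sending them one by one:

<match logs.**>
  @type kafka2
  brokers 10.128.40.77:9092
  default_topic logs.k8s
  <format>
    @type json
  </format>
  <buffer topic>
    @type file
    path /var/log/td-agent/buffer/kafka
    flush_interval 3s                 # send in batches on an interval
    chunk_limit_size 8m
  </buffer>
</match>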

Questions:

These two should be the same thing: fluentd keeps a dedicated file that records read positions.

0x02 Disk full

zee: the staging ELK disk is full, the one ES uses... is there anything we can delete?

liang: the cause is that the scheduled cleanup script is an old version that is incompatible with this ES version; I will update it.

moooofly commented 5 years ago

Related


Miscellaneous

Can JS error logs be fed into ELK?

Principles

moooofly commented 5 years ago

Case 8 - Problems caused by ES cluster resource saturation (write performance)

See here for details.

image