Connections not closed properly

hebo1982 commented 2 years ago

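After we connected Kibana to Elasticsearch through the gateway, Kibana became completely unusable within a few days. The gateway was returning 503 and reporting the cluster as unavailable, yet when checked directly the ES cluster was available.

Checking the server's connections showed a large number of gateway connections that had not been released properly; everything went back to normal after restarting the gateway. Without the source code I cannot pinpoint the cause, so please check whether some connections are not being closed properly.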

netstat.zip

`netstat` is the record taken before the restart; `netstat-restart` is the record taken after the restart.
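
For anyone comparing two captures like these, a rough sketch of how the per-state connection counts could be tallied is given here. It assumes the files hold `netstat -ant`-style rows (protocol, queues, local address, foreign address, state); the actual contents of netstat.zip are not shown in this thread, and the client address `10.0.6.200` in the sample is made up for illustration.

```python
from collections import Counter

def count_tcp_states(netstat_text, port=None):
    """Count TCP connections per state in `netstat -ant`-style output.

    If `port` is given, only rows whose local or foreign address ends
    with ":<port>" are counted.
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if port is not None:
            suffix = f":{port}"
            if not (local.endswith(suffix) or foreign.endswith(suffix)):
                continue
        counts[state] += 1
    return counts

# Hypothetical sample rows; 10.0.6.200 stands in for the gateway host.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.6.200:51234        10.0.6.146:9527         ESTABLISHED
tcp        0      0 10.0.6.200:51235        10.0.6.146:9527         CLOSE_WAIT
tcp        0      0 10.0.6.200:51236        10.0.6.148:9527         CLOSE_WAIT
tcp        0      0 0.0.0.0:9528            0.0.0.0:*               LISTEN
"""
print(count_tcp_states(sample, port=9527))
```

A large and growing `CLOSE_WAIT` or `ESTABLISHED` count towards port 9527 before the restart, and a small one after it, would be consistent with connections not being released.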

medcl commented 2 years ago

Is :9527 the forwarding port that proxies Kibana? Please post your complete configuration.

hebo1982 commented 2 years ago

9527 is the port of the ES service.

```yaml
path.data: data
path.logs: log
#log.level: trace
#log.debug: true

#api:
#  enabled: true
#  network:
#    binding: 127.0.0.1:2900

entry:
  - name: es_gateway
    enabled: true
    router: default_router
    max_concurrency: 10000
    network:
      binding: 0.0.0.0:9528
#    tls:
#      enabled: true

flow:
  - name: default_flow
    filter:
      - context_regex_replace:
          context: "_ctx.request.uri"
          pattern: "_from=[^&]*&?"
          to: ""
      - date_range_precision_tuning:
          time_precision: 6
      - get_cache:
      - bulk_reshuffle:
          when:
            contains:
              _ctx.request.path: /_bulk
          elasticsearch: prod
          level: node
          fix_null_id: true
      - elasticsearch:
          elasticsearch: prod  #elasticsearch configure reference name
          max_connection_per_node: 1000 #max tcp connection to upstream, default for all nodes
          max_response_size: -1 #default for all nodes
          balancer: weight
          refresh: # refresh upstream nodes list, need to enable this feature to use elasticsearch nodes auto discovery
            enabled: true
            interval: 60s
      - set_cache:
          cache_type: ristretto
          min_response_size: 100
          max_response_size: 1024000
          cache_ttl: 30s
          max_cache_items: 100000
      # - elasticsearch_health_check:
      #     elasticsearch: prod
  - name: logging # this flow is used for request logging, refer to `router`'s `tracing_flow`
    filter:
      - stats:
      - logging:
          queue_name: request_logging
          max_request_body_size: 1024
          max_response_body_size: 1024
          when: #>1s or none-200 requests will be logged
            or:
              - not:
                  or:
                    - equals:
                        _ctx.request.path: "/favicon.ico"
                    - equals:
                        _ctx.response.status: 200
                    - in:
                        _ctx.request.path: ["/sw.js"]
              - range:
                  _ctx.elapsed.gte: 1000
router:
  - name: default_router
    default_flow: default_flow
    tracing_flow: logging

elasticsearch:
- name: prod
  enabled: true
  schema: http
  hosts:
    - 10.0.6.146:9527 
    - 10.0.6.148:9527
    - 10.0.6.167:9527
  traffic_control: #global traffic control
    max_bps_per_node: 209715200 #max total bytes send to es per node, 200MB/s
    max_qps_per_node: 20000 #max total requests send to es per node, 20k/s
  basic_auth: #used to discovery full cluster nodes, or check elasticsearch's health and versions
    username: "xxxxx"
    password: "xxxxx"
  discovery: # auto discovery elasticsearch cluster nodes
    enabled: true
    refresh:
      enabled: true
      interval: 60s

# - name: dev
#   enabled: true
#   schema: http
#   hosts:
#     - 127.0.0.1:9527 
#   traffic_control: #global traffic control
#     max_bps_per_node: 209715200 #max total bytes send to es per node, 200MB/s
#     max_qps_per_node: 20000 #max total requests send to es per node, 20k/s
#   basic_auth: #used to discovery full cluster nodes, or check elasticsearch's health and versions
#     username: "xxxxx"
#     password: "xxxxx"
#   discovery: # auto discovery elasticsearch cluster nodes
#     enabled: true
#     refresh:
#       enabled: true
#       interval: 60s

# pipeline:
# - name: request_logging_index
#   auto_start: true
#   keep_running: true
#   processor:
#     - json_indexing:
#         index_name: "gateway_requests"
#         elasticsearch: "prod"
#         input_queue: "request_logging"
#         idle_timeout_in_seconds: 1
#         worker_size: 1
#         bulk_size_in_mb: 10 #in MB
#         when:
#           cluster_available: [ "prod" ]
# - name: bulk_request_ingest
#   auto_start: true
#   keep_running: true
#   processor:
#     - bulk_indexing:
#         elasticsearch: "prod"
#         max_worker_size: 10
#         bulk.compress: true
#         bulk_size_in_mb: 50  #in MB
#         retry_delay_in_seconds: 5
#         queues: #filter by labels
#           type: bulk_reshuffle
#         when:
#           cluster_available: [ "prod" ]

#floating_ip:
#  enabled: false
#  ip: 192.168.3.234      #yep, it's optional, infini-gateway could detect one but maybe not the right one
##  netmask: 255.255.255.0 #optional
##  interface: en1         #optional

#statsd:
#  enabled: true
#  host: 127.0.0.1
#  port: 8125
#  protocol: udp
#  namespace: gateway.
#  buffer_size: 102400

#redis:
#  enabled: true
#  host: localhost
#  port: 6379
#
#queue:
#  - name: dev-node-yNgVusnXSgqvP2fZeGFSLw
#    type: redis
#
#disk_queue:
#  upload_to_s3: true
#  s3:
#    server: my_blob_store
#    location: cn-beijing-001
#    bucket: infini-store
##  max_used_bytes: 102400
##  warning_free_bytes: 322122547200
##  reserved_free_bytes: 322122547200
##  max_bytes_per_file: 1048576
#
#
#s3:
#  my_blob_store:
#    endpoint: "192.168.3.63:9000"
#    access_key: "minio"
#    access_secret: "gogoaminio"
##    token: "XXXX"
##    ssl: true
#
#
#elastic:
#  elasticsearch: dev
#  enabled: true
#  remote_configs: true
#  health_check:
#    enabled: true
#  availability_check:
#    enabled: true
#  metadata_refresh:
#    enabled: true
#  store:
#    enabled: true
#  orm:
#    enabled: true
#    init_template: true
#    template_name: ".infini"
#    index_prefix: ".infini_"
```

medcl commented 2 years ago

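Please post the logs from your server side and from the gateway. Could it be that the server was never tuned and has hit the open-file limit of 4096?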
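
One way to check this on Linux is to compare a process's open file descriptors with its "Max open files" limit from `/proc`. The sketch below is only an illustration: the helper names are made up here, the sample text mimics the layout of `/proc/<pid>/limits`, and the gateway's PID has to be supplied by whoever runs it.

```python
import os

def parse_max_open_files(limits_text):
    """Extract (soft, hard) "Max open files" values from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns after the three-word name: soft limit, hard limit, units
            soft, hard = line.split()[3:5]
            return soft, hard
    return None

def fd_usage(pid):
    """Return (open_fd_count, (soft, hard)) for a process. Linux only."""
    with open(f"/proc/{pid}/limits") as f:
        limits = parse_max_open_files(f.read())
    return len(os.listdir(f"/proc/{pid}/fd")), limits

# Sample text in the layout of /proc/<pid>/limits (values are illustrative).
sample_limits = """\
Limit                     Soft Limit           Hard Limit           Units
Max open files            4096                 4096                 files
"""
print(parse_max_open_files(sample_limits))
```

Calling `fd_usage(<gateway pid>)` as a user allowed to read that process's `/proc` entries shows how close the open descriptor count is to the soft limit.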

hebo1982 commented 2 years ago

@medcl

The maximum number of open files is as shown in the screenshot.

Logs from the incident:

[08-18 06:11:10] [INF] [entry.go:331] entry [es_gateway] listen at: http://0.0.0.0:9528
[08-18 06:11:10] [INF] [module.go:116] all modules are started
[08-19 03:45:21] [INF] [app.go:163] initializing gateway.
[08-19 03:45:21] [INF] [app.go:164] using config: /opt/servers/gateway-1.6.0_SNAPSHOT/gateway.yml.
[08-19 03:45:21] [INF] [instance.go:72] workspace: /opt/servers/gateway-1.6.0_SNAPSHOT/data/gateway/nodes/cbpl6k15d1c7p5eqp7dg
[08-19 03:45:21] [INF] [app.go:272] gateway is up and running now.
[08-19 03:45:21] [INF] [api.go:261] api listen at: http://0.0.0.0:2900
[08-19 03:45:21] [INF] [actions.go:367] elasticsearch [prod] is available
[08-19 03:45:21] [INF] [reverseproxy.go:261] elasticsearch [prod] hosts: [] => [10.0.6.146:9527, 10.0.6.167:9527, 10.0.6.148:9527]
[08-19 03:45:21] [INF] [entry.go:331] entry [es_gateway] listen at: http://0.0.0.0:9528
[08-19 03:45:21] [INF] [module.go:116] all modules are started
[08-19 10:50:12] [ERR] [server.go:2301] error in serveConn, runtime error: invalid memory address or nil pointer dereference,runtime error: invalid memory address or nil pointer dereference
[08-19 10:50:45] [WRN] [reverseproxy.go:545] failed to proxy request: /.kibana_task_manager/_update_by_query?ignore_unavailable=true&refresh=true&max_docs=10&conflicts=proceed to host 10.0.6.167:9527, 0, retried: #0, error:timeout
[08-19 10:50:45] [WRN] [reverseproxy.go:545] failed to proxy request: /.reporting-%2A/_search to host 10.0.6.148:9527, 0, retried: #0, error:timeout
[08-19 10:51:13] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.146:9527, 0, retried: #0, error:timeout
[08-19 10:51:13] [WRN] [reverseproxy.go:545] failed to proxy request: 

(Several timeout log entries omitted here.)

[08-19 22:51:56] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.148:9527, 0, retried: #0, error:timeout
[08-19 22:51:56] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.146:9527, 0, retried: #0, error:timeout
[08-19 22:51:59] [WRN] [reverseproxy.go:545] failed to proxy request: /_nodes?filter_path=nodes.%2A.version%2Cnodes.%2A.http.publish_address%2Cnodes.%2A.ip to host 10.0.6.167:9527, 0, retried: #0, error:timeout
[08-19 22:51:59] [WRN] [reverseproxy.go:545] failed to proxy request: /.kibana/_search?size=20&from=0&rest_total_hits_as_int=true to host 10.0.6.148:9527, 0, retried: #0, error:timeout
[08-19 22:52:18] [WRN] [reverseproxy.go:545] failed to proxy request: /.kibana_task_manager/_update_by_query?ignore_unavailable=true&refresh=true&max_docs=10&conflicts=proceed to host 10.0.6.146:9527, 0, retried: #0, error:timeout
[08-19 22:52:26] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.167:9527, 0, retried: #0, error:timeout
[08-19 22:52:26] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.148:9527, 0, retried: #0, error:timeout
[08-19 22:52:29] [WRN] [reverseproxy.go:545] failed to proxy request: /_nodes?filter_path=nodes.%2A.version%2Cnodes.%2A.http.publish_address%2Cnodes.%2A.ip to host 10.0.6.146:9527, 0, retried: #0, error:timeout
[08-19 22:52:48] [WRN] [reverseproxy.go:545] failed to proxy request: /.reporting-%2A/_search to host 10.0.6.167:9527, 0, retried: #0, error:timeout
[08-19 22:52:48] [WRN] [reverseproxy.go:545] failed to proxy request: /.kibana_task_manager/_update_by_query?ignore_unavailable=true&refresh=true&max_docs=10&conflicts=proceed to host 10.0.6.148:9527, 0, retried: #0, error:timeout
[08-19 22:52:56] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.167:9527, 0, retried: #0, error:timeout
[08-19 22:52:56] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.146:9527, 0, retried: #0, error:timeout
[08-19 22:52:59] [WRN] [reverseproxy.go:545] failed to proxy request: /_nodes?filter_path=nodes.%2A.version%2Cnodes.%2A.http.publish_address%2Cnodes.%2A.ip to host 10.0.6.148:9527, 0, retried: #0, error:timeout
[08-19 22:52:59] [WRN] [reverseproxy.go:545] failed to proxy request: /.kibana/_search?size=20&from=0&rest_total_hits_as_int=true to host 10.0.6.146:9527, 0, retried: #0, error:timeout
[08-19 22:53:18] [WRN] [reverseproxy.go:545] failed to proxy request: /.kibana_task_manager/_update_by_query?ignore_unavailable=true&refresh=true&max_docs=10&conflicts=proceed to host 10.0.6.167:9527, 0, retried: #0, error:timeout
[08-19 22:53:26] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.146:9527, 0, retried: #0, error:timeout
[08-19 22:53:26] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.148:9527, 0, retried: #0, error:timeout
[08-19 22:53:29] [WRN] [reverseproxy.go:545] failed to proxy request: /_nodes?filter_path=nodes.%2A.version%2Cnodes.%2A.http.publish_address%2Cnodes.%2A.ip to host 10.0.6.167:9527, 0, retried: #0, error:timeout
[08-19 22:53:39] [INF] [module.go:145] all modules are stopped
[08-19 22:53:39] [INF] [app.go:256] gateway now terminated.
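
To see whether the timeouts cluster on one upstream node or one request path, the `failed to proxy request` warnings can be tallied. This is a minimal sketch whose regular expression is written against the line format visible in the log above; other gateway versions may log differently, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the WRN lines in the log above (an assumption about
# the format, not taken from the gateway source).
PROXY_FAIL = re.compile(
    r"failed to proxy request: (?P<uri>\S+) to host (?P<host>[^,\s]+), "
    r".*error:(?P<error>\S+)"
)

def summarize_proxy_failures(log_text):
    """Return (failures per host, failures per request path)."""
    by_host = Counter()
    by_path = Counter()
    for line in log_text.splitlines():
        m = PROXY_FAIL.search(line)
        if not m:
            continue  # other log lines, including truncated ones, are skipped
        by_host[m["host"]] += 1
        by_path[m["uri"].split("?", 1)[0]] += 1  # drop the query string
    return by_host, by_path

sample_log = """\
[08-19 22:51:56] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.148:9527, 0, retried: #0, error:timeout
[08-19 22:51:56] [WRN] [reverseproxy.go:545] failed to proxy request: /_xpack to host 10.0.6.146:9527, 0, retried: #0, error:timeout
[08-19 22:51:59] [WRN] [reverseproxy.go:545] failed to proxy request: /.kibana/_search?size=20&from=0&rest_total_hits_as_int=true to host 10.0.6.148:9527, 0, retried: #0, error:timeout
"""
hosts, paths = summarize_proxy_failures(sample_log)
print(hosts.most_common())
print(paths.most_common())
```

In the excerpt above the timeouts hit all three hosts, which points more towards the gateway side than towards a single bad ES node.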

medcl commented 2 years ago

Are there any abnormal entries in the system's dmesg output?

medcl commented 2 years ago

This log line reports a nil pointer dereference. You can add the -debug flag at startup to see the full stack trace of the error.

medcl commented 2 years ago

Also, which Gateway version are you running?

hebo1982 commented 2 years ago

> Also, which Gateway version are you running?

gateway 1.7.0_SNAPSHOT 705 2022-08-16 03:13:43 +0000 UTC 2023-12-31 10:10:10 +0000 UTC 4b2bdc84d5c1d9882493adfe43486d8dfaab68bc

I'll add -debug.

medcl commented 2 years ago

Take a look at the Kibana server's logs too, and check whether the file handle limit has been tuned there.

medcl commented 2 years ago

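I see you are using reshuffle without a consumer pipeline enabled, which is not correct. Either enable the consumer pipeline, or use the configuration below.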

path.data: data
path.logs: log
#log.level: trace
#log.debug: true

#api:
#  enabled: true
#  network:
#    binding: 127.0.0.1:2900

entry:
  - name: es_gateway
    enabled: true
    router: default_router
    max_concurrency: 10000
    network:
      binding: 0.0.0.0:9528
#    tls:
#      enabled: true

flow:
  - name: default_flow
    filter:
#      - context_regex_replace:
#          context: "_ctx.request.uri"
#          pattern: "_from=[^&]*&?"
#          to: ""
      - date_range_precision_tuning:
          time_precision: 6
      - get_cache:
#      - bulk_reshuffle:
#          when:
#            contains:
#              _ctx.request.path: /_bulk
#          elasticsearch: prod
#          level: node
#          fix_null_id: true
      - elasticsearch:
          elasticsearch: prod  #elasticsearch configure reference name
          max_connection_per_node: 1000 #max tcp connection to upstream, default for all nodes
          max_response_size: -1 #default for all nodes
          balancer: weight
          refresh: # refresh upstream nodes list, need to enable this feature to use elasticsearch nodes auto discovery
            enabled: true
            interval: 60s
      - set_cache:
          cache_type: ristretto
          min_response_size: 100
          max_response_size: 1024000
          cache_ttl: 30s
          max_cache_items: 100000
      # - elasticsearch_health_check:
      #     elasticsearch: prod
  - name: logging # this flow is used for request logging, refer to `router`'s `tracing_flow`
    filter:
      - stats:
      - logging:
          queue_name: request_logging
          max_request_body_size: 1024
          max_response_body_size: 1024
#          when: #>1s or none-200 requests will be logged
#            or:
#              - not:
#                  or:
#                    - equals:
#                        _ctx.request.path: "/favicon.ico"
#                    - equals:
#                        _ctx.response.status: 200
#                    - in:
#                        _ctx.request.path: ["/sw.js"]
#              - range:
#                  _ctx.elapsed.gte: 1000
router:
  - name: default_router
    default_flow: default_flow
    tracing_flow: logging

elasticsearch:
- name: prod
  enabled: true
  schema: http
  hosts:
    - 192.168.3.188:9298
    - 10.0.6.146:9527
    - 10.0.6.148:9527
    - 10.0.6.167:9527
  traffic_control: #global traffic control
    max_bps_per_node: 209715200 #max total bytes send to es per node, 200MB/s
    max_qps_per_node: 20000 #max total requests send to es per node, 20k/s
  basic_auth: #used to discovery full cluster nodes, or check elasticsearch's health and versions
    username: "xxxxx"
    password: "xxxxx"
  discovery: # auto discovery elasticsearch cluster nodes
    enabled: true
    refresh:
      enabled: true
      interval: 60s

pipeline:
-  name: request_logging_index
   auto_start: true
   keep_running: true
   processor:
     - json_indexing:
         index_name: "gateway_requests"
         elasticsearch: "prod"
         input_queue: "request_logging"
         idle_timeout_in_seconds: 1
         worker_size: 1
         bulk_size_in_mb: 10 #in MB
         when:
           cluster_available: [ "prod" ]
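
The point about reshuffle needing a consumer can be checked mechanically. The sketch below scans a gateway.yml as plain text and ignores commented-out lines. It assumes that `bulk_indexing` is the processor that consumes the `bulk_reshuffle` queues, as the commented-out `bulk_request_ingest` pipeline in the original configuration suggests. The function names are made up for illustration and this is not a full YAML parse.

```python
import re

def active_lines(config_text):
    """Yield config lines that are neither blank nor fully commented out."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            yield line

def uses(config_text, name):
    """True if an active list item `- <name>:` appears in the config."""
    pattern = re.compile(rf"^\s*-\s*{re.escape(name)}:")
    return any(pattern.match(line) for line in active_lines(config_text))

def reshuffle_without_consumer(config_text):
    """True if bulk_reshuffle is active but no bulk_indexing processor is."""
    return uses(config_text, "bulk_reshuffle") and not uses(config_text, "bulk_indexing")

# Abbreviated fragments modelled on the two configurations in this thread.
original = """\
flow:
  - name: default_flow
    filter:
      - bulk_reshuffle:
          elasticsearch: prod
# pipeline:
# - name: bulk_request_ingest
#   processor:
#     - bulk_indexing:
#         elasticsearch: "prod"
"""
suggested = """\
flow:
  - name: default_flow
    filter:
#      - bulk_reshuffle:
#          elasticsearch: prod
      - elasticsearch:
          elasticsearch: prod
"""
print(reshuffle_without_consumer(original))
print(reshuffle_without_consumer(suggested))
```

The first fragment is flagged because the filter is active while its consumer is commented out; the second is not, because the filter itself is disabled.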

hebo1982 commented 2 years ago

@medcl I'll try this configuration.

hebo1982 commented 2 years ago

After restarting, the problem has not recurred. I'll keep monitoring.

infinilabs / gateway

Connections not closed properly #28