matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: config proxy cache connection mo hung #20022

Open heni02 opened 2 weeks ago

heni02 commented 2 weeks ago

Is there an existing issue for the same bug?

Branch Name

2.0-dev

Commit ID

3ecc49ab9

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

Test scenario: on TKE, the proxy is configured with conn-cache-enabled = true and the test tool is configured to use short-lived connections. After a 100-concurrency point-select test runs for a while, clients can no longer log in to MO.

[WeCom screenshot]

TKE YAML proxy configuration:

[WeCom screenshot]

mo-load test tool configured for short-lived connections (see the reproduction steps for details):

[WeCom screenshot]

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22pGq%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-branch-nightly-2e5ddb165-20241109%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731404462204%22,%22to%22:%221731411550749%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

1. create database sysbench_db;
2. Use the mo-sysbench tool to load 1,000,000 rows of point-select data:
sysbench --mysql-host=127.0.0.1 --mysql-port=6001 --mysql-user=dump --mysql-password=111 oltp_point_select.lua --mysql-db=sysbench_db --tables=10 --table_size=1000000 --threads=100 --time=30 --report-interval=10 prepare
3. Run the 100-concurrency point-select load with mo-load (a minimal Go sketch of this pattern appears after these steps):
a. In mo.yml, configure short-lived connections: con_mode="short"
b. Configure cases/sysbench/point_select_10_1000000_prepare/run.yml:
#stdout=console, write progress data to the console and to a file
duration: 10 #total execution time of all transactions, in minutes
transaction:
- name: "point_select_10_1000000_prepare"  #transaction name
  vuser: 1000   #number of concurrent users running this transaction
  mode: 0 #execution mode: 0 runs the script's SQL sequentially, 1 wraps the script's SQL in a single database transaction
  prepared: "true"   #whether to prepare the script's SQL; if true,
  #the transaction's SQL statements; multiple entries are allowed
  script:
    - sql: "select k from sysbench_db.sbtest{tbx} where id = ?;"
      paras: INT({id})
c. Run the concurrent test: ./start.sh -h 172.16.105.206 -c cases/sysbench/point_select_10_1000000_prepare
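For readers who do not have mo-load at hand, below is a minimal Go sketch of the same load pattern, assuming the go-sql-driver/mysql driver. The host, port, credentials, and point-select SQL are taken from the steps above; opening and closing a connection per query stands in for mo-load's con_mode="short". This is an illustration, not the mo-load tool itself.

// short_conn_load.go: a minimal sketch of the short-connection point-select load.
// It only mimics the pattern that triggers the bug: every query opens a fresh
// connection and closes it, so the proxy handles constant connection churn.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"math/rand"
	"sync"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Address and credentials taken from the reproduction steps.
	dsn := "dump:111@tcp(127.0.0.1:6001)/sysbench_db"
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := 0; n < 1000; n++ {
				// Short-connection mode: a new connection per query.
				db, err := sql.Open("mysql", dsn)
				if err != nil {
					log.Println(err)
					continue
				}
				table := rand.Intn(10) + 1     // sbtest1..sbtest10
				id := rand.Intn(1000000) + 1   // table_size=1000000
				var k int64
				q := fmt.Sprintf("select k from sysbench_db.sbtest%d where id = ?", table)
				if err := db.QueryRow(q, id).Scan(&k); err != nil {
					log.Println(err)
				}
				db.Close() // tear the connection down immediately after the query
			}
		}()
	}
	wg.Wait()
}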

Additional information

No response

heni02 commented 1 week ago

Regression verification on main commit f22612c: multiple CNs panicked and restarted.

[WeCom screenshots]

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22n3z%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-ben-nightly-48c7e1698-20241103%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731920503533%22,%22to%22:%221731921330341%22%7D%7D%7D&schemaVersion=1&orgId=1

volgariver6 commented 1 week ago

fixed

heni02 commented 1 week ago

On 2.0-dev commit ff4db5805, the CN still has a panic error.

[WeCom screenshot]

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3b45c0f]

goroutine 12517 gp=0xc00f46ae00 m=0 mp=0x851b900 [running]:
panic({0x4706080?, 0x831fc30?})
	/usr/local/go/src/runtime/panic.go:804 +0x168 fp=0xc040e06a68 sp=0xc040e069b8 pc=0x47bc88
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:262
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:900 +0x359 fp=0xc040e06ac8 sp=0xc040e06a68 pc=0x47e339
github.com/matrixorigin/matrixone/pkg/frontend.(*TxnHandler).GetServerStatus(0x57dc478?)
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/txn.go:751 +0x2f fp=0xc040e06b08 sp=0xc040e06ac8 pc=0x3b45c0f
github.com/matrixorigin/matrixone/pkg/frontend.ExecRequest.func1()
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3100 +0x173 fp=0xc040e06ba0 sp=0xc040e06b08 pc=0x3ab6fd3
panic({0x4706080?, 0x831fc30?})
	/usr/local/go/src/runtime/panic.go:785 +0x132 fp=0xc040e06c50 sp=0xc040e06ba0 pc=0x47bc52
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:262
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:900 +0x359 fp=0xc040e06cb0 sp=0xc040e06c50 pc=0x47e339
github.com/matrixorigin/matrixone/pkg/vm/process.(*Process).ReplaceTopCtx(...)
	/go/src/github.com/matrixorigin/matrixone/pkg/vm/process/process2.go:158
github.com/matrixorigin/matrixone/pkg/frontend.doComQuery(0xc023677208, 0xc0286dcc80, 0xc0416fae00)
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:2861 +0x49e fp=0xc040e07238 sp=0xc040e06cb0 pc=0x3ab29fe
github.com/matrixorigin/matrixone/pkg/frontend.ExecRequest(0xc023677208, 0xc0286dcc80, 0xc040e07b88)
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3127 +0x7a5 fp=0xc040e075b8 sp=0xc040e07238 pc=0x3ab5245
github.com/matrixorigin/matrixone/pkg/frontend.(*Routine).handleRequest(0xc02a1f8600, 0xc040e07b88)
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/routine.go:298 +0x61d fp=0xc040e07a28 sp=0xc040e075b8 pc=0x3aff31d
github.com/matrixorigin/matrixone/pkg/frontend.(*RoutineManager).Handler(0xc00104a780, 0xc012e10000, {0xc044bcc000, 0x7ffa5, 0x7ffa5})
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/routine_manager.go:385 +0x327 fp=0xc040e07c40 sp=0xc040e07a28 pc=0x3b048c7
github.com/matrixorigin/matrixone/pkg/frontend.(*MOServer).handleRequest(0xc00ee3a320, 0xc012e10000)
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/server.go:516 +0x1eb fp=0xc040e07d10 sp=0xc040e07c40 pc=0x3b11beb
github.com/matrixorigin/matrixone/pkg/frontend.(*MOServer).handleMessage(0xc00ee3a320, {0x57dc4b0, 0xc000b99680}, 0xc012e10000)
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/server.go:484 +0x94 fp=0xc040e07de8 sp=0xc040e07d10 pc=0x3b11854
github.com/matrixorigin/matrixone/pkg/frontend.(*MOServer).handleLoop(0xc00ee3a320?, {0x57dc4b0?, 0xc000b99680?}, 0xc000ba5380?)
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/server.go:212 +0x2f fp=0xc040e07ea8 sp=0xc040e07de8 pc=0x3b0e22f
github.com/matrixorigin/matrixone/pkg/frontend.(*MOServer).handleConn(0xc00ee3a320, {0x57dc4b0, 0xc000b99680}, {0x5822a98?, 0xc007712c98?})
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/server.go:208 +0x4a6 fp=0xc040e07fa8 sp=0xc040e07ea8 pc=0x3b0e006
github.com/matrixorigin/matrixone/pkg/frontend.(*MOServer).startAccept.gowrap2()
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/server.go:177 +0x30 fp=0xc040e07fe0 sp=0xc040e07fa8 pc=0x3b0dad0
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc040e07fe8 sp=0xc040e07fe0 pc=0x484d41
created by github.com/matrixorigin/matrixone/pkg/frontend.(*MOServer).startAccept in goroutine 917
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/server.go:177 +0x165
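For context, a tiny hypothetical sketch of the failure pattern the trace points at: calling a method on a nil pointer receiver and touching a field inside it produces exactly this "invalid memory address or nil pointer dereference" SIGSEGV. The type and field names below are invented for illustration only and are not the actual frontend code or the confirmed root cause.

// nil_receiver_demo.go: hypothetical illustration of the crash pattern seen above.
// A getter invoked on a nil receiver panics the moment it dereferences a field.
package main

// txnHandler is a stand-in type; the real handler lives in pkg/frontend.
type txnHandler struct {
	serverStatus uint16 // hypothetical field, for illustration only
}

// getServerStatus mirrors the shape of a getter called on a possibly-nil handler.
func (h *txnHandler) getServerStatus() uint16 {
	return h.serverStatus // SIGSEGV here when h == nil
}

func main() {
	// e.g. a handler already torn down while a request from a cached
	// connection is still in flight.
	var h *txnHandler
	_ = h.getServerStatus()
}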

panic log: panic.log

heni02 commented 1 week ago

Cluster YAML file:

apiVersion: core.matrixorigin.io/v1alpha1
kind: MatrixOneCluster
metadata:
  name: nightly-regression-dis
  namespace: mo-ben-nightly-48c7e1698-20241103
spec:
  semanticVersion: 1.3.0
  dn:
    exportToPrometheus: true
    nodeSelector:
      tke.matrixorigin.io/mo-nightly-regression: "true"
    overlay:
      initContainers:
        - image: ccr.ccs.tencentyun.com/matrixone-dev/matrixone:nightly-ff4db5805
          command:
            - sh
            - -c
            - |
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.ipv4.tcp_fin_timeout=30
          imagePullPolicy: Always
          name: setsysctl
          terminationMessagePolicy: File
          securityContext:
            capabilities:
              add: ["NET_ADMIN","SYS_ADMIN"]  
      podAnnotations:
        profiles.grafana.com/memory.scrape: "true"
        profiles.grafana.com/memory.port: "6060"
        profiles.grafana.com/cpu.scrape: "true"
        profiles.grafana.com/cpu.port: "6060"
      imagePullSecrets:
        - name: tke-registry
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/local-pv
          operator: Exists
      env:
      - name: GOMEMLIMIT
        value: "35000MiB"
      - name: GOTRACEBACK
        value: crash
      - name: GOGC
        value: "200"
      shareProcessNamespace: true   
    cacheVolume:
      size: 50Gi
      storageClassName: directpv-min-io
    sharedStorageCache:
      memoryCacheSize: 5Gi
      diskCacheSize: 50Gi
    config: |
      [dn.Txn.Storage]
      backend = "TAE"
      log-backend = "logservice"
      [log]
      level = "info"
      format = "json"
      max-size = 512
      [dn.Ckp]
      flush-interval = "60s"
      min-count = 100
      scan-interval = "5s"
      incremental-interval = "60s"
      global-interval = "100000s"
      [dn.LogtailServer]
      rpc-max-message-size = "16KiB"
      rpc-payload-copy-buffer-size = "16KiB"
      rpc-enable-checksum = true
      logtail-collect-interval = "2ms"
      logtail-response-send-timeout = "10s"
      max-logtail-fetch-failure = 5
      [observability]
      metricUpdateStorageUsageInterval = "15m"
      enableStmtMerge = true
      enableMetricToProm = true
      [dn.GCCfg]
      disable-gc = true
      [dn.rpc]
      max-message-size = "1000M"
    replicas: 1
    resources:
      requests:
        cpu: 14
        memory: 55Gi
      limits:
        cpu: 14
        memory: 55Gi
  imageRepository: ccr.ccs.tencentyun.com/matrixone-dev/matrixone
  imagePullPolicy: IfNotPresent
  logService:
    exportToPrometheus: true
    nodeSelector:
      tke.matrixorigin.io/mo-nightly-regression-log: "true"
    overlay:
      podAnnotations:
        profiles.grafana.com/memory.scrape: "true"
        profiles.grafana.com/memory.port: "6060"
        profiles.grafana.com/cpu.scrape: "true"
        profiles.grafana.com/cpu.port: "6060"
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/mo-nightly-regression-log
        operator: Exists    
      imagePullSecrets:
      - name: tke-registry
      env:
      - name: GOTRACEBACK
        value: crash 
      shareProcessNamespace: true
    replicas: 3
    resources:
      requests:
        cpu: 2
        memory: 12Gi
      limits:
        cpu: 3
        memory: 14Gi  
    sharedStorage:
      s3:
        endpoint: https://cos.ap-guangzhou.myqcloud.com
        region: ap-guangzhou
        path: mo-nightly-gz-1308875761/mo-benchmark-1148034539
        s3RetentionPolicy: Delete
        secretRef:
          name: tke-regression
    pvcRetentionPolicy: Delete
    volume:
      size: 100Gi
      storageClassName: cbs-hssd
    config: |
      [log]
      level = "info"
      format = "json"
      max-size = 512
      [observability]
      metricUpdateStorageUsageInterval = "15m"
      enableStmtMerge = true
      enableMetricToProm = true
  tp:
    exportToPrometheus: true
    nodeSelector:
      tke.matrixorigin.io/mo-nightly-regression: "true"
    overlay:
      initContainers:
        - image: ccr.ccs.tencentyun.com/matrixone-dev/matrixone:nightly-ff4db5805
          command:
            - sh
            - -c
            - |
              apt update -y;
              apt install -y iptables conntrack;
              iptables -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT;
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.ipv4.tcp_fin_timeout=30
          imagePullPolicy: Always
          name: enable-conntrack
          terminationMessagePolicy: File
          securityContext:
            capabilities:
              add: ["NET_ADMIN","SYS_ADMIN"]
      mainContainerSecurityContext:
        capabilities:
          add: ["NET_ADMIN","NET_RAW"]        
      podAnnotations:
        profiles.grafana.com/memory.scrape: "true"
        profiles.grafana.com/memory.port: "6060"
        profiles.grafana.com/cpu.scrape: "true"
        profiles.grafana.com/cpu.port: "6060"
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/local-pv
          operator: Exists 
      imagePullSecrets:
        - name: tke-registry
      env:
      - name: GOMEMLIMIT
        value: "25000MiB"
      - name: GOTRACEBACK
        value: crash
      - name: GOGC
        value: "200"
      - name: GODEBUG
        value: madvdontneed=1,gctrace=2
      args:
      - -profile-interval=30s
      - -debug-http=0.0.0.0:6060
      shareProcessNamespace: true  
    cacheVolume:
      size: 3000Gi
      storageClassName: directpv-min-io
    sharedStorageCache:
      memoryCacheSize: 12Gi
      diskCacheSize: 3000Gi
    config: |
      [cn.Engine]
      type = "distributed-tae"
      [log]
      level = "info"
      format = "json"
      max-size = 512
      [cn]
      turn-on-push-model = true
      [cn.txn]
      enable-sacrificing-freshness = 1
      enable-cn-based-consistency = 0
      enable-leak-check = 1
      max-active-ages = "20m"
      [observability]
      metricUpdateStorageUsageInterval = "15m"
      enableStmtMerge = true
      enableMetricToProm = true
      [cn.txn.trace]
      load-to-s3 = true
      flush-bytes = "256MB"
      force-flush-duration = "300s"
      [cn.rpc]
      max-message-size = "1000M"
    replicas: 3
    resources:
      requests:
        cpu: 14
        memory: 55Gi
      limits:
        cpu: 14
        memory: 55Gi
  proxy:
    replicas: 2
    nodeSelector:
      tke.matrixorigin.io/mo-nightly-regression-proxy: "true"
    overlay:
      initContainers:
        - image: ccr.ccs.tencentyun.com/matrixone-dev/matrixone:nightly-ff4db5805
          command:
            - sh
            - -c
            - |
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.ipv4.tcp_fin_timeout=30
          imagePullPolicy: Always
          name: setsysctl
          terminationMessagePolicy: File
          securityContext:
            capabilities:
              add: ["NET_ADMIN","SYS_ADMIN"]
      podAnnotations:
        profiles.grafana.com/memory.scrape: "true"
        profiles.grafana.com/memory.port: "6060"
        profiles.grafana.com/cpu.scrape: "true"
        profiles.grafana.com/cpu.port: "6060"
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/mo-nightly-regression-proxy
          operator: Exists
      imagePullSecrets:
        - name: tke-registry
    resources:
      # requests are the requested resources, this will also be used to schedule the LogService Pod
      requests:
        cpu: 3
        memory: 6Gi
      # limits are the resource limitation of the Pod
      limits:
        cpu: 3
        memory: 6Gi
    config: |
      # TOML format config file below
      [log]
      level="info"
      [proxy]
      conn-cache-enabled = true 
  version: nightly-ff4db5805
aressu1985 commented 4 days ago

Even after the issue is fixed, extensive testing is still required. The plan for 2.0.1 is to keep this capability disabled and delay it to the next release for testing.

volgariver6 commented 1 day ago

in progress