m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform
https://m3db.io/
Apache License 2.0

M3DB unable to mark namespace as ready without force params. #3649

Open yuandongjian opened 3 years ago

yuandongjian commented 3 years ago

M3DB unable to mark namespace as ready without force params.

curl -X POST http://localhost:7201/api/v1/services/m3db/namespace/ready -d '{
  "name": "default"
}'
{"status":"error","error":"could not find db session for namespace: default"}
curl -X POST http://localhost:7201/api/v1/services/m3db/namespace/ready -d '{
  "name": "default",
  "force": true
}'
{"ready":true}

m3db version: 1.1.0

m3coordinator.yml

listenAddress: 0.0.0.0:7201
logging:
  level: info
clusters:
  - namespaces:
      - namespace: default
        retention: 720h
        type: unaggregated
    client:
      config:
        service:
            env: default_env
            zone: embedded
            service: m3db
            etcdClusters:
                - zone: embedded
                  endpoints:
                     - x1:2379
                     - x2:2379
                     - x3:2379
      writeConsistencyLevel: majority
      readConsistencyLevel: unstrict_majority

m3dbnode.yml

db:
  logging:
    level: info

  metrics:
    prometheus:
      handlerPath: /metrics
      listenAddress: 0.0.0.0:7204
    sanitization: prometheus
    samplingRate: 1.0
    extended: detailed

  listenAddress: 0.0.0.0:9000
  clusterListenAddress: 0.0.0.0:9001
  httpNodeListenAddress: 0.0.0.0:9002
  httpClusterListenAddress: 0.0.0.0:9003
  debugListenAddress: 0.0.0.0:9004

  hostID:
    resolver: config
    value: m3db168

  client:
    writeConsistencyLevel: majority
    readConsistencyLevel: unstrict_majority
    writeTimeout: 10s
    fetchTimeout: 15s
    connectTimeout: 20s
    writeRetry:
        initialBackoff: 500ms
        backoffFactor: 3
        maxRetries: 2
        jitter: true
    fetchRetry:
        initialBackoff: 500ms
        backoffFactor: 2
        maxRetries: 3
        jitter: true
    backgroundHealthCheckFailLimit: 4
    backgroundHealthCheckFailThrottleFactor: 0.5

  writeNewSeriesAsync: true
  writeNewSeriesBackoffDuration: 2ms

  filesystem:
    filePathPrefix: /opt/work/m3db

  discovery:
    config:
        service:
            env: default_env
            zone: embedded
            service: m3db
            # etcd cluster configuration
            cacheDir: /opt/work/m3db/data
            etcdClusters:
                - zone: embedded
                  endpoints:
                     - x1:2379
                     - x2:2379
                     - x3:2379

wesleyk commented 3 years ago

@crazy-pizza that error comes up if the coordinator cannot connect to m3db. Are you able to confirm it's able to connect?

Also, the etcd cluster endpoints you have listed seem potentially suspect. Are your services connecting healthily to the underlying etcd cluster?
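
One way to check the connectivity asked about here is to probe the HTTP endpoints of the coordinator and the DB node. The following is a minimal sketch, not an official procedure: the base URLs are assumptions taken from the configs in this issue (coordinator on port 7201, m3dbnode `httpNodeListenAddress` on port 9002), and `check_urls` and `probe_all` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only. Base URLs are assumptions derived from the configs in this issue:
# coordinator listenAddress :7201, m3dbnode httpNodeListenAddress :9002.
COORDINATOR="${COORDINATOR:-http://localhost:7201}"
DBNODE="${DBNODE:-http://localhost:9002}"

# Print one URL per line so the list can be inspected or piped elsewhere.
check_urls() {
  printf '%s\n' \
    "$COORDINATOR/health" \
    "$COORDINATOR/api/v1/services/m3db/placement" \
    "$COORDINATOR/api/v1/services/m3db/namespace" \
    "$DBNODE/health"
}

# Probe each URL and print its HTTP status code next to it.
# Needs a running cluster; curl reports 000 when it cannot connect.
probe_all() {
  check_urls | while read -r url; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    echo "$code $url"
  done
}
```

Running `probe_all` on the coordinator host prints one status line per endpoint. For the etcd side, `etcdctl --endpoints=$ENDPOINTS endpoint health` reports whether each listed endpoint is reachable.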

yuandongjian commented 3 years ago

@wesleyk It seems that the coordinator can connect to M3DB normally. After the namespace is marked ready with the force param, the cluster can read and write data normally. Everything seems to work fine, except that the namespace has to be forced.

It seems that M3DB has connected to the etcd cluster, and there is no error in the M3DB log. What's wrong with the etcd cluster endpoints?

Here is the data in etcd:

[root@ etcd]# ./etcdctl --endpoints=$ENDPOINTS get  --prefix  "" --keys-only=true
/placement/default_env/m3aggregator

/placement/namespace/m3db-cluster-name/m3aggregator

/topic/namespace/m3db-cluster-name/aggregated_metrics

/topic/namespace/m3db-cluster-name/aggregated_metrics2

/topic/namespace/m3db-cluster-name/aggregator_ingest

_kv/default_env/m3db.node.namespaces

_kv/namespace/m3db-cluster-name/shardset/1/flush

_sd.placement/default_env/m3db

_sd.placement/namespace/m3db-cluster-name/m3coordinator
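
The keys above appear under two different environment segments, `default_env` and `namespace/m3db-cluster-name`. To see which keys sit under the environment that the configs in this issue use, a small filter can help. This is only a sketch: it assumes keys are shaped like `<prefix>/<environment>/<rest>`, which matches this listing but is not confirmed here as a general rule, and `keys_for_env` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the etcd keys (read from stdin) that sit under the given environment.
# Assumption: keys look like <prefix>/<environment>/<rest>, optionally with a
# leading slash. The environment is used as a regex, so it should not contain
# regex metacharacters.
keys_for_env() {
  grep -E "^/?[^/]+/$1/"
}
```

For example, `./etcdctl --endpoints=$ENDPOINTS get --prefix "" --keys-only=true | keys_for_env default_env` would list only the keys under `default_env`.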

wesleyk commented 3 years ago

@crazy-pizza sorry for the delay. At this point it'll be harder to debug, though if you're able to reproduce from fresh, then providing coordinator and db logs would be helpful.