thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

prometheusMetrics segmentsDone is less than segmentsTotal by 1 & no metric updated #1051

Open shaotingcheng opened 3 years ago

shaotingcheng commented 3 years ago


I have two issues with the prometheusMetrics output, using Cassandra 3.11.3 and cassandra-reaper 2.2.3:

  1. segmentsDone is not updated when the last segment finishes, so it stays one below segmentsTotal.
  2. No new metrics appear for a new repair.

My environment

# See a bit more complete example in:
# src/server/src/test/resources/cassandra-reaper.yaml
segmentCount: 200
repairParallelism: PARALLEL
repairIntensity: 0.9
scheduleDaysBetween: 7
repairRunThreadCount: 15
hangingRepairTimeoutMins: 60
storageType: cassandra
enableCrossOrigin: true
incrementalRepair: false
enableDynamicSeedList: true
repairManagerSchedulingIntervalSeconds: 10
activateQueryLogger: false
jmxConnectionTimeoutInSeconds: 5
useAddressTranslator: false

# datacenterAvailability has three possible values: ALL | LOCAL | EACH
# the correct value to use depends on whether jmx ports to C* nodes in remote datacenters are accessible
# If the reaper has access to all node jmx ports, across all datacenters, then configure to ALL.
# If jmx access is only available to nodes in the same datacenter as reaper is running in, then configure to LOCAL.
# If there's a reaper instance running in every datacenter, and it's important that nodes under duress are not involved in repairs,
#    then configure to EACH.
#
# The default is ALL
datacenterAvailability: ALL

#jmxPorts:

#jmxAuth:
#  username: myUsername
#  password: myPassword

logging:
  level: INFO
  loggers:
    com.datastax.driver.core.QueryLogger.NORMAL:
      level: DEBUG
      additive: false
      appenders:
        - type: file
          currentLogFilename: /var/log/cassandra-reaper/query-logger.log
          archivedLogFilenamePattern: /var/log/cassandra-reaper/query-logger-%d.log.gz
          archivedFileCount: 10
    io.dropwizard: WARN
    org.eclipse.jetty: WARN
  appenders:
    - type: console
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"
      threshold: WARN
    - type: file
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"
      currentLogFilename: /var/log/cassandra-reaper/reaper.log
      archivedLogFilenamePattern: /var/log/cassandra-reaper/reaper-%d.log.gz
      archivedFileCount: 30

server:
  type: default
  applicationConnectors:
    - type: http
      port: 8080
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 8081
      bindHost: 0.0.0.0
  requestLog:
    appenders: []

cassandra:
  clusterName: "QlearTest"
  contactPoints: ["db1","db2","db3"]
  keyspace: reaper_db
  loadBalancingPolicy:
    type: tokenAware
    shuffleReplicas: true
    subPolicy:
      type: dcAwareRoundRobin
      localDC:
      usedHostsPerRemoteDC: 0
      allowRemoteDCsForLocalConsistencyLevel: false

autoScheduling:
  enabled: false
  initialDelayPeriod: PT15S
  periodBetweenPolls: PT10M
  timeBeforeFirstSchedule: PT5M
  scheduleSpreadPeriod: PT6H
  excludedKeyspaces:
    - keyspace1
    - keyspace2

# Uncomment the following to enable dropwizard metrics
#  Configure to the reporter of your choice
#  Reaper also provides prometheus metrics on the admin port at /prometheusMetrics

#metrics:
#  frequency: 1 minute
#  reporters:
#    - type: log
#      logger: metrics

Please advise. I may have missed something.

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: REAP-110

shaotingcheng commented 3 years ago

For the run with the wrong metric, the repair run was killed and I got this log:

WARN [2021-04-24 07:56:51,655] [qleartest:708d4400-a341-11eb-bfea-7ff2db79e593] i.c.s.RepairRunner - RepairRun "708d4400-a341-11eb-bfea-7ff2db79e593" does not exist. Killing RepairRunner for this run instance.

So the runner had no chance to update the metric.
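
To check whether a run is affected, one option is to scrape the metrics text and compare the two gauges. The following is only a rough sketch: it assumes the endpoint serves the standard Prometheus text format on the admin port from the config above (8081, path /prometheusMetrics), and it pairs series purely by matching "segmentsDone" and "segmentsTotal" inside the metric name. The helper names (`fetch_metrics`, `find_incomplete_runs`) and the metric names in the sample are hypothetical, not Reaper's actual naming.

```python
import re
import urllib.request
from collections import defaultdict

# One sample line of the Prometheus text format: name, optional {labels}, value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def fetch_metrics(url="http://localhost:8081/prometheusMetrics"):
    """Download the metrics page as text (URL is an assumption based on the config)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def find_incomplete_runs(metrics_text):
    """Return {series_key: (done, total)} for every pair where done < total.

    Two series are paired when their names are identical once
    'segmentsDone' / 'segmentsTotal' is replaced by '*'.
    """
    series = defaultdict(dict)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        for kind in ("segmentsDone", "segmentsTotal"):
            if kind in name:
                key = name.replace(kind, "*") + labels
                try:
                    series[key][kind] = float(value)
                except ValueError:
                    pass  # ignore values that are not plain numbers
    return {
        key: (vals["segmentsDone"], vals["segmentsTotal"])
        for key, vals in series.items()
        if len(vals) == 2 and vals["segmentsDone"] < vals["segmentsTotal"]
    }


# Hypothetical sample; real metric names from Reaper will differ.
SAMPLE = """\
# TYPE example gauge
example_segmentsDone_qleartest_run1 199.0
example_segmentsTotal_qleartest_run1 200.0
example_segmentsDone_qleartest_run2 200.0
example_segmentsTotal_qleartest_run2 200.0
"""

incomplete = find_incomplete_runs(SAMPLE)
```

Against a live instance, `find_incomplete_runs(fetch_metrics())` would list the runs whose segmentsDone gauge is still below segmentsTotal. A run that shows up here after it has finished, or after its RepairRunner was killed as in the log above, would match the behavior described in this issue.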