xiangyi-yang opened 4 years ago
We are hitting the same problem. Our block size is 6 hours, so data is flushed to disk at 2:00, 8:00, 14:00, and 20:00.
At 19:50, when we query 1 hour of data, series go missing at random. When we query 6 hours of data the problem is still there, but when we query 7 hours of data, everything is fine!
At 20:10 everything is fine too, even if we query only 1 hour of data.
This happens so often that it is blocking the project from being used in a production environment.
Is this a known issue, or just a configuration problem? Thank you.
Would you mind sharing your namespace and coordinator configs (and whether you're using Graphite-style or Prometheus-style metrics)?
Does this only happen with m3query, or do you also see the issue when you use Prometheus remote read?
Series are missing both when accessing via m3query and via Prometheus. The configuration is as follows:
namespace:

```json
"app_1m": {
  "bootstrapEnabled": true,
  "cleanupEnabled": true,
  "coldWritesEnabled": false,
  "flushEnabled": true,
  "indexOptions": {
    "blockSizeNanos": "21600000000000",
    "enabled": true
  },
  "repairEnabled": false,
  "retentionOptions": {
    "blockDataExpiry": true,
    "blockDataExpiryAfterNotAccessPeriodNanos": "300000000000",
    "blockSizeNanos": "21600000000000",
    "bufferFutureNanos": "600000000000",
    "bufferPastNanos": "1200000000000",
    "futureRetentionPeriodNanos": "0",
    "retentionPeriodNanos": "64368000000000000"
  },
  "schemaOptions": null,
  "snapshotEnabled": true,
  "writesToCommitLog": true
}
```
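For reference, the nanosecond values above line up with the 6-hour block size described in the report. A quick sanity check in Go, using only the values from this namespace:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Convert the namespace's nanosecond options to human-readable durations.
	fmt.Println(time.Duration(21600000000000))    // 6h0m0s      (index and retention block size)
	fmt.Println(time.Duration(600000000000))      // 10m0s       (bufferFuture)
	fmt.Println(time.Duration(1200000000000))     // 20m0s       (bufferPast)
	fmt.Println(time.Duration(64368000000000000)) // 17880h0m0s  (retention, matches the coordinator config below)
}
```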
coordinator:

```yaml
listenAddress:
  type: "config"
  value: "0.0.0.0:17201"
metrics:
  scope:
    prefix: "coordinator"
  prometheus:
    handlerPath: /metrics
    listenAddress: 0.0.0.0:17202
  sanitization: prometheus
  samplingRate: 1.0
  extended: none
clusters:
  - namespaces:
      - namespace: app_1m
        retention: 17880h
        type: unaggregated
    client:
      config:
        service:
          env: default_env
          zone: embedded
          service: m3db
          cacheDir: /apps/dat/m3db/m3kv_coordinator
          etcdClusters:
            - zone: embedded
              endpoints:
                - x.x.x.x:2379
                - y.y.y.y:2379
                - z.z.z.z:2379
      writeConsistencyLevel: majority
      readConsistencyLevel: unstrict_majority
      writeTimeout: 10s
      fetchTimeout: 15s
      connectTimeout: 20s
      writeRetry:
        initialBackoff: 500ms
        backoffFactor: 3
        maxRetries: 2
        jitter: true
      fetchRetry:
        initialBackoff: 500ms
        backoffFactor: 2
        maxRetries: 3
        jitter: true
      backgroundHealthCheckFailLimit: 4
      backgroundHealthCheckFailThrottleFactor: 0.5
tagOptions:
  idScheme: quoted
```
query:

```yaml
listenAddress:
  type: "config"
  value: "0.0.0.0:17203"
metrics:
  scope:
    prefix: "coordinator"
  prometheus:
    handlerPath: /metrics
    listenAddress: 0.0.0.0:17204
  sanitization: prometheus
  samplingRate: 1.0
  extended: none
tagOptions:
  idScheme: quoted
#limits:
#  perQuery:
#    maxFetchedSeries: 500
clusters:
  - namespaces:
      - namespace: app_1m
        type: unaggregated
        retention: 17880h
    client:
      config:
        service:
          env: default_env
          zone: embedded
          service: m3db
          cacheDir: /apps/dat/m3db/m3kv
          etcdClusters:
            - zone: embedded
              endpoints:
                - x.x.x.x:2379
                - y.y.y.y:2379
                - z.z.z.z:2379
      writeConsistencyLevel: majority
      readConsistencyLevel: all
      writeTimeout: 10s
      fetchTimeout: 30s
      connectTimeout: 20s
      writeRetry:
        initialBackoff: 500ms
        backoffFactor: 3
        maxRetries: 2
        jitter: true
      fetchRetry:
        initialBackoff: 500ms
        backoffFactor: 2
        maxRetries: 3
        jitter: true
      backgroundHealthCheckFailLimit: 4
      backgroundHealthCheckFailThrottleFactor: 0.5
```
dbnode:

```yaml
db:
  logging:
    level: info
  metrics:
    prometheus:
      handlerPath: /metrics
    sanitization: prometheus
    samplingRate: 1.0
    extended: detailed
  hostID:
    # resolver: environment
    # envVarName: M3DB_HOST_ID
    resolver: hostname
    envVarName: M3DB-DBNODE-A-002
  # Fill-out the following and un-comment before using.
  config:
    service:
      env: default_env
      zone: embedded
      service: m3db
      cacheDir: /apps/dat/m3db/m3kv
      etcdClusters:
        - zone: embedded
          endpoints:
            - x.x.x.x:2379
            - y.y.y.y:2379
            - z.z.z.z:2379
  listenAddress: 0.0.0.0:9000
  clusterListenAddress: 0.0.0.0:9001
  httpNodeListenAddress: 0.0.0.0:9002
  httpClusterListenAddress: 0.0.0.0:9003
  debugListenAddress: 0.0.0.0:9004
  client:
    writeConsistencyLevel: majority
    readConsistencyLevel: majority
  gcPercentage: 80
  writeNewSeriesAsync: true
  writeNewSeriesLimitPerSecond: 1048576
  writeNewSeriesBackoffDuration: 2ms
  bootstrap:
    bootstrappers:
      - filesystem
      - peers
      - commitlog
      - uninitialized_topology
    fs:
      numProcessorsPerCPU: 0.125
    commitlog:
      returnUnfulfilledForCorruptCommitLogFiles: false
  cache:
    series:
      policy: lru
    postingsList:
      size: 262144
  commitlog:
    flushMaxBytes: 524288
    flushEvery: 5s
    queue:
      calculationType: fixed
      size: 2097152
  fs:
    filePathPrefix: /apps/dat/m3db
```
You're probably hitting the limit on index reads when you query. Check for the M3-Results-Limited response header. It wouldn't be surprising given the size of your unaggregated namespace, which we recommend keeping to at most 48 hours of retention. You can also increase the limit, either in the configs or by adding &limit=XXXX to your query string, but raising it may cause your cluster to OOM.
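As a minimal sketch of checking that header, assuming the coordinator's Prometheus-compatible query endpoint on the listenAddress from the config above (the query string itself is a placeholder):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Issue an instant query with an explicit per-query limit and inspect
	// the M3-Results-Limited response header.
	q := url.Values{}
	q.Set("query", "up")    // placeholder query
	q.Set("limit", "10000") // raise the per-query series limit

	resp, err := http.Get("http://localhost:17201/api/v1/query?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if v := resp.Header.Get("M3-Results-Limited"); v != "" {
		fmt.Println("results were limited:", v)
	} else {
		fmt.Println("results were not limited")
	}
}
```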
The query returns about 100 series, the time span is 1 hour, and the granularity is 1 minute, so the limit should not be reached. Data is only missing when the queried period lies entirely in the in-memory buffer; if the queried period includes data that has already been flushed to disk, no series go missing.
We found that the data was missing because indexing of new data was too slow. The "record the end to end indexing latency" step was slowing the index down; commenting out this code restored normal behavior.
m3/src/dbnode/storage/index.go

```go
// record the end to end indexing latency
now := i.nowFn()
for idx := range pending {
	took := now.Sub(pending[idx].EnqueuedAt)
	i.metrics.InsertEndToEndLatency.Record(took)
}
```
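A less drastic alternative than deleting the block might be to sample the observations so the timer's quantile stream is updated far less often. A hypothetical sketch (pendingEntry and recordSampledLatency are illustrative names, not m3's actual types):

```go
package index

import (
	"time"

	"github.com/uber-go/tally"
)

// pendingEntry stands in for the element type of `pending` in the quoted
// snippet (hypothetical; the real type lives in m3's index package).
type pendingEntry struct {
	EnqueuedAt time.Time
}

// recordSampledLatency records the end-to-end latency for only one out of
// every sampleEvery pending inserts, cutting timer observations (and the
// quantile-stream updates behind them) by that factor.
func recordSampledLatency(now time.Time, pending []pendingEntry, timer tally.Timer, sampleEvery int) {
	for idx := range pending {
		if idx%sampleEvery == 0 {
			timer.Record(now.Sub(pending[idx].EnqueuedAt))
		}
	}
}
```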
Is this just because the summary-type metric has an inherent performance problem? Are observations expensive due to the streaming quantile calculation, the same as with a Prometheus summary? https://prometheus.io/docs/practices/histograms/
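The linked page describes exactly that trade-off. As a rough, self-contained illustration (using the Prometheus Go client rather than m3's tally metrics), the following benchmark compares a summary observation, which updates streaming quantile estimators, against a histogram observation, which only increments a bucket counter. Save it as a `_test.go` file and run `go test -bench=.`:

```go
package main

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Summary with quantile objectives: each Observe feeds streaming
	// quantile estimators.
	summary = prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "latency_summary_seconds",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
	// Histogram: each Observe only increments a bucket counter and a sum.
	histogram = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "latency_histogram_seconds",
		Buckets: prometheus.DefBuckets,
	})
)

func BenchmarkSummaryObserve(b *testing.B) {
	for i := 0; i < b.N; i++ {
		summary.Observe(float64(i%1000) / 1e6)
	}
}

func BenchmarkHistogramObserve(b *testing.B) {
	for i := 0; i < b.N; i++ {
		histogram.Observe(float64(i%1000) / 1e6)
	}
}
```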
Querying the latest data, which has not yet been written to disk, has a higher probability of missing data. If part of the queried span falls in the current block and part has already been written to disk, missing data rarely occurs.