m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform
https://m3db.io/
Apache License 2.0

[problem] Stuck bootstrapping namespace; strace of the PID shows "Operation not permitted" on db files #1440

Closed naughtyGitCat closed 5 years ago

naughtyGitCat commented 5 years ago

problem binary

m3dbnode

problem occurs

While initializing a cluster with a replication factor of three and three isolation groups, one of the three m3dbnode instances got stuck bootstrapping the default namespace.

m3dbnode environment

[zzz@10.200.183.69 15:57:16 ~/monitor-server/m3db]$ top
top - 15:58:57 up 81 days, 23:38,  3 users,  load average: 2.01, 1.53, 1.41
Tasks: 380 total,   1 running, 379 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  1.3 sy,  0.0 ni, 96.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13171493+total, 11085758+free,  9827604 used, 11029748 buff/cache
KiB Swap:  8388604 total,  8388604 free,        0 used. 12055723+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
36223 xxx       20   0 20.277g 8.478g 789848 S 114.2  6.7  55:15.61 /home/zzz/monitor-server/m3db/m3dbnode -f /home/zzz/monitor-server/m3db/node1.yml

[zzz@10.200.183.69 15:58:59 ~/monitor-server/m3db]$ ll /data1/m3db/
total 12
drwxr-xr-x 2 zzz zzz 4096 Mar  9 15:31 commitlogs
drwxr-xr-x 3 zzz zzz 4096 Mar  9 15:31 data
drwxr-xr-x 3 zzz zzz 4096 Mar  9 15:33 index

pstack m3dbnode process

Thread 1 (process 36223):
#0  runtime.futex () at /usr/local/Cellar/go/1.11.2/libexec/src/runtime/sys_linux_amd64.s:532
#1  0x0000000000429de0 in runtime.futexsleep (addr=0x27c3fa0 <runtime.timers+1696>, ns=2881571743, val=0) at /usr/local/Cellar/go/1.11.2/libexec/src/runtime/os_linux.go:63
#2  0x000000000040ab0e in runtime.notetsleep_internal (n=0x27c3fa0 <runtime.timers+1696>, ns=2881571743, ~r2=<optimized out>) at /usr/local/Cellar/go/1.11.2/libexec/src/runtime/lock_futex.go:193
#3  0x000000000040ac6f in runtime.notetsleepg (n=0x27c3fa0 <runtime.timers+1696>, ns=2881571743, ~r2=<optimized out>) at /usr/local/Cellar/go/1.11.2/libexec/src/runtime/lock_futex.go:228
#4  0x000000000044b1de in runtime.timerproc (tb=<optimized out>) at /usr/local/Cellar/go/1.11.2/libexec/src/runtime/time.go:288
#5  0x000000000045bc11 in runtime.goexit () at /usr/local/Cellar/go/1.11.2/libexec/src/runtime/asm_amd64.s:1333
#6  0x00000000027c3f80 in runtime.timers ()
#7  0x0000000000000000 in ?? ()

tailf m3dbnode log

2019/03/09 15:31:28 Go Runtime version: go1.11.2
2019/03/09 15:31:28 Build Version:      v0.6.1
2019/03/09 15:31:28 Build Revision:     e0c976b
2019/03/09 15:31:28 Build Branch:       master
2019/03/09 15:31:28 Build Date:         2019-02-20-13:26:06
2019/03/09 15:31:28 Build TimeUnix:     1550687166
15:31:28.831053[I] no seed nodes set, using dedicated etcd cluster
2019-03-09T15:31:28.850+0800    INFO    resolved cluster namespace  {"namespace": "default"}
2019-03-09T15:31:28.850+0800    INFO    resolved cluster namespace  {"namespace": "aggregated"}
15:31:29.146446[W] max index query IDs concurrency was not set, falling back to default value
15:31:29.146744[W] host doesn't support HugeTLB, proceeding without it
2019-03-09T15:31:29.157+0800    INFO    configuring downsampler to use with aggregated cluster namespaces   {"numAggregatedClusterNamespaces": 1}
2019-03-09T15:31:29.161+0800    INFO    no m3msg server configured
2019-03-09T15:31:29.161+0800    INFO    starting server {"address": "0.0.0.0:7201"}
15:31:29.172231[I] bytes pool registering bucket capacity=824638689960, size=824638689984, refillLowWatermark=%!f(*float64=0x275e4f0), refillHighWatermark=%!f(*float64=0x275e4e8)
15:31:29.172255[I] bytes pool registering bucket capacity=824638689992, size=824638690000, refillLowWatermark=%!f(*float64=0x275e4f0), refillHighWatermark=%!f(*float64=0x275e4e8)
15:31:29.172261[I] bytes pool registering bucket capacity=824638690008, size=824638690016, refillLowWatermark=%!f(*float64=0x275e4f0), refillHighWatermark=%!f(*float64=0x275e4e8)
15:31:29.172266[I] bytes pool registering bucket capacity=824638690024, size=824638690032, refillLowWatermark=%!f(*float64=0x275e4f0), refillHighWatermark=%!f(*float64=0x275e4e8)
15:31:29.172271[I] bytes pool registering bucket capacity=824638690040, size=824638690048, refillLowWatermark=%!f(*float64=0x275e4f0), refillHighWatermark=%!f(*float64=0x275e4e8)
15:31:29.172276[I] bytes pool registering bucket capacity=824638690056, size=824638690064, refillLowWatermark=%!f(*float64=0x275e4f0), refillHighWatermark=%!f(*float64=0x275e4e8)
15:31:29.172281[I] bytes pool registering bucket capacity=824638690072, size=824638690080, refillLowWatermark=%!f(*float64=0x275e4f0), refillHighWatermark=%!f(*float64=0x275e4e8)
15:31:29.172321[I] bytes pool %!s(*config.PoolingType=<nil>) init
15:31:30.586974[I] creating dynamic config service client with m3cluster
15:31:30.587371[W] could not load cache from file /data1/m3db/cache/_kv_default_env_m3db_embedded.json: error opening cache file /data1/m3db/cache/_kv_default_env_m3db_embedded.json: open /data1/m3db/cache/_kv_default_env_m3db_embedded.json: no such file or directory
15:31:30.587401[I] waiting for dynamic topology initialization, if this takes a long time, make sure that a topology/placement is configured
15:31:30.587409[I] adding a watch for service: m3db env: default_env zone: embedded includeUnhealthy: true
15:31:30.587439[W] could not load cache from file /data1/m3db/cache/m3db_embedded.json: error opening cache file /data1/m3db/cache/m3db_embedded.json: open /data1/m3db/cache/m3db_embedded.json: no such file or directory
15:31:30.589368[W] error creating cache file /data1/m3db/cache/m3db_embedded.json: open /data1/m3db/cache/m3db_embedded.json: no such file or directory
15:31:30.589418[E] failed to write cache file [{error invalid cache file: /data1/m3db/cache/m3db_embedded.json}]
15:31:30.589835[I] initial topology / placement value received
15:31:40.599408[E] error initializing namespaces values, retrying in the background [{key /namespaces} {error initializing value error:init watch timeout}]
15:31:40.672499[I] received kv update with version 1 for key /placement
15:31:40.673077[I] election manager opened successfully
15:31:40.706882[I] cluster database initializing topology
15:31:40.706910[I] cluster database resolving topology
15:31:40.706919[I] cluster database resolved topology
15:31:40.732239[I] creating namespaces watch
15:31:40.732285[I] waiting for dynamic namespace registry initialization, if this takes a long time, make sure that a namespace is configured
15:31:40.734772[I] initial namespace value received
15:31:40.734831[W] error creating cache file /data1/m3db/cache/_kv_default_env_m3db_embedded.json: open /data1/m3db/cache/_kv_default_env_m3db_embedded.json: no such file or directory
15:31:40.734950[E] failed to write cache file [{error invalid cache file: /data1/m3db/cache/_kv_default_env_m3db_embedded.json}]
15:31:40.735021[I] resolving namespaces with namespace watch
15:31:40.735099[I] updating database namespaces [{adds [default, aggregated]} {updates []} {removals []}]
15:31:41.237404[I] node tchannelthrift: listening on 0.0.0.0:9000
15:31:41.238037[I] cluster tchannelthrift: listening on 0.0.0.0:9001
15:31:41.673352[I] election state changed from follower to leader
15:31:41.696342[I] node httpjson: listening on 0.0.0.0:9002
15:31:41.696521[I] cluster httpjson: listening on 0.0.0.0:9003
15:31:41.697200[I] bootstrapping shards for range starting [{run bootstrap-data} {bootstrapper base} {namespace default} {numShards 192} {from 1999-11-29 08:00:00 +0800 CST} {to 2019-03-04 08:00:00 +0800 CST} {range 168840h0m0s}]
15:31:41.698803[I] bootstrapping from source starting [{source filesystem} {namespace default} {from 0001-01-01 00:00:00 +0000 UTC} {to 0001-01-01 00:00:00 +0000 UTC} {range 0s} {shards 192}]
15:31:41.698933[I] bootstrapping from source completed successfully [{source filesystem} {namespace default} {from 0001-01-01 00:00:00 +0000 UTC} {to 0001-01-01 00:00:00 +0000 UTC} {range 0s} {shards 192} {took 95.37µs} {numSeries 0}]
15:31:41.699579[I] bootstrapping from source starting [{source commitlog} {namespace default} {from 0001-01-01 00:00:00 +0000 UTC} {to 0001-01-01 00:00:00 +0000 UTC} {range 0s} {shards 0}]
15:31:41.699604[I] bootstrapping from source completed successfully [{source commitlog} {namespace default} {from 0001-01-01 00:00:00 +0000 UTC} {to 0001-01-01 00:00:00 +0000 UTC} {range 0s} {shards 0} {took 1.958µs} {numSeries 0}]
15:31:41.700590[I] bootstrapping from source starting [{source peers} {namespace default} {from 1999-11-29 08:00:00 +0800 CST} {to 2019-03-04 08:00:00 +0800 CST} {range 168840h0m0s} {shards 192}]
15:31:41.700745[I] peers bootstrapper resolving block retriever [{namespace default}]
15:31:42.277130[I] successfully updated topology to 3 hosts
15:31:42.360669[I] peers bootstrapper bootstrapping shards for ranges [{shards 192} {concurrency 20} {shouldPersist true}]
15:31:42.371137[I] successfully updated topology to 3 hosts
15:33:02.885442[I] bootstrapping from source completed successfully [{source peers} {namespace default} {from 1999-11-29 08:00:00 +0800 CST} {to 2019-03-04 08:00:00 +0800 CST} {range 168840h0m0s} {shards 192} {took 1m21.184820645s} {numSeries 0}]
15:33:02.886731[I] bootstrapping shards for range completed successfully [{run bootstrap-data} {bootstrapper base} {namespace default} {numShards 192} {from 1999-11-29 08:00:00 +0800 CST} {to 2019-03-04 08:00:00 +0800 CST} {range 168840h0m0s} {took 1m21.189477095s}]
15:33:02.886770[I] bootstrapping shards for range starting [{run bootstrap-data} {bootstrapper base} {namespace default} {numShards 192} {from 2019-03-04 08:00:00 +0800 CST} {to 2019-03-11 08:00:00 +0800 CST} {range 168h0m0s}]
15:33:17.325794[I] bootstrapping from source starting [{source filesystem} {namespace default} {from 0001-01-01 00:00:00 +0000 UTC} {to 0001-01-01 00:00:00 +0000 UTC} {range 0s} {shards 192}]
15:33:17.325912[I] bootstrapping from source completed successfully [{source filesystem} {namespace default} {from 0001-01-01 00:00:00 +0000 UTC} {to 0001-01-01 00:00:00 +0000 UTC} {range 0s} {shards 192} {took 69.249µs} {numSeries 0}]
15:33:17.326284[I] bootstrapping from source starting [{source commitlog} {namespace default} {from 0001-01-01 00:00:00 +0000 UTC} {to 0001-01-01 00:00:00 +0000 UTC} {range 0s} {shards 0}]
15:33:17.326300[I] bootstrapping from source completed successfully [{source commitlog} {namespace default} {from 0001-01-01 00:00:00 +0000 UTC} {to 0001-01-01 00:00:00 +0000 UTC} {range 0s} {shards 0} {took 1.07µs} {numSeries 0}]
15:33:17.326809[I] bootstrapping from source starting [{source peers} {namespace default} {from 2019-03-04 08:00:00 +0800 CST} {to 2019-03-11 08:00:00 +0800 CST} {range 168h0m0s} {shards 192}]
15:33:17.326941[I] peers bootstrapper bootstrapping shards for ranges [{shards 192} {concurrency 40} {shouldPersist false}]
15:33:17.359210[I] bootstrapping from source completed successfully [{source peers} {namespace default} {from 2019-03-04 08:00:00 +0800 CST} {to 2019-03-11 08:00:00 +0800 CST} {range 168h0m0s} {shards 192} {took 32.367118ms} {numSeries 0}]
15:33:17.360524[I] bootstrapping shards for range completed successfully [{run bootstrap-data} {bootstrapper base} {namespace default} {numShards 192} {from 2019-03-04 08:00:00 +0800 CST} {to 2019-03-11 08:00:00 +0800 CST} {range 168h0m0s} {took 14.473733149s}]
15:33:17.360580[I] bootstrapping shards for range starting [{run bootstrap-index} {bootstrapper base} {namespace default} {numShards 192} {from 1999-11-29 08:00:00 +0800 CST} {to 2019-03-04 08:00:00 +0800 CST} {range 168840h0m0s}]
15:33:42.605290[I] bootstrapping from source starting [{source filesystem} {namespace default} {from 1999-11-29 08:00:00 +0800 CST} {to 2019-03-04 08:00:00 +0800 CST} {range 168840h0m0s} {shards 192}]
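
To put the {range 168840h0m0s} value from the log into perspective, here is a small sketch that converts it into days and recomputes the range start from the logged end time. The values are copied from the log lines above; reading the range as the namespace's retention window is an assumption, since the namespace configuration is not shown in this thread.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the "bootstrapping shards for range starting" log line.
CST = timezone(timedelta(hours=8))                   # the log prints +0800 CST
range_end = datetime(2019, 3, 4, 8, 0, tzinfo=CST)   # {to 2019-03-04 08:00:00 +0800 CST}
bootstrap_range = timedelta(hours=168840)            # {range 168840h0m0s}

# Recompute the start of the range and express its length in days and years.
range_start = range_end - bootstrap_range
range_days = bootstrap_range.days
range_years = range_days / 365.25

print(range_start.isoformat())
print(range_days, round(range_years, 1))
```

The recomputed start matches the {from 1999-11-29 08:00:00 +0800 CST} field in the log, so the node is being asked to cover roughly 19 years of blocks.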

strace m3dbnode process

###
read(331, "\10\1\316\21", 4)            = 4
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1169424000000000000-digest.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1169424000000000000-digest.db", {st_mode=S_IFREG|0664, st_size=20, ...}, 0) = 0
read(331, "\21\16ht\1\0\0\0\1\0\0\0\1\0\10\0\1\0\0\0", 128) = 20
read(331, "", 128)                      = 0
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1169424000000000000-info.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1169424000000000000-info.db", {st_mode=S_IFREG|0664, st_size=48, ...}, 0) = 0
read(331, "\1\222\2\231\317\20:\240\356\237\321\0\0\317\0\2&\17\371)\0\0\0\1\221\0\222\1\323\200\0\0"..., 128) = 48
read(331, "", 128)                      = 0
close(331)                              = 0
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170028800000000000-checkpoint.db", {st_mode=S_IFREG|0664, st_size=4, ...}, 0) = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170028800000000000-checkpoint.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b5c4) = -1 EPERM (Operation not permitted)
read(331, "_\1\222\30", 4)              = 4
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170028800000000000-digest.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170028800000000000-digest.db", {st_mode=S_IFREG|0664, st_size=20, ...}, 0) = 0
read(331, "k\16W\202\1\0\0\0\1\0\0\0\1\0\10\0\1\0\0\0", 128) = 20
read(331, "", 128)                      = 0
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170028800000000000-info.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170028800000000000-info.db", {st_mode=S_IFREG|0664, st_size=48, ...}, 0) = 0
read(331, "\1\222\2\231\317\20<\306\376\230\372\0\0\317\0\2&\17\371)\0\0\0\1\221\0\222\1\323\200\0\0"..., 128) = 48
read(331, "", 128)                      = 0
close(331)                              = 0
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170633600000000000-checkpoint.db", {st_mode=S_IFREG|0664, st_size=4, ...}, 0) = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170633600000000000-checkpoint.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b5c4) = -1 EPERM (Operation not permitted)
read(331, "\270\1\312\37", 4)           = 4
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170633600000000000-digest.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170633600000000000-digest.db", {st_mode=S_IFREG|0664, st_size=20, ...}, 0) = 0
read(331, "\307\f\226B\1\0\0\0\1\0\0\0\1\0\10\0\1\0\0\0", 128) = 20
read(331, "", 128)                      = 0
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170633600000000000-info.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1170633600000000000-info.db", {st_mode=S_IFREG|0664, st_size=48, ...}, 0) = 0
read(331, "\1\222\2\231\317\20>\355\16\222#\0\0\317\0\2&\17\371)\0\0\0\1\221\0\222\1\323\200\0\0"..., 128) = 48
read(331, "", 128)                      = 0
close(331)                              = 0
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171238400000000000-checkpoint.db", {st_mode=S_IFREG|0664, st_size=4, ...}, 0) = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171238400000000000-checkpoint.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b5c4) = -1 EPERM (Operation not permitted)
read(331, "\21\1\335\22", 4)            = 4
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171238400000000000-digest.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171238400000000000-digest.db", {st_mode=S_IFREG|0664, st_size=20, ...}, 0) = 0
read(331, "\"\f\257'\1\0\0\0\1\0\0\0\1\0\10\0\1\0\0\0", 128) = 20
read(331, "", 128)                      = 0
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171238400000000000-info.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171238400000000000-info.db", {st_mode=S_IFREG|0664, st_size=48, ...}, 0) = 0
read(331, "\1\222\2\231\317\20A\23\36\213L\0\0\317\0\2&\17\371)\0\0\0\1\221\0\222\1\323\200\0\0"..., 128) = 48
read(331, "", 128)                      = 0
close(331)                              = 0
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171843200000000000-checkpoint.db", {st_mode=S_IFREG|0664, st_size=4, ...}, 0) = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171843200000000000-checkpoint.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b5c4) = -1 EPERM (Operation not permitted)
read(331, "h\1\241\31", 4)              = 4
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171843200000000000-digest.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171843200000000000-digest.db", {st_mode=S_IFREG|0664, st_size=20, ...}, 0) = 0
read(331, "|\f\2365\1\0\0\0\1\0\0\0\1\0\10\0\1\0\0\0", 128) = 20
read(331, "", 128)                      = 0
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171843200000000000-info.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1171843200000000000-info.db", {st_mode=S_IFREG|0664, st_size=48, ...}, 0) = 0
read(331, "\1\222\2\231\317\20C9.\204u\0\0\317\0\2&\17\371)\0\0\0\1\221\0\222\1\323\200\0\0"..., 128) = 48
read(331, "", 128)                      = 0
close(331)                              = 0
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1172448000000000000-checkpoint.db", {st_mode=S_IFREG|0664, st_size=4, ...}, 0) = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1172448000000000000-checkpoint.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b5c4) = -1 EPERM (Operation not permitted)
read(331, "\277\1e ", 4)                = 4
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1172448000000000000-digest.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
newfstatat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1172448000000000000-digest.db", {st_mode=S_IFREG|0664, st_size=20, ...}, 0) = 0
read(331, "\326\f\215C\1\0\0\0\1\0\0\0\1\0\10\0\1\0\0\0", 128) = 20
read(331, "", 128)                      = 0
close(331)                              = 0
openat(AT_FDCWD, "/data1/m3db/data/default/116/fileset-1172448000000000000-info.db", O_RDONLY|O_CLOEXEC) = 331
epoll_ctl(4, EPOLL_CTL_ADD, 331, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1031881072, u64=139922476450160}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 331, 0xc096f7b544) = -1 EPERM (Operation not permitted)
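
A note on the EPERM results in the trace above: the epoll_ctl(2) man page lists EPERM as the error returned when the target file does not support epoll, for example a regular file or a directory, and every read() that follows in the trace succeeds. So these errors may well be harmless rather than a file permission problem. The sketch below reproduces the same error outside of m3dbnode by registering a temporary regular file with epoll on Linux. The function name is made up for illustration.

```python
import errno
import select
import tempfile


def epoll_register_errno():
    """Try to register a regular file with epoll.

    Returns the errno of the failure, 0 if registration succeeded,
    or None on platforms without epoll (it is Linux-only).
    """
    if not hasattr(select, "epoll"):
        return None
    ep = select.epoll()
    try:
        with tempfile.TemporaryFile() as f:
            try:
                ep.register(f.fileno(), select.EPOLLIN | select.EPOLLOUT)
            except OSError as exc:
                return exc.errno
            return 0
    finally:
        ep.close()


result = epoll_register_errno()
print(result, errno.errorcode.get(result))
```

On a Linux host this is expected to report EPERM, the same "Operation not permitted" seen in the strace output, even though the file is fully readable by the calling user.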

####

futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 297388}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc0009064c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0004cb9c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 2878651}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc0009064c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0004cb9c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 1225293}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc0009064c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc00050a140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 5025136}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc0009064c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc00050a140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 364764034}) = 0
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 4947303}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc0009064c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0004cb9c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 210810}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc00050a140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0004cb9c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 862202}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc166a18140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 1820061}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc0004cbd40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0004ca4c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 1197135}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc0004ca4c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0004cbd40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 243082656}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc00095e840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc166a18140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x27c4520, FUTEX_WAIT_PRIVATE, 0, {0, 1099108}) = -1 ETIMEDOUT (Connection timed out)

### 

futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x27bfd20, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)

naughtyGitCat commented 5 years ago

problem node config

db:
  logging:
    level: info

  metrics:
    prometheus:
      handlerPath: /metrics
    sanitization: prometheus
    samplingRate: 1.0
    extended: detailed

  hostID:
    resolver: config
    value: SATA001

  config:
    service:
      env: default_env
      zone: embedded
      service: m3db
      cacheDir: /data1/m3db/cache
      etcdClusters:
        - zone: embedded
          endpoints:
            - 127.0.0.1:2379
  listenAddress: 0.0.0.0:9000
  clusterListenAddress: 0.0.0.0:9001
  httpNodeListenAddress: 0.0.0.0:9002
  httpClusterListenAddress: 0.0.0.0:9003
  debugListenAddress: 0.0.0.0:9004

  client:
    writeConsistencyLevel: majority
    readConsistencyLevel: unstrict_majority

  gcPercentage: 100

  writeNewSeriesAsync: true
  writeNewSeriesLimitPerSecond: 1048576
  writeNewSeriesBackoffDuration: 2ms

  bootstrap:
    bootstrappers:
        - filesystem
        - commitlog
        - peers
        - uninitialized_topology
    fs:
        numProcessorsPerCPU: 0.125

  cache:
    series:
      policy: lru
    postingsList:
      size: 262144

  commitlog:
    flushMaxBytes: 524288
    flushEvery: 1s
    blockSize: 10m
    queue:
        calculationType: fixed
        size: 2097152

  fs:
    filePathPrefix: /data1/m3db
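
One thing the config and the log have in common: cacheDir points at /data1/m3db/cache, the log reports "no such file or directory" when creating cache files there, and the directory listing earlier shows only commitlogs, data and index. Whether this has anything to do with the bootstrap stall is not established in this thread. As a minimal sketch, a pre-start check could create the directory and confirm it is writable. The helper name is hypothetical, and the demo runs against a temporary directory rather than the real path.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create path if it is missing and verify that a file can be written in it."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating and removing a scratch file proves write access.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


# Demo against a throwaway location; on the node the argument would be
# the cacheDir value from the config, i.e. /data1/m3db/cache.
demo_root = tempfile.mkdtemp()
demo_cache = os.path.join(demo_root, "cache")
print(ensure_writable_dir(demo_cache))
```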

richardartoul commented 5 years ago

@naughtyGitCat What's your namespace configuration? The messed-up node is trying to bootstrap all the way back from 1973 :/ Do you have some kind of crazy long retention, or is a zero value sneaking in there somehow?

naughtyGitCat commented 5 years ago

Thanks to Evan, I reduced the number of shards from 192 to 48, and the cluster then booted up successfully. Perhaps my SAS disks aren't a good fit for this config.
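
A rough back-of-the-envelope sketch of why the shard count could matter here. It assumes the number in each fileset name is the block start in nanoseconds since the Unix epoch (consistent with the 7-day spacing between consecutive names in the strace output), that the 168840h range is split into 168h blocks, and that every shard holds a fileset for every block. None of this is confirmed in the thread, so treat the totals as an upper bound. Only the three file types visible in the trace (checkpoint, digest, info) are counted.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def fileset_block_start(name):
    """Parse 'fileset-<nanos>-<type>.db' into an aware UTC datetime (assumed format)."""
    nanos = int(name.split("-")[1])
    return datetime.fromtimestamp(nanos // NANOS_PER_SECOND, tz=timezone.utc)


# Two consecutive fileset names taken from the strace output.
first = fileset_block_start("fileset-1169424000000000000-digest.db")
second = fileset_block_start("fileset-1170028800000000000-digest.db")
block_days = (second - first).days
print(first.isoformat(), block_days)

# Upper-bound estimate of small files the filesystem bootstrapper may touch.
blocks_per_shard = 168840 // 168   # logged range divided by the assumed block size
files_per_fileset = 3              # checkpoint, digest, info (as seen in the trace)
estimates = {
    shards: shards * blocks_per_shard * files_per_fileset
    for shards in (192, 48)
}
for shards, files in estimates.items():
    print(shards, files)
```

Under these assumptions, going from 192 to 48 shards cuts the number of small files opened during bootstrap by a factor of four, which would fit the observation that the node stopped stalling on slower disks.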