quantcast / qfs

Quantcast File System
https://quantcast.atlassian.net
Apache License 2.0

Meta server replication (VR) - ERROR (KfsNetClient.cc:736) #228

Closed by cryptobum 6 years ago

cryptobum commented 6 years ago

I am trying to use VR with 3 meta server nodes. The node with ID 0 starts fine, but when I start the remaining nodes I see this error:

ERROR - (KfsNetClient.cc:736) closing connection: not connected to: 192.168.233.116 20000 due to network error pending: read: 0 write: 0 ops: 1 auth failures: 0 error:

and qfsadmin returns an error when I ping these nodes:

ERROR - (qfsadmin_main.cc:277) retry limit reached error: Unknown error 10110 10110

What could be my mistake?
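
Since the message mentions a network error, one basic thing worth ruling out is plain TCP reachability of the meta server client ports. The following is a minimal sketch, not part of QFS: the helper names `parse_endpoints` and `can_connect` are hypothetical, and the address list is copied from the metaServer.metaDataSync.servers value in the output below. A successful connect only shows that the port accepts connections; it says nothing about the VR state of the node.

```python
import socket


def parse_endpoints(value):
    """Split a 'host port host port ...' string into [(host, port), ...]."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected host/port pairs, got: %r" % value)
    return [(tokens[i], int(tokens[i + 1])) for i in range(0, len(tokens), 2)]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Same format as the metaServer.metaDataSync.servers property.
    servers = "192.168.233.116 20000 192.168.233.117 20000 192.168.233.118 20000"
    for host, port in parse_endpoints(servers):
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} {state}")
```

Running it from each meta server host in turn shows whether every node can open a connection to every other node's client port.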

Full output of: # metaserver MetaServer.prp

loading key metaServer.clientPort with value 20000
loading key metaServer.chunkServerPort with value 30000
loading key metaServer.logDir with value /opt/qfs/conf/meta/transaction_logs
loading key metaServer.cpDir with value /opt/qfs/conf/meta/checkpoint
loading key metaServer.recoveryInterval with value 30
loading key metaServer.clusterKey with value my-fs-unique-identifier
loading key metaServer.msgLogWriter.logLevel with value INFO
loading key chunkServer.msgLogWriter.logLevel with value NOTICE
loading key metaServer.vr.id with value 1
loading key metaServer.vr.hostnameToId with value meta-0 0 meta-1 1 meta-2 2
loading key metaServer.metaMds with value 34427f3a1ee919554b7507370035421d
loading key metaServer.vr.ignoreInvalidVrState with value 1
loading key metaServer.metaDataSync.fileSystemId with value 371167773649703093
loading key metaServer.metaDataSync.servers with value 192.168.233.116 20000 192.168.233.117 20000 192.168.233.118 20000
loading key metaServer.metaDataSync.writeSync with value 1
loading key metaServer.log.receiver.listenOn with value 192.168.233.117  22222
loading key metaServer.exitOnRestart with value 1

INFO - (metaserver_main.cc:459) md5sum /proc/self/exe: 34427f3a1ee919554b7507370035421d
INFO - (nofilelimit.cc:82) max # of open files: 4096
INFO - (metaserver_main.cc:675) meta server client listener:  20000
INFO - (metaserver_main.cc:694) meta server chunk server listener:  30000
INFO - (metaserver_main.cc:702) path->fid cache disabled
INFO - (metaserver_main.cc:562) min chunk servers that should connect: 1
INFO - (metaserver_main.cc:570) min. # of replicas per file: 1
INFO - (metaserver_main.cc:739) hard limits: open files: 4096 chunk servers: 480 clients: 3360
INFO - (NetDispatch.cc:1434) socket limits: clients: 3150 log: receiver: 105 transmitter: 105
INFO - (LayoutManager.cc:2643) max. response size: 268435456 minIoBufferBytesToProcessRequest: 278921216
INFO - (LayoutManager.cc:3192) setting properties for 0 chunk servers: chunkServer.msgLogWriter.logLevel=NOTICE;
INFO - (metaserver_main.cc:761) starting metaserver
INFO - (MetaDataSync.cc:553) attempting to fetch checkpoint and logs from other node(s)
INFO - (MetaDataSync.cc:581) done fetching checkpoint and logs from: 192.168.233.116 20000
ERROR - (KfsNetClient.cc:736) closing connection: not connected to: 192.168.233.116 20000 due to network error pending: read: 0 write: 0 ops: 1 auth failures: 0 error:
INFO - (KfsNetClient.cc:2599) 192.168.233.116 20000 retry attempt 1 of 5, will retry 1 pending operation(s) in 4 seconds
INFO - (Restorer.cc:142) restoring from checkpoint of 2018-03-14T12:05:13.895102Z
INFO - (Replay.cc:634) open log file: /opt/qfs/conf/meta/transaction_logs/log.0.0.0.0 => /opt/qfs/conf/meta/transaction_logs/log.0.0.0.0
INFO - (Restorer.cc:142) restoring from checkpoint of 2018-03-14T12:05:13.895136Z
INFO - (metaserver_main.cc:962) updating space utilization
INFO - (metaserver_main.cc:966) replaying logs
INFO - (Replay.cc:634) open log file: /opt/qfs/conf/meta/transaction_logs/log.0.0.0.0 => /opt/qfs/conf/meta/transaction_logs/log.0.0.0.0
INFO - (Replay.cc:1344) log time: 2018-03-14T12:05:13.894290Z
WARN - (MetaVrSM.cc:963) transition into backup state with empty VR configuration, and node id non 0 node id: 1 primary: 0 active: 0
INFO - (LogTransmitter.cc:2001) update: primary: 0 tranmitters: 0 up: 0 ids up: 0 quorum: 0 committed: -1 -1 -1 ack: [-1 -1 -1,-1 -1 -1] ids: 0 => 0 up: 0 => 1
INFO - (LogWriter.cc:899) log append: idx: 0 start: 0 0 0 cur: 0 0 0 block: 0 hex: 1 file: /opt/qfs/conf/meta/transaction_logs/log.0.0.0.0 size: 109 checksum: 745ddf2aefcc172fb452c9cc3918fcc3
INFO - (metaserver_main.cc:774) start servicing
INFO - (LayoutManager.cc:13947) start servicing, primary: 0 servers: 0 replay: 0 disconnected: 0
INFO - (LayoutManager.cc:13984) stop servicing, primary: 0 servers: 0 replay: 0 disconnected: 0
NOTICE ...........
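
In a long startup trace like the one above, the WARN and ERROR lines are easy to miss among the INFO lines. The sketch below filters them out. It assumes the `LEVEL - (file.cc:line) message` layout seen in this output; the names `parse_line` and `problems` are hypothetical helpers, not part of QFS.

```python
import re

# Matches lines such as: "WARN - (MetaVrSM.cc:963) transition into backup state ..."
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+) - \((?P<source>[^:()]+):(?P<line>\d+)\)\s?(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with level, source, line, message, or None if no match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["line"] = int(record["line"])
    return record


def problems(lines, levels=("WARN", "ERROR")):
    """Keep only the parsed records whose level is in `levels`."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and r["level"] in levels]


if __name__ == "__main__":
    import sys

    # Usage: metaserver MetaServer.prp 2>&1 | python filter_log.py
    for rec in problems(sys.stdin):
        print(f'{rec["level"]} {rec["source"]}:{rec["line"]} {rec["message"]}')
```

Lines that do not follow the layout, such as the `loading key ...` lines, are skipped rather than reported.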

mikeov commented 6 years ago

I'd recommend following the VR configuration procedure described in the Administrator's Guide:

https://github.com/quantcast/qfs/wiki/Administrator's-Guide

cryptobum commented 6 years ago

Hi, Mike! Thanks for the answer! I am following the VR configuration procedure described in the admin guide and still get the error. Could you please look at my config files? Maybe I missed something.

node-0

metaServer.clientPort = 20000
metaServer.chunkServerPort = 30000
metaServer.logDir = /opt/qfs/conf/meta/transaction_logs
metaServer.cpDir = /opt/qfs/conf/meta/checkpoint
metaServer.recoveryInterval = 30
metaServer.clusterKey = my-fs-unique-identifier
metaServer.msgLogWriter.logLevel = INFO
chunkServer.msgLogWriter.logLevel = NOTICE
metaServer.vr.id = 0
metaServer.vr.hostnameToId = meta-0 0 meta-1 1 meta-2 2
metaServer.vr.syncVrStateFile = 1
metaServer.vr.ignoreInvalidVrState = 1
metaServer.metaDataSync.fileSystemId = 371167773649703093 
metaServer.metaDataSync.servers = 192.168.233.116 20000 192.168.233.117 20000 192.168.233.118 20000
metaServer.metaDataSync.writeSync = 1
metaServer.log.receiver.listenOn = 0.0.0.0  22222
metaServer.exitOnRestart = 1

node-1

metaServer.clientPort = 20000
metaServer.chunkServerPort = 30000
metaServer.logDir = /opt/qfs/conf/meta/transaction_logs
metaServer.cpDir = /opt/qfs/conf/meta/checkpoint
metaServer.recoveryInterval = 30
metaServer.clusterKey = my-fs-unique-identifier
metaServer.msgLogWriter.logLevel = INFO
chunkServer.msgLogWriter.logLevel = NOTICE
metaServer.vr.id = 1
metaServer.vr.hostnameToId = meta-0 0 meta-1 1 meta-2 2
metaServer.vr.syncVrStateFile = 1
metaServer.vr.ignoreInvalidVrState = 1
metaServer.metaDataSync.fileSystemId = 371167773649703093 
metaServer.metaDataSync.servers = 192.168.233.116 20000 192.168.233.117 20000 192.168.233.118 20000
metaServer.metaDataSync.writeSync = 1
metaServer.log.receiver.listenOn = 0.0.0.0  22222
metaServer.exitOnRestart = 1

node-2

metaServer.clientPort = 20000
metaServer.chunkServerPort = 30000
metaServer.logDir = /opt/qfs/conf/meta/transaction_logs
metaServer.cpDir = /opt/qfs/conf/meta/checkpoint
metaServer.recoveryInterval = 30
metaServer.clusterKey = my-fs-unique-identifier
metaServer.checkpoint.writeSync = 1
metaServer.msgLogWriter.logLevel = INFO
chunkServer.msgLogWriter.logLevel = NOTICE
metaServer.vr.id = 2
metaServer.vr.hostnameToId = meta-0 0 meta-1 1 meta-2 2
metaServer.vr.syncVrStateFile = 1
metaServer.vr.ignoreInvalidVrState = 1
metaServer.metaDataSync.fileSystemId = 371167773649703093
metaServer.metaDataSync.servers = 192.168.233.116 20000 192.168.233.117 20000 192.168.233.118 20000
metaServer.log.receiver.listenOn = 0.0.0.0 22222
metaServer.exitOnRestart = 1
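
Three nearly identical listings are hard to compare by eye, so a small script can report the keys whose values differ between nodes. This is a minimal sketch under stated assumptions: the files use the `key = value` layout shown above, lines starting with `#` are treated as comments, and the function names and file names are hypothetical.

```python
def parse_properties(text):
    """Parse 'key = value' lines into a dict, normalizing whitespace in values."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Collapse runs of spaces so "0.0.0.0  22222" equals "0.0.0.0 22222".
        props[key.strip()] = " ".join(value.split())
    return props


def diff_configs(configs):
    """Given {node_name: props}, return {key: {node_name: value_or_None}}
    for every key whose value is not the same on all nodes."""
    all_keys = set().union(*(props.keys() for props in configs.values()))
    diffs = {}
    for key in sorted(all_keys):
        values = {node: props.get(key) for node, props in configs.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python diff_prp.py node-0.prp node-1.prp node-2.prp
    configs = {}
    for path in sys.argv[1:]:
        with open(path) as handle:
            configs[path] = parse_properties(handle.read())
    for key, values in diff_configs(configs).items():
        print(key)
        for node, value in values.items():
            print(f"    {node}: {value if value is not None else '<missing>'}")
```

Applied to the three listings above, it reports metaServer.vr.id (expected to differ per node), metaServer.checkpoint.writeSync (set only on node-2), and metaServer.metaDataSync.writeSync (missing on node-2). Whether the last two differences are related to the error is not established here.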

mikeov commented 6 years ago

The VR configuration is stored in the transaction log and checkpoint, as these are replicated to all active meta server nodes. This is done in order to allow reconfiguring VR (adding or removing meta server nodes) without downtime.

The qfsadmin tool is used to configure and reconfigure VR. Typically, qfsadmin error messages can be used to determine what the problem is.

Additional information, including a test script that configures VR, can be found in the following discussion: https://github.com/quantcast/qfs/issues/223