quantcast / qfs

Quantcast File System
https://quantcast.atlassian.net
Apache License 2.0

chunkservers get disconnected #230

Open szabolcsf opened 6 years ago

szabolcsf commented 6 years ago

We have deployed qfs 2.0 on a cluster of 200 dedicated servers running Debian 9 x86-64. Each has 10 hard drives, and we run one chunkserver per hard drive. We use a single metaserver on a separate server. These servers have nothing else running on them and don't suffer from high load or any resource bottlenecks.

Now there's a problem: every time we start writing to the cluster, chunkservers get temporarily disconnected and end up in the dead nodes history. It doesn't matter how much data we write; a few megabytes are enough to trigger the problem.

This is how it looks from the metaserver:

08-07-2018 01:08:34.061 DEBUG - (NetConnection.cc:108) netconn: 1262 read: Connection reset by peer 104
08-07-2018 01:08:34.061 ERROR - (ChunkServer.cc:1304) 10.10.1.5 21014 / 10.10.1.5:42806 chunk server down reason: communication error socket: good: 0 status: Connection reset by peer 104 -104
08-07-2018 01:08:34.061 INFO - (LayoutManager.cc:6019) server down: 10.10.1.5 21014 block count: 36595 master: 0 replay: 0 reason: communication error; Connection reset by peer 104 chunk-server-bye: 10.10.1.5 21014 logseq: 0 0 893823745 chunks: 36595 checksum: 0 2030060475277 36752 log: in flight: 0 repl delay: 7200 completion: no

and this is how it looks on a chunkserver:

08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:592) 10.10.1.3 20100 meta server inactivity timeout, last request received: 17 secs ago timeout: inactivity: 40 receive: 16
08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:1043) 10.10.1.3 20100 closing meta server connection due to receive timeout

The issue is easy to reproduce: a simple qfscat | qfsput triggers it. When we stop all write activity, there are no disconnect errors for as long as there are no writes to the cluster, i.e. if we don't write for days, there are no chunkserver errors for days.
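
For reference, the pipeline that triggers it is along these lines (the metaserver address, port and file paths below are placeholders, not our exact values):

$ qfscat -s <metaserver-ip> -p 20000 /some/bigfile | qfsput -s <metaserver-ip> -p 20000 -f /some/bigfile.copy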

One thing to note: when the disconnect error occurs, only a single chunkserver on a given node is disconnected; the remaining 9 chunkservers running on that node are fine. There's no pattern to which chunkservers get disconnected; it appears random. On one node it's chunkserver01, on another node it's chunkserver09, and so on.

When there's write activity, all nodes are affected by the disconnect error. The issue is spread almost evenly across nodes and doesn't depend on rack location. No node has a noticeably lower or higher number of triggered errors than the others.

None of the nodes have any lost TX/RX packets. There are no hardware issues and networking works properly.

This is a tcpdump packet capture of the issue:

891162 2018-08-14 13:57:21.006309 10.10.1.3 → 10.10.1.100 TCP 122 20100 → 46630 [PSH, ACK] Seq=53247 Ack=2318595 Win=32038 Len=52 TSval=873627156 TSecr=3858378234
891163 2018-08-14 13:57:21.006469 10.10.1.100 → 10.10.1.3 TCP 70 46630 → 20100 [RST, ACK] Seq=2318595 Ack=53299 Win=18888 Len=0 TSval=3858387169 TSecr=873627156
891164 2018-08-14 13:57:21.006540 10.10.1.100 → 10.10.1.3 TCP 78 45238 → 20100 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3858387169 TSecr=0 WS=2
891165 2018-08-14 13:57:21.006608 10.10.1.3 → 10.10.1.100 TCP 78 20100 → 45238 [SYN, ACK, ECN] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=873627157 TSecr=3858387169 WS=2

We have been using qfs 1.2 on a similarly sized cluster with no problems; no such chunkserver disconnect errors occurred there.

Is this a regression in qfs 2.0? What can we do to help troubleshoot and get this fixed?

mikeov commented 6 years ago

I’d suggest trying to reproduce the problem on a single node setup with two chunk servers and one meta server.

The top level makefile has a test target that creates and runs such a setup. The test fails if any chunk server disconnects during the test. This test runs successfully on Travis.
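
Roughly, from a source checkout (assuming the build prerequisites from the README are installed):

$ git clone https://github.com/quantcast/qfs.git
$ cd qfs
$ make test   # builds QFS and runs the single node meta server / chunk server setup; fails if any chunk server disconnects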

I’ve run a few tests with a few different settings, and was not able to reproduce the problem that you’re describing.

Another experiment that comes to mind would be to run iperf between the client and chunk server nodes to see whether iperf also causes chunk server disconnects.
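
For example, something along these lines (the address and duration are only illustrative):

# on the meta server node
$ iperf -s
# on the client / chunk server nodes, in parallel, for several minutes
$ iperf -c <metaserver-ip> -t 600 -P 10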

The 2.0 code is presently deployed on the Quantcast production cluster. I have not heard any reports of spurious chunk server disconnects, or of disconnects that appear to be correlated with IO activity.

The 2.0 code might be more sensitive to temporary network outages / latencies with the default configuration. This is needed to minimize the meta server primary / backup failover time. The meta server CPU cost of a chunk server re-connect with the 2.0 code is a small fraction of that in prior QFS versions, as typically only a partial chunk server inventory synchronization with the meta server is required on re-connect.

— Mike.

szabolcsf commented 6 years ago

Thanks for your suggestions.

make test did not reveal any issues; all tests completed successfully.

Also tried iperf. We've got 10G fiber links between our nodes, so we ran iperf -c 10.10.1.3 -t 600 -P 10 on all nodes at the same time (the iperf server was running on the same node as the qfs metaserver). I was able to trigger a connection reset by peer a single time with this test. However, I couldn't reproduce it later, even with a longer period and more parallel processes. On the other hand, doing qfscat | qfsput reproduces the issue very quickly and reliably.

Is there a way to configure the metaserver/chunkserver to be less sensitive to network latency? There are no dropped packets, so it doesn't seem like an outage, maybe just some minor latency.

szabolcsf commented 6 years ago

I'm experimenting with the timeout settings in the MetaServer configuration, using https://github.com/quantcast/qfs/blob/master/conf/MetaServer.prp as a reference. This is what I have now:

metaServer.chunkServer.chunkAllocTimeout = 1500
metaServer.chunkServer.chunkReallocTimeout = 2000
metaServer.clientSM.inactivityTimeout = 1500
metaServer.chunkServer.heartbeatTimeout = 160
metaServer.chunkServer.makeStableTimeout = 660
metaServer.chunkServer.requestTimeout = 1200

but I'm still getting the Connection reset by peer errors, which look like this on the ChunkServers:

08-27-2018 07:47:06.541 ERROR - (MetaServerSM.cc:592) 10.10.1.3 20100 meta server inactivity timeout, last request received: 17 secs ago timeout: inactivity: 40 receive: 16 
08-27-2018 07:47:06.541 ERROR - (MetaServerSM.cc:1043) 10.10.1.3 20100 closing meta server connection due to receive timeout

The last request received is always 17 secs ago. Is that hardcoded somewhere, or did I miss something in the timeout settings?

mikeov commented 6 years ago

I’d suggest adding the following parameter to the meta server configuration file:

metaServer.chunkServer.minInactivityInterval = 16

and reverting the timeout parameters to their defaults by changing their values in the meta server configuration file to the following:

metaServer.chunkServer.chunkAllocTimeout = 40
metaServer.chunkServer.chunkReallocTimeout = 75
metaServer.clientSM.inactivityTimeout = 480
metaServer.chunkServer.heartbeatTimeout = 60
metaServer.chunkServer.makeStableTimeout = 330
metaServer.chunkServer.requestTimeout = 600

and then sending a HUP signal to the meta server process, or removing these parameters and restarting the meta server.
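
For example, assuming the meta server process is named metaserver, the signal could be sent with something like:

$ kill -HUP $(pgrep -x metaserver)   # process name is an assumption; makes the meta server re-read its configuration file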

szabolcsf commented 6 years ago

Thanks, these settings kind of work. Now there are no chunkserver disconnect errors and nodes no longer end up in the dead nodes history. But there are some write errors that complain about an invalid write id:

$ qfscat -s 10.10.1.255 -p 20000 /bigfile | qfsput -s 10.10.1.185 -p 22000 -f /bigfile
08-30-2018 13:28:57.538 INFO - (Writer.cc:1383) PW 1,6,2,bigfile scheduling retry: 1 of 6 in 4 sec. op: write-prepare: chunkid: 33506 version: 1 offset: 34603008 numBytes: 1048576 checksum: 3941637200 location: 10.10.1.186 22003 writeId: 1280295217958875272  location: 10.10.1.187 22004 writeId: 972026391904307804  location: 10.10.1.188 22009 writeId: 3779866767468343402 
08-30-2018 14:03:28.472 ERROR - (Writer.cc:1320) PW 1,6,2,bigfile operation failure, seq: 2010396268510501266 status: -22 msg: invalid write id op: write-prepare: chunkid: 33741 version: 1 offset: 7340032 numBytes: 1048576 checksum: 3897471392 location: 10.10.1.188 22001 writeId: 4029286010832404040  location: 10.10.1.186 22001 writeId: 3717269637481429138  location: 10.10.1.187 22001 writeId: 3180519620600792290  current chunk server: 10.10.1.188 22001 chunkserver: all data sent
Request:
WRITE_PREPARE
Cseq: 2010396268510501266
Version: KFS/1.0
Client-Protocol-Version: 114
UserId: 1000
GroupId: 1000
User: team
Chunk-handle: 33741
Chunk-version: 1
Offset: 7340032
Num-bytes: 1048576
Checksum: 3897471392
Checksum-entries: 0
Reply: 1
Num-servers: 3
Servers:10.10.1.188 22001 4029286010832404040 10.10.1.186 22001 3717269637481429138 10.10.1.187 22001 3180519620600792290 

08-30-2018 14:03:28.472 INFO - (Writer.cc:1383) PW 1,6,2,bigfile scheduling retry: 1 of 6 in 4 sec. op: write-prepare: chunkid: 33741 version: 1 offset: 7340032 numBytes: 1048576 checksum: 3897471392 location: 10.10.1.188 22001 writeId: 4029286010832404040  location: 10.10.1.186 22001 writeId: 3717269637481429138  location: 10.10.1.187 22001 writeId: 3180519620600792290 

Does this mean some ChunkServers are still temporarily lost?

I've done several different tests so far. One of them was setting up a qfs 2.0 test cluster on servers that we had been using with qfs 1.2 with no problems (no chunkserver disconnects, no write errors). The write errors occurred on those servers as well.

For the record, this is our full qfs 2.0 MetaServer config.prp:

metaServer.clientPort = 22000
metaServer.chunkServerIp = 10.10.1.185
metaServer.chunkServerPort = 22100
metaServer.clusterKey = qfs2_test
metaServer.cpDir = /home/qfs/qfs_test/meta/checkpoints
metaServer.logDir = /home/qfs/qfs_test/meta/logs
metaServer.createEmptyFs = 0
metaServer.recoveryInterval = 1
metaServer.msgLogWriter.logLevel = DEBUG
metaServer.msgLogWriter.maxLogFileSize = 100e09
metaServer.msgLogWriter.maxLogFiles = 3
metaServer.minChunkservers = 1
metaServer.clientThreadCount = 18
metaServer.clientSM.maxPendingOps = 51200
metaServer.clientSM.auditLogging = 1
metaServer.rootDirUser = 1002
metaServer.rootDirGroup = 1003
metaServer.rootDirMode = 0777
metaServer.maxSpaceUtilizationThreshold = 0.95
metaServer.serverDownReplicationDelay = 7200
metaServer.maxConcurrentReadReplicationsPerNode = 8
chunkServer.storageTierPrefixes = /disk 10
chunkServer.bufferedIo = 0
metaServer.chunkServer.minInactivityInterval = 16
metaServer.chunkServer.chunkAllocTimeout = 40
metaServer.chunkServer.chunkReallocTimeout = 75
metaServer.clientSM.inactivityTimeout = 480
metaServer.chunkServer.heartbeatTimeout = 60
metaServer.chunkServer.makeStableTimeout = 330
metaServer.chunkServer.requestTimeout = 600

and this is what our ChunkServer config.prp files look like:

chunkServer.metaServer.port = 22100
chunkServer.clientIp = 10.10.1.186
chunkServer.clientPort = 22001
chunkServer.clusterKey = qfs2_test
chunkServer.rackId = 6811190
chunkServer.chunkDir = /mnt/01/qfs2_test
chunkServer.diskIo.crashOnError = 0
chunkServer.abortOnChecksumMismatchFlag = 1
chunkServer.msgLogWriter.logLevel = DEBUG
chunkServer.msgLogWriter.maxLogFileSize = 1e9
chunkServer.msgLogWriter.maxLogFiles = 2
chunkServer.diskQueue.threadCount = 5
chunkServer.ioBufferPool.partitionBufferCount = 1572864
chunkServer.bufferedIo = 0
chunkServer.dirRecheckInterval = 864000
chunkServer.maxSpaceUtilizationThreshold = 0.05

Can this also be related to qfs 2.0 being more sensitive to latency? Can we adjust our settings to avoid the write errors?

mikeov commented 6 years ago

The qfscat message log time stamps suggest that a 1MB transfer took more than 30 minutes, i.e., the average transfer rate was less than 1048576 * 8 / (30 * 60) ≈ 4.66 Kbps. Given that the chunk server and client network inactivity timeouts are less than a few minutes, this suggests that the data was slowly “trickling” from that client to the chunk server.
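
As a quick sanity check of that figure, plain shell arithmetic (integer division, rounding down) gives the same order of magnitude:

$ echo $(( 1048576 * 8 / (30 * 60) ))   # bits in 1MB divided by seconds in 30 minutes
4660

i.e. roughly 4.66 Kbps.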

The QFS chunk [write] lease timeout is 5 minutes. The chunk server renews the chunk lease after it successfully receives a chunk write RPC from the QFS client. [At 10 Mbps, a 1MB RPC network transfer takes roughly a second.]

In this case the chunk write RPC network transfer took more than 30 minutes, the chunk write lease timed out, and the client had to “re-allocate” the chunk and re-issue the RPC.

While it is possible to change the chunk lease timeout and rebuild the code, I think more changes would be required to make the code work with very low bandwidth network transfers. The present code assumes that the available bandwidth between the client and chunk servers is at least on the order of 10 Mbps.

Assuming a typical modern data center deployment with at least a 1Gbps network, the results of the qfscat experiment might point to a network HW problem.

At the moment I can only think of the following QFS client change that could be relevant. In theory (and in my testing) the change would increase network utilization and write throughput by pipelining write RPCs:

https://github.com/quantcast/qfs/commit/d7d5c50a04440eb7afaf1523310389f237d85510

The change isn’t 2.0 specific, but it is only in the 2.x releases. This change has been deployed in production at Quantcast for more than a year, and so far I have not heard any reports suggesting that it is causing problems.

If the underlying problem in this case is a HW problem, then a change in transfer bandwidth and/or timing could affect how frequently the problem manifests, in this case as an abnormally low transfer rate.