quantcast / qfs

Quantcast File System
https://quantcast.atlassian.net
Apache License 2.0

chunk allocation with rack awareness #235

Closed szabolcsf closed 5 years ago

szabolcsf commented 5 years ago

We have a ~25PB qfs 2.0.0 cluster with rackId configured on the chunkservers. Our physical servers have several disks, so we run multiple chunkservers per physical server. For this reason each physical server has a unique rackId.

We allocate one primary + one replica for every chunk. The goal is that every chunk should survive a complete failure of any physical server.

But somehow both the primary and the replica of a chunk end up on the same physical server, i.e. the same rackId.

This is our metaserver config:

metaServer.clientPort = 20000
metaServer.chunkServerIp = 10.10.1.1
metaServer.chunkServerPort = 20100
metaServer.clusterKey = prod
metaServer.cpDir = /home/qfs/meta/checkpoints
metaServer.logDir = /home/qfs/meta/logs
metaServer.createEmptyFs = 0
metaServer.recoveryInterval = 1
metaServer.msgLogWriter.logLevel = INFO
metaServer.msgLogWriter.maxLogFileSize = 100e09
metaServer.msgLogWriter.maxLogFiles = 3
metaServer.minChunkservers = 1
metaServer.clientThreadCount = 5
metaServer.maxClientCount = 131072
metaServer.clientSM.maxPendingOps = 51200
metaServer.chunkServer.chunkAllocTimeout = 1500
metaServer.chunkServer.chunkReallocTimeout = 2000
metaServer.clientSM.inactivityTimeout = 1500
metaServer.clientSM.auditLogging = 1
metaServer.rootDirUser = 1002
metaServer.rootDirGroup = 1003
metaServer.rootDirMode = 0777
metaServer.maxSpaceUtilizationThreshold = 0.95
metaServer.serverDownReplicationDelay = 7200
metaServer.maxConcurrentReadReplicationsPerNode = 8
chunkServer.storageTierPrefixes = /disk 10
chunkServer.bufferedIo = 0
metaServer.maxRebalanceSpaceUtilThreshold = 0.94
metaServer.minRebalanceSpaceUtilThreshold = 0.93
metaServer.chunkServer.heartbeatInterval = 15
metaServer.sortCandidatesBySpaceUtilization = 0
metaServer.rebalancingEnabled = 0
metaServer.MTimeUpdateResolution = 60
metaServer.msgLogWriter.logFilePrefixes = MetaServer.log
metaServer.maxWritesPerDriveRatio = 3
metaServer.sortCandidatesByLoadAvg = 1

and this is a chunkserver config:

chunkServer.metaServer.port = 20100
chunkServer.clientIp = 10.10.1.25
chunkServer.clientPort = 21011
chunkServer.clusterKey = prod
chunkServer.rackId = 681000
chunkServer.chunkDir = /mnt/n01/qfs /mnt/n02/qfs /mnt/n03/qfs /mnt/n04/qfs /mnt/n05/qfs /mnt/n06/qfs
chunkServer.diskIo.crashOnError = 0
chunkServer.abortOnChecksumMismatchFlag = 1
chunkServer.msgLogWriter.logLevel = DEBUG
chunkServer.msgLogWriter.maxLogFileSize = 1e9
chunkServer.msgLogWriter.maxLogFiles = 2
chunkServer.diskQueue.threadCount = 5
chunkServer.ioBufferPool.partitionBufferCount = 1572864
chunkServer.bufferedIo = 0
chunkServer.dirRecheckInterval = 864000
chunkServer.maxSpaceUtilizationThreshold = 0.05

For instance, the other chunkservers on this same physical server also have rackId 681000. It has already happened several times that a physical server died and we lost chunks, because both copies were on that physical server, albeit assigned to different chunkservers within it.

Could you please take a look at our configs and see if we are doing something wrong?

szabolcsf commented 5 years ago

From qfsfsck output:

Chunks reachable no rack assigned: 226257930 100%

Does that mean there's no rack assigned to any of the chunks?

mikeov commented 5 years ago

Rack IDs outside the range 0 to 65535 are considered invalid and are ignored by the chunk placement logic. Presently only rack IDs specified by the metaServer.rackPrefixes parameter are validated, with an error message emitted when a rack ID is outside the valid range.

If all chunkserver rack IDs are outside the valid range, fsck will report all chunks as having no rack ID assigned.
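The range rule above can be sketched in a few lines of Python (illustrative only; is_valid_rack_id is not an actual QFS function, just a restatement of the rule):

```python
# Rack IDs outside 0..65535 are treated as invalid and ignored by
# chunk placement, which is why fsck reports "no rack assigned".
VALID_RACK_ID_MIN = 0
VALID_RACK_ID_MAX = 65535

def is_valid_rack_id(rack_id: int) -> bool:
    """Return True if this rack ID participates in placement decisions."""
    return VALID_RACK_ID_MIN <= rack_id <= VALID_RACK_ID_MAX

# The cluster's rackId of 681000 falls outside the range, so those
# chunkservers are effectively placed as if they had no rack at all:
print(is_valid_rack_id(681000))  # False
print(is_valid_rack_id(681))     # True
```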

The present design assumes that the number of racks (failure groups) is reasonably small: less than 100 or so.

I’d recommend using one chunkserver per physical node / host, with an adequate number of network IO (“client”) and disk IO threads. By default the number of disk IO threads is 2 per chunk directory / IO device / “disk”. The annotated chunkserver configuration file https://github.com/quantcast/qfs/blob/master/conf/ChunkServer.prp describes the corresponding parameters, chunkServer.clientThreadCount and chunkServer.diskQueue.threadCount, and offers some insight into how to set them.
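Under that recommendation, the six chunkservers on the host above would collapse into a single chunkserver serving all six disks. A sketch of the relevant config lines (the rackId and thread counts here are illustrative values, not tuned recommendations):

chunkServer.clientIp = 10.10.1.25
chunkServer.clientPort = 21011
# one in-range failure-group ID per rack (illustrative value)
chunkServer.rackId = 11
# all six disks served by this one chunkserver
chunkServer.chunkDir = /mnt/n01/qfs /mnt/n02/qfs /mnt/n03/qfs /mnt/n04/qfs /mnt/n05/qfs /mnt/n06/qfs
# network IO threads for client connections (illustrative)
chunkServer.clientThreadCount = 4
# disk IO threads; the default is 2 per chunk directory
chunkServer.diskQueue.threadCount = 2

See the annotated ChunkServer.prp linked above for guidance on sizing these for a multi-disk host.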

szabolcsf commented 5 years ago

Thank you @mikeov, this is very useful! We are going to fix the rackIds in the chunkserver configs and see how the chunk placement goes. For the record, we don't have 681000 racks in the cluster; we just did some multiplication to ensure uniqueness. We are going to use real rackIds now (one ID per rack), within the 0 to 65535 range.
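Concretely, with one ID per physical rack, every chunkserver on every host in a given rack shares that rack's ID; for a host in, say, rack 7 (a made-up number for illustration) the line becomes:

chunkServer.rackId = 7

Since 7 is within the valid 0 to 65535 range, the placement logic will then keep a chunk's replica off the rack that holds its primary.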

szabolcsf commented 5 years ago

Closing this as resolved.