Closed: szabolcsf closed this issue 5 years ago
From qfsfsck output: Chunks reachable no rack assigned: 226257930 100%
Does that mean there's no rack assigned to any of the chunks?
Rack IDs outside the range from 0 to 65535 are considered invalid and are ignored by the chunk placement logic. Presently only rack IDs specified by the metaServer.rackPrefixes parameter are validated, with an error message emitted when a rack ID is outside the valid range.
If all chunk server rack IDs are outside the valid range, FSCK will report all chunks as having no rack ID assigned.
The present design assumes that the number of racks (failure groups) is reasonably small: fewer than 100 or so.
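As a rough illustration of the range rule above, here is a minimal Python sketch that flags rack IDs the placement logic would ignore. It is not part of QFS; the function names and the server names in the sample inventory are hypothetical, and only the 0 to 65535 range and the 681000 value come from this thread.

```python
from typing import Dict

# Valid rack ID range as stated in this thread: 0 to 65535 inclusive.
MIN_RACK_ID = 0
MAX_RACK_ID = 65535


def is_valid_rack_id(rack_id: int) -> bool:
    """Return True if rack_id is inside the range described as valid above."""
    return MIN_RACK_ID <= rack_id <= MAX_RACK_ID


def find_invalid_rack_ids(rack_ids_by_server: Dict[str, int]) -> Dict[str, int]:
    """Return only the servers whose rack ID falls outside the valid range."""
    return {
        server: rack_id
        for server, rack_id in rack_ids_by_server.items()
        if not is_valid_rack_id(rack_id)
    }


# Hypothetical inventory: the server names are made up for illustration;
# 681000 is the rack ID reported in this issue.
inventory = {"chunk-a:20000": 681000, "chunk-b:20000": 12}
print(find_invalid_rack_ids(inventory))
```

Running a check like this over the rack IDs in the chunk server configs before a rollout would show which servers end up with no usable rack assignment.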
I’d recommend using one chunk server per physical node / host, with an adequate number of network IO (“client”) and disk IO threads. By default the number of IO threads is 2 per chunk directory / IO device / “disk”. The annotated chunk server configuration file https://github.com/quantcast/qfs/blob/master/conf/ChunkServer.prp describes the corresponding parameters, chunkServer.clientThreadCount and chunkServer.diskQueue.threadCount, and offers some insight into how to set them.
Thank you @mikeov, this is very useful! We are going to fix the rackids in the chunk config and see how the chunk placement goes. Ftr, we don't have 681000 racks in the cluster, we just did some multiplication to ensure uniqueness. We are going to use real rackid now (one id per rack) and it will be within the 0 to 65535 range.
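For anyone doing the same renumbering, one way to get "one ID per rack" is to derive small sequential IDs from a host-to-rack mapping. This is only a sketch under assumptions: the host and rack names are hypothetical, `assign_rack_ids` is not a QFS tool, and the resulting numbers would still have to be written into each chunk server's config by hand or by your own tooling.

```python
from typing import Dict

# Upper bound of the valid rack ID range mentioned in this thread.
MAX_RACK_ID = 65535


def assign_rack_ids(rack_of_host: Dict[str, str]) -> Dict[str, int]:
    """Give every host the small integer ID of its rack, starting at 0."""
    # Sort rack names so the assignment is stable across runs.
    rack_names = sorted(set(rack_of_host.values()))
    if len(rack_names) > MAX_RACK_ID + 1:
        raise ValueError("more racks than valid rack IDs")
    id_of_rack = {name: index for index, name in enumerate(rack_names)}
    return {host: id_of_rack[rack] for host, rack in rack_of_host.items()}


# Hypothetical hosts and racks, for illustration only.
hosts = {"node01": "rack-a", "node02": "rack-a", "node03": "rack-b"}
print(assign_rack_ids(hosts))
```

Hosts in the same rack share an ID, so every chunk server on one physical server also shares it, and the IDs stay well inside the valid range.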
Closing this as resolved.
We have a ~25PB qfs 2.0.0 cluster with rackId configured on the chunkservers. Our physical servers have several disks, so we run multiple chunkservers per physical server. For this reason each physical server has a unique rackId.
We allocate one primary + one replica for every chunk. The goal is that every chunk should survive a complete failure of any physical server.
But somehow both primary and replica chunks end up on the same physical server, i.e. the same rackId.
This is our metaserver config:
and this is a chunkserver config:
So, for instance, the other chunkservers on this exact same physical server also have the 681000 rackId. So far it has happened several times that a physical server died and we lost chunks, because both copies were on the same physical server, although assigned to different chunkservers within that server. Could you please take a look at our configs and see if we are doing something wrong?