quantcast / qfs

Quantcast File System
https://quantcast.atlassian.net
Apache License 2.0

question about getting chunk data #234

Open yangyangjuanjuan opened 6 years ago

yangyangjuanjuan commented 6 years ago

I read the code but could not find a function that returns the data of a chunk/block. Is there a good way to read a block, given that block's information? The block information looks like this: blocks: 2 BlockInfo: offset: 67108864 kfsChunkId_t: 262186 int64_t: 1 ServerLocation: 127.0.0.1 21001 chunkOff_t: 16777216

mikeov commented 6 years ago

A QFS client read from a “logical” chunk position would obviously fetch chunk data, though with a replicated file there is no control over which replica is used.

KfsClient::CompareChunkReplicas() fetches the data of all chunk replicas and compares it. KfsClient::VerifyDataChecksums() fetches the chunk checksum vectors and compares them, instead of the actual data.

These QFS client methods are used by the qfsdataverify tool.

yangyangjuanjuan commented 6 years ago

Thanks for the clarification. I asked because I hope to find a way to load a chunk locally. This would be a nice feature for downstream development based on QFS. For example, I would like to assign a task to the node that holds the data chunk locally, so that the data does not have to be transferred over the network. I found that there are chunk directories on the chunk server, and that chunks are stored there, but they are not stored as plain text. Is there a function that can read them? In addition, may I assume that each block ends with a complete row (if I upload a well-formatted .csv file to QFS)? In other words, could a row get split across two chunks? Thanks again.

mikeov commented 6 years ago

KfsClient::GetDataLocation() can be used to retrieve chunks / stripes location.

Chunk / stripe boundaries are always at fixed positions / locations / offsets, i.e. they are independent of file content. In other words, the assumption that the row boundaries of a csv file will coincide with chunk boundaries does not hold. Small files are the obvious special case, where the data fits in one chunk, or in one stripe for striped files.
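Because the boundaries are fixed, a task that processes one chunk has to cope with rows that straddle a boundary. Below is a minimal sketch of one common convention (borrowed from how input splits are usually handled in map reduce frameworks, not taken from the QFS code): a byte range "owns" every row whose first byte falls inside it, skips the partial row at its start, and reads past its end to finish its last row. The names ChunkIndex and RowsOwnedByRange are hypothetical, the whole file is held in a string for illustration, and the 64 MiB chunk size is an assumption that is consistent with the offset 67108864 quoted earlier in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: fixed chunk size of 64 MiB (67108864 bytes).
constexpr int64_t kChunkSize = int64_t(64) << 20;

// Index of the chunk that contains the given file offset.
inline int64_t ChunkIndex(int64_t fileOffset, int64_t chunkSize = kChunkSize)
{
    assert(fileOffset >= 0 && chunkSize > 0);
    return fileOffset / chunkSize;
}

// Returns the rows "owned" by the byte range [start, end) of data:
// every row whose first byte lies inside the range. The last owned row
// may extend past end; in a real job those bytes come from the next chunk.
std::vector<std::string> RowsOwnedByRange(
    const std::string& data, std::size_t start, std::size_t end)
{
    std::vector<std::string> rows;
    std::size_t pos = start;
    if (start > 0) {
        // Skip the partial row that began in the previous range. Searching
        // from start - 1 keeps a row that begins exactly at start.
        const std::size_t nl = data.find('\n', start - 1);
        if (nl == std::string::npos) {
            return rows; // No row begins in this range.
        }
        pos = nl + 1;
    }
    while (pos < end && pos < data.size()) {
        const std::size_t nl = data.find('\n', pos);
        if (nl == std::string::npos) {
            rows.push_back(data.substr(pos)); // Last row, no trailing newline.
            break;
        }
        rows.push_back(data.substr(pos, nl - pos));
        pos = nl + 1;
    }
    return rows;
}
```

With this convention every row is processed exactly once, no matter where the fixed boundaries fall, at the cost of a small read from the following chunk for the last row of each range.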

yangyangjuanjuan commented 6 years ago

Thanks for your reply @mikeov. Could you give me some suggestions for the following case? I have QFS deployed on a cluster, and I want to run a map reduce job on a data set that consists of two chunks (blocks) in QFS, where each chunk has three replicas. I want each map task to be assigned to a chunk server that holds a replica of its chunk locally. KfsClient::GetDataLocation() can be used to find the right chunk server, but how can I load a local chunk on a chunk server? If I use KfsClient::Read(), will it always try the local replica first (if there is one)? Thanks again!

mikeov commented 6 years ago

QFS client attempts to read from the "local" (same node as client) chunk server when possible.
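Given that behavior, the scheduling side of the question reduces to placing each task on a host that holds a replica of its chunk, and then reading through the normal client. The following is a minimal sketch of such a placement, not part of QFS: AssignTasksToReplicaHosts is a hypothetical helper, and its input (one list of replica host names per chunk) is an assumed shape for whatever a location query returns. It greedily picks the least loaded replica host for each chunk.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// replicaHosts[i] lists the hosts that hold a replica of chunk i
// (hypothetical input shape). Returns, for each chunk, the host chosen to
// run its task, or an empty string if no replica host is known.
std::vector<std::string> AssignTasksToReplicaHosts(
    const std::vector<std::vector<std::string> >& replicaHosts)
{
    std::map<std::string, int> load; // Tasks assigned to each host so far.
    std::vector<std::string> assignment;
    assignment.reserve(replicaHosts.size());
    for (const auto& hosts : replicaHosts) {
        std::string best;
        for (const auto& host : hosts) {
            // Prefer the replica host with the fewest tasks; ties keep the
            // host that appears first in the list.
            if (best.empty() || load[host] < load[best]) {
                best = host;
            }
        }
        if (!best.empty()) {
            ++load[best];
        }
        assignment.push_back(best);
    }
    assert(assignment.size() == replicaHosts.size());
    return assignment;
}
```

For the case described above (two chunks, three replicas each, all on the same three hosts), this places the two map tasks on two different hosts, and each task can then read its chunk through the regular client read path.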