tsuna / gohbase

Pure-Go HBase client
Apache License 2.0
732 stars 211 forks

How to improve the performance of `scan` operation? #222

Closed fangxlmr closed 1 year ago

fangxlmr commented 1 year ago

Hi all, I encountered a performance issue when range-scanning HBase using gohbase.

Say I range-scan HBase using the default options generated by hrpc.NewScanRangeStr, and the scan returns 1000 rows in total. Later on I found it too slow, so I decided to tune some config option (e.g. batchSize), i.e. the number of results returned per RPC call. But I'm unable to find the option for this.

I dug up some related details:

  1. HBase does have a config option called hbase.client.scanner.caching (defaults to Integer.MAX_VALUE) that the client sends along with the query. This option is most likely the one I want, but it is obviously already at its maximum.

hbase.client.scanner.caching
Description: Number of rows that we try to fetch when calling next on a scanner if it is not served from (local, client) memory. This configuration works together with hbase.client.scanner.max.result.size to try and use the network efficiently. The default value is Integer.MAX_VALUE ...

  2. In gohbase, the caching field exists in the client scan proto, but it is ignored when initializing the RPC request, so the scan request is sent with caching left empty. Also, no filter for this is available in the filter package.

  3. HBase also has a config option called Caching (defaults to 1), which is used to control the number of rows returned when querying a RegionServer.

  4. In the HBase codebase, I found the code snippet that handles the scan operation. It seems that the client-side caching is not enforced, whereas number_of_rows controls the batchSize. Is that correct? (It also defaults to max int32 in gohbase.)

My questions are:

  1. Which option is used to control the batchSize (in terms of the rpc proto)? caching or number_of_rows? Or another field?
  2. What does it default to?
  3. How can it be tuned?
  4. Is it necessary to add a new filter for batchSize?
  5. Will tuning the batchSize improve performance? Or is there a better plan to address this issue?
dethi commented 1 year ago

Which option is used to control the batchSize (in terms of the rpc proto)? caching or number_of_rows? Or another field? What does it default to?

NumberOfRows: how many rows to fetch in each scan request -> https://github.com/tsuna/gohbase/blob/master/hrpc/scan.go#L291
MaxResultSize: how many bytes of data to fetch in each scan request (takes priority over NumberOfRows) -> https://github.com/tsuna/gohbase/blob/master/hrpc/scan.go#L309
Defaults: https://github.com/tsuna/gohbase/blob/master/hrpc/scan.go#L25-L30

batchSize is not used by GoHBase. Also, GoHBase doesn't have an in-memory cache for scanners, which is why we don't use the caching field from the proto definition. To be honest, I'm not sure why this field is even in the HBase proto, since it doesn't seem to change any behaviour on the server; only the client seems impacted by it.

Depending on your use case, data shape and application, you could use AllowPartialResults: https://github.com/tsuna/gohbase/blob/master/hrpc/scan.go#L324

fangxlmr commented 1 year ago

Which option is used to control the batchSize (in terms of the rpc proto)? caching or number_of_rows? Or another field? What does it default to?

NumberOfRows: how many rows to fetch in each scan request -> https://github.com/tsuna/gohbase/blob/master/hrpc/scan.go#L291
MaxResultSize: how many bytes of data to fetch in each scan request (takes priority over NumberOfRows) -> https://github.com/tsuna/gohbase/blob/master/hrpc/scan.go#L309
Defaults: https://github.com/tsuna/gohbase/blob/master/hrpc/scan.go#L25-L30

Right, thanks for the explanation.

To be honest, I'm not sure why this field is even in the HBase proto, since it doesn't seem to change any behaviour on the server; only the client seems impacted by it.

Agreed. I found no processing logic on the server side that handles caching. Or maybe I just missed it.