hortonworks-spark / shc

The Apache Spark - Apache HBase Connector is a library to support Spark accessing HBase table as external data source or sink.
Apache License 2.0

Multi row range filters? #344

Open alpuy opened 3 years ago

alpuy commented 3 years ago

I have the following data table:

columns = {
  "meas_point_key": {"cf": "rowkey", "col": "key1", "type": "string", "length": "8"},
  "date_key": {"cf": "rowkey", "col": "key2", "type": "string", "length": "14"},
  "magnitude_key": {"cf": "rowkey", "col": "key3", "type": "string", "length": "2"},
  "meas_int_key": {"cf": "rowkey", "col": "key4", "type": "string", "length": "1"},
  "source_key": {"cf": "rowkey", "col": "key5", "type": "string"},
  "date": {"cf": "IV", "col": "D", "type": "bigint"},
  "file": {"cf": "IV", "col": "F", "type": "string"},
  "last_update_date": {"cf": "IV", "col": "L", "type": "bigint"},
  "magnitude": {"cf": "IV", "col": "M", "type": "bigint"},
  "meas_int": {"cf": "IV", "col": "MI", "type": "bigint"},
  "meas_point": {"cf": "IV", "col": "MP", "type": "bigint"},
  "source": {"cf": "IV", "col": "S", "type": "bigint"},
  "value": {"cf": "IV", "col": "V", "type": "double"},
  "last_update_val": {"cf": "IV", "col": "LAV", "type": "bigint"},
  "val_det": {"cf": "IV", "col": "VD", "type": "string"},
  "val_res": {"cf": "IV", "col": "VR", "type": "string"}
}

I want to scan by row key with the following filter:

df = df1.where(
    (df1.meas_point_key.isin(meter_list_B.value))
    & (df1.magnitude_key == "13")
    & (df1.date_key >= '01588302000000')
    & (df1.date_key <= '01593572400000')
    & (df1.meas_int_key == '1')
)

Here meter_list_B is a broadcast list of about 15,000 string values.

Is this query optimal? Judging by how long it takes, I suspect the scan is not being pushed down efficiently.

Are MultiRowRangeFilters used in shc?
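
For illustration, here is a minimal Python sketch of the row ranges that the filter above corresponds to: one contiguous range per meter. It assumes the composite row key is the plain concatenation of the fixed-length fields in the order listed (key1, then key2, and so on) and that the values are ASCII. The table-level rowkey definition is not shown in this issue, so that ordering is an assumption. The helper names build_row_ranges and exclusive_stop are hypothetical and are not part of shc or the HBase API. The sketch only computes (start, stop) pairs of the kind a multi-range scan would need. It does not show what shc actually does.

```python
from typing import Iterable, List, Tuple


def exclusive_stop(prefix: bytes) -> bytes:
    """Return the smallest key that sorts after every key starting with `prefix`."""
    key = bytearray(prefix)
    while key:
        if key[-1] != 0xFF:
            key[-1] += 1
            return bytes(key)
        key.pop()  # a trailing 0xFF cannot be incremented, so carry to the previous byte
    return b""  # empty result means "no upper bound"


def build_row_ranges(
    meters: Iterable[str], date_lo: str, date_hi: str
) -> List[Tuple[bytes, bytes]]:
    """Build one [start, stop) row-key range per meter, sorted by start key.

    Assumes the key layout is meas_point_key (8 chars) followed by
    date_key (14 chars); this layout is an assumption, not confirmed by shc.
    """
    ranges = []
    for meter in sorted(set(meters)):  # deduplicate and sort the meter keys
        if len(meter) != 8:
            raise ValueError(f"meter key must be 8 characters: {meter!r}")
        start = (meter + date_lo).encode("ascii")
        # The stop key must still cover every row whose date_key equals date_hi,
        # whatever magnitude, interval, and source suffix follows it.
        stop = exclusive_stop((meter + date_hi).encode("ascii"))
        ranges.append((start, stop))
    return ranges


ranges = build_row_ranges(
    ["00000002", "00000001", "00000001"],
    "01588302000000",
    "01593572400000",
)
```

Under the assumed layout, magnitude_key and meas_int_key come after date_key in the row key, so they cannot narrow these ranges and would still have to be applied as filters on the rows each range returns.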