lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming..
https://lancedb.github.io/lance/
Apache License 2.0
3.93k stars 215 forks source link

Don't use MaterializeIndex if there is a block list #1896

Open westonpace opened 9 months ago

westonpace commented 9 months ago

Given the query SELECT * FROM ... WHERE id = 7, assuming a scalar index on id, we generate a small allow_list, materialize it, and feed it to take. This works well.

Given the query SELECT * FROM ... WHERE id != 7, we do the same thing, this does not work well.

Instead, we should do a full scan, then apply a filter to remove the rows that are in the block list (or just don't use the index at all)

Note: this only applies to queries WITHOUT a vector search. When there is a vector index we don't need to materialize the scalar index results and so the rationale is completely different.

westonpace commented 9 months ago

Some benchmarks (@50M rows) to justify this:

Name (time in ms)                                                                 Min                   Max                  Mean              StdDev                Median                 IQR            Outliers       OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_scalar_index_search[clean-equality]                                       1.6202 (1.0)          2.4431 (1.0)          1.7421 (1.0)        0.1202 (1.0)          1.7206 (1.0)        0.0908 (1.0)          13;9  574.0160 (1.0)         121           1
test_scalar_index_search[clean-in_list_one]                                    1.6834 (1.04)        11.8010 (4.83)         2.4469 (1.40)       2.2548 (18.76)        1.8335 (1.07)       0.2162 (2.38)          3;6  408.6851 (0.71)         54           1
test_scalar_index_search[with_new_rows-in_list_one]                            1.7134 (1.06)         3.3687 (1.38)         2.0622 (1.18)       0.2632 (2.19)         2.0226 (1.18)       0.1898 (2.09)        27;13  484.9220 (0.84)        182           1
test_scalar_index_search[with_new_rows-equality]                               1.7388 (1.07)         4.1352 (1.69)         2.0584 (1.18)       0.3217 (2.68)         2.0105 (1.17)       0.2745 (3.02)          5;4  485.8196 (0.85)         93           1
test_scalar_index_search[clean-less_than_selective]                            1.9507 (1.20)         3.4980 (1.43)         2.1856 (1.25)       0.1998 (1.66)         2.1458 (1.25)       0.1568 (1.73)        29;16  457.5327 (0.80)        298           1
test_scalar_index_search[with_new_rows-less_than_selective]                    2.0266 (1.25)         7.6367 (3.13)         2.4746 (1.42)       0.4396 (3.66)         2.3821 (1.38)       0.3090 (3.40)        17;12  404.1098 (0.70)        349           1
test_scalar_index_search[with_delete_files-equality]                           3.0786 (1.90)         6.1823 (2.53)         3.4407 (1.98)       0.3387 (2.82)         3.3746 (1.96)       0.2119 (2.33)         12;8  290.6377 (0.51)        137           1
test_scalar_index_search[with_delete_files-in_list_one]                        3.4059 (2.10)         4.5061 (1.84)         3.7989 (2.18)       0.2360 (1.96)         3.7727 (2.19)       0.3145 (3.46)         44;3  263.2332 (0.46)        124           1
test_scalar_index_search[with_delete_files-less_than_selective]                3.6465 (2.25)         9.7798 (4.00)         4.0971 (2.35)       0.5013 (4.17)         4.0490 (2.35)       0.2774 (3.06)          6;6  244.0722 (0.43)        157           1
test_scalar_index_search[clean-none]                                         696.2231 (429.70)     782.1625 (320.16)     734.0862 (421.38)    38.3136 (318.72)     723.0069 (420.21)    68.9626 (759.59)        1;0    1.3622 (0.00)          5           1
test_scalar_index_search[with_delete_files-none]                             739.9145 (456.67)     761.9250 (311.87)     749.1443 (430.02)     8.5249 (70.92)      747.1020 (434.22)    11.9411 (131.53)        2;0    1.3349 (0.00)          5           1
test_scalar_index_search[with_new_rows-none]                                 742.3391 (458.16)     792.4247 (324.36)     764.7724 (438.99)    19.8576 (165.19)     764.8674 (444.54)    30.7470 (338.66)        2;0    1.3076 (0.00)          5           1
test_scalar_index_search[clean-not_equality]                               1,129.8565 (697.34)   1,363.5540 (558.13)   1,227.5580 (704.64)   110.5293 (919.48)   1,159.5230 (673.92)   191.5590 (>1000.0)       1;0    0.8146 (0.00)          5           1
test_scalar_index_search[clean-not_in_list_one]                            1,134.1074 (699.96)   1,263.9427 (517.36)   1,173.0443 (673.35)    52.7296 (438.65)   1,150.1651 (668.48)    53.5838 (590.20)        1;0    0.8525 (0.00)          5           1
test_scalar_index_search[clean-not_equality_and_chain]                     1,148.5113 (708.85)   1,209.9530 (495.26)   1,191.4977 (683.94)    24.6590 (205.13)   1,197.2188 (695.82)    21.5631 (237.51)        1;1    0.8393 (0.00)          5           1
test_scalar_index_search[clean-in_list_three]                              1,158.0684 (714.75)   1,239.1014 (507.19)   1,207.8667 (693.33)    34.6018 (287.85)   1,224.0058 (711.39)    54.9572 (605.33)        1;0    0.8279 (0.00)          5           1
test_scalar_index_search[with_new_rows-not_equality]                       1,167.2066 (720.39)   1,359.3954 (556.43)   1,254.0329 (719.84)    89.0691 (740.95)   1,213.8336 (705.48)   160.5098 (>1000.0)       1;0    0.7974 (0.00)          5           1
test_scalar_index_search[with_new_rows-not_in_list_one]                    1,173.1351 (724.05)   1,372.3914 (561.75)   1,284.1288 (737.11)    76.9859 (640.43)   1,275.4549 (741.29)   108.1893 (>1000.0)       2;0    0.7787 (0.00)          5           1
test_scalar_index_search[with_new_rows-in_list_three]                      1,183.4061 (730.39)   1,386.9354 (567.70)   1,251.1625 (718.19)    78.5276 (653.26)   1,231.9633 (716.02)    60.4255 (665.56)        1;1    0.7993 (0.00)          5           1
test_scalar_index_search[with_new_rows-not_equality_and_chain]             1,193.9693 (736.91)   1,357.6298 (555.71)   1,266.2167 (726.83)    71.5113 (594.89)   1,227.2801 (713.30)   117.1642 (>1000.0)       2;0    0.7898 (0.00)          5           1
test_scalar_index_search[with_delete_files-not_equality]                   2,876.2944 (>1000.0)  3,071.6768 (>1000.0)  2,968.8758 (>1000.0)   86.7243 (721.45)   2,970.9656 (>1000.0)  159.4035 (>1000.0)       2;0    0.3368 (0.00)          5           1
test_scalar_index_search[with_delete_files-not_in_list_one]                2,895.8059 (>1000.0)  3,083.3901 (>1000.0)  3,011.1086 (>1000.0)   69.3674 (577.06)   3,024.4056 (>1000.0)   57.9921 (638.76)        2;1    0.3321 (0.00)          5           1
test_scalar_index_search[with_delete_files-not_equality_and_chain]         2,910.2556 (>1000.0)  3,158.7741 (>1000.0)  3,035.7476 (>1000.0)   95.1737 (791.73)   3,062.0933 (>1000.0)  132.2388 (>1000.0)       2;0    0.3294 (0.00)          5           1
test_scalar_index_search[with_delete_files-in_list_three]                  2,914.8616 (>1000.0)  3,195.7641 (>1000.0)  3,038.2834 (>1000.0)  109.5526 (911.35)   3,001.0621 (>1000.0)  156.2868 (>1000.0)       2;0    0.3291 (0.00)          5           1
test_scalar_index_search[with_new_rows-greater_than_not_selective]         2,953.6200 (>1000.0)  3,191.9999 (>1000.0)  3,023.4207 (>1000.0)   96.0136 (798.72)   2,997.4851 (>1000.0)   76.6214 (843.95)        1;1    0.3308 (0.00)          5           1
test_scalar_index_search[clean-greater_than_not_selective]                 2,962.4572 (>1000.0)  3,107.2330 (>1000.0)  3,019.5229 (>1000.0)   63.1493 (525.33)   2,996.0779 (>1000.0)  106.7595 (>1000.0)       1;0    0.3312 (0.00)          5           1
test_scalar_index_search[with_delete_files-greater_than_not_selective]     4,597.3313 (>1000.0)  4,869.1265 (>1000.0)  4,762.7918 (>1000.0)  108.6766 (904.06)   4,784.1855 (>1000.0)  159.1199 (>1000.0)       1;0    0.2100 (0.00)          5           1

Note that clean-not_equality is 2x slower than a full scan.