quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
11.78k stars 651 forks source link

Support f64 Compression #2114

Open PSeitz opened 1 year ago

PSeitz commented 1 year ago

Problem Outline

Currently, tantivy stores f64 values uncompressed 8 bytes per value in its columnar storage. Generally float numbers are unsuitable for bitpacking, which requires a different approach, than the existing codecs.

Datasets

Some datasets that include floating numbers.

NYC Taxi

All float values seem to have a relative low cardinality, with up to two decimals.

{"DOLocationID":140,"PULocationID":239,"RatecodeID":1,"VendorID":1,"airport_fee":null,"congestion_surcharge":null,"extra":0.5,"fare_amount":14.0,"improvement_surcharge":0.3,"mta_tax":0.5,"passenger_count":1,"payment_type":2,"store_and_fwd_flag":"N","tip_amount":0.0,"tolls_amount":0.0,"total_amount":15.3,"tpep_dropoff_datetime":"2018-01-01 01:03:05 +00:00","tpep_pickup_datetime":"2018-01-01 00:44:55 +00:00","trip_distance":2.7}

Taiwan Air Quality

Same as taxi dataset - low cardinality, with up to two decimals.

{"AQI":26,"CO":"0.37","CO_8hr":0.42,"County":"臺北市","DataCreationDate":"2016-11-25 13:00","Latitude":"","Longitude":"","NO":6.6,"NO2":16.0,"NOx":"22.0","O3":"27","O3_8hr":"31.0","PM10":15.0,"PM10_AVG":10.0,"PM2.5":10.0,"PM2.5_AVG":7.0,"Pollutant":"nan","SO2":1.6,"SO2_AVG":null,"SiteId":"","SiteName":"古亭","Status":"良好","Unit":"","WindDirec":85.0,"WindSpeed":4.5}

Nooc Temperatures

240k Temperatures between -99.0 to 60.0 838 unique values

Sec gov logs - EDGAR Log

Slightly weird, all numbers have .0 suffix. Some are highly repetitive like code. Numbers with a decimal, even .0 a considered float in serde_json.

{"accession":"0001351930-07-000004","time":"00:00:00","zone":0.0,"noagent":0.0,"code":200.0,"date":"2017-06-30","cik":1372211.0,"extention":"-index.htm","idx":1.0,"browser":"","crawler":0.0,"ip":"107.23.85.jfd","size":2756.0,"norefer":0.0,"find":10.0}

Decimal Detection

One approach to f64 compression is to detect decimal values when the number of decimals are relatively small (e.g., less than 2). A very simple algorithm could just check the length of the unparsed number as string is short.

The numbers could then be stored as decimal instead of float, e.g. 100.23 => 10023 + num decimals (2)

Dictionary Compression

Dictionary compression works by putting all unique values in a dictionary and then storing only the ordinals bitpacked. Works great for few unique values.

Two possible approaches are considered: global dictionary compression and dictionary per block compression.

Global Dictionary Compression

Under the global dictionary compression scheme, a single dictionary is maintained for all data in a segment. The effectiveness of this method depends on the data distribution. For instance, data distributions following the Zipf's law or bimodal distributions are expected not to be a great fit.

Num bits saved vs f64 = 64 * num values - [num bits saved per val (64 - num bits dict) * number of values] - [num values dicitionary * num bits dict]

Dictionary per Block Compression

Alternatively, the dictionary per block compression approach involves maintaining a separate dictionary for each block of values. There are distributions where a per block compression would hurt compression ratio by repeating the same dictionaries.

Global vs Block-based

To know if a block a global dictionary would be better ideally we would know the distribution. Otherwise maybe some algorithm could check if the number of unique values between blocks are the same.

Delayed Conversion

We could delay the lookup for global dictionaries and return the ordinal instead. This is useful in aggregation for example, where we aggregate on terms and then fetch the top 10. For per Block dictionary compression this would not be possible, since the ordinals from different blocks are not comparable.

Cardinality Detection

To detect if a stream of numbers is a good candidate for dictionary compression, we need to estimate the number of unique values efficiently.

One common approach is HyperLogLog. It may require a lot of samples on Zipfs distribution to be accurate and can be relative expensive during estimation (~100ms for 1mio els).

Alternatively we could try to fit some distributions (on the first N elements) and extrapolate to the total unique count if one matches.

Alternatives

XOR

XOR compression does not provide random access to the data, which makes it unsuitable. Additionally the savings in the "Gorilla" paper, are on highly repetitive data patterns (50% are the same value), which are not that common.

f32

We can save 4bytes by storing the value as f32. We can likely do better.

Decomposed Bounded Floats

Decomposed bounded floats could be an alternative that allows random access. There are some parts I don't understand like there's some loss of precision in the 3.14 example, from 3.1400001 to 3.13671875, but the algorithm is stated to be lossless. Paper: http://vldb.org/pvldb/vol14/p2586-liu.pdf Rust Impl: https://github.com/Tranway1/buff

Open Questions

On the analyzed datasets, dictionary compression should be a viable option to achieve a good compression ratio. In principle all numbers with few numbers after the decimal are likely good candidates for dictionary compression.

It's unclear if we should have

If a block-based dictionary would be preferred, how would we detect the size of the blocks. Where would be the threshold?

adamreichold commented 1 year ago

If you are interested, I could try to prepare another dataset with bounding boxes stored from multiple sources stored as 4 floating point numbers per dataset representing WGS84 coordinates. Some sources for geographic information yield only points, so the 4 numbers are only 2 distinct values sometimes. Otherwise, I think this is of relatively high cardinality , so probably a useful case to test the cardinality detection logic?

PSeitz commented 1 year ago

Yes, having more datasets would be nice. geo-data is probably a little special, since it ideally belongs to an own index that allows geo queries.

adamreichold commented 1 year ago

geo-data is probably a little special, since it ideally belongs to an own index, that allows geo queries.

It does, but we warm up those r* trees by reading the bounding boxes from Tantivy, i.e. https://gitlab.opencode.de/umwelt-info/metadaten/-/blob/main/src/index/bounding_box.rs

EDIT: Previously, we used a simpler filtering approach without the r* trees and while it was more than 50% slower, it still worked reasonably well.

adamreichold commented 1 year ago

As discussed, here is a list of lists of floating point values which represent bounding boxes of publicly available environmental/conservational metadata:

bounding_boxes.json.gz

(Four values would form a rectangle, but we do store them in this flat manner in Tantivy as well. Most of the inner lists are actually empty because the documents do not have geometry attached but I suspected that this might still be useful for sparsity considerations.)