PSeitz opened this issue 1 year ago
If you are interested, I could try to prepare another dataset with bounding boxes from multiple sources, stored as 4 floating point numbers representing WGS84 coordinates. Some sources for geographic information yield only points, so the 4 numbers sometimes collapse to only 2 distinct values. Otherwise, I think this is of relatively high cardinality, so probably a useful case to test the cardinality detection logic?
Yes, having more datasets would be nice. geo-data is probably a little special, since it ideally belongs to its own index that allows geo queries.
> geo-data is probably a little special, since it ideally belongs to its own index that allows geo queries.
It does, but we warm up those r* trees by reading the bounding boxes from Tantivy, i.e. https://gitlab.opencode.de/umwelt-info/metadaten/-/blob/main/src/index/bounding_box.rs
EDIT: Previously, we used a simpler filtering approach without the r* trees and while it was more than 50% slower, it still worked reasonably well.
As discussed, here is a list of lists of floating point values which represent bounding boxes of publicly available environmental/conservational metadata:
(Four values would form a rectangle, but we do store them in this flat manner in Tantivy as well. Most of the inner lists are actually empty because the documents do not have geometry attached but I suspected that this might still be useful for sparsity considerations.)
Problem Outline
Currently, tantivy stores f64 values uncompressed, 8 bytes per value, in its columnar storage. Float numbers are generally unsuitable for bitpacking, which requires a different approach than the existing codecs.
Datasets
Some datasets that include floating point values.
NYC Taxi
All float values seem to have a relatively low cardinality, with up to two decimals.
Taiwan Air Quality
Same as the taxi dataset - low cardinality, with up to two decimals.
Nooc Temperatures
240k temperatures between -99.0 and 60.0, 838 unique values.
SEC gov logs - EDGAR Log
Slightly weird: all numbers have a `.0` suffix. Some are highly repetitive, like `code`. Numbers with a decimal, even `.0`, are considered float in `serde_json`.
Decimal Detection
One approach to f64 compression is to detect decimal values where the number of decimals is relatively small (e.g., at most two). A very simple algorithm could just check whether the unparsed number is short as a string.
The numbers could then be stored as a decimal instead of a float, e.g. 100.23 => 10023 + num decimals (2).
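A minimal sketch of such a conversion, assuming we simply try increasing scales and only accept the decimal form if it round-trips losslessly (the function name and the cap on decimals are made up for illustration):

```rust
/// Try to represent an f64 as `mantissa * 10^-num_decimals` with at most
/// `max_decimals` decimals. Returns None if no such representation
/// round-trips back to the exact same f64.
fn to_decimal(val: f64, max_decimals: u32) -> Option<(i64, u32)> {
    for num_decimals in 0..=max_decimals {
        let scale = 10f64.powi(num_decimals as i32);
        let scaled = (val * scale).round();
        if scaled.abs() > i64::MAX as f64 {
            return None;
        }
        let mantissa = scaled as i64;
        // Accept only if converting back yields the identical f64,
        // so the encoding stays lossless.
        if mantissa as f64 / scale == val {
            return Some((mantissa, num_decimals));
        }
    }
    None
}

fn main() {
    assert_eq!(to_decimal(100.23, 2), Some((10023, 2)));
    assert_eq!(to_decimal(-99.0, 2), Some((-99, 0)));
    assert_eq!(to_decimal(std::f64::consts::PI, 2), None);
}
```

The mantissa could then be bitpacked like any other integer column, with the number of decimals stored once per column (or per block).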
Dictionary Compression
Dictionary compression works by putting all unique values in a dictionary and then storing only the ordinals, bitpacked. This works great for a small number of unique values.
Two possible approaches are considered: global dictionary compression and dictionary per block compression.
Global Dictionary Compression
Under the global dictionary compression scheme, a single dictionary is maintained for all data in a segment. The effectiveness of this method depends on the data distribution. For instance, data following Zipf's law or bimodal distributions is expected not to be a great fit.
Num bits saved vs f64 =
(64 - num_bits_dict) * num_values - num_unique_values * 64
where `num_bits_dict = ceil(log2(num_unique_values))` is the width of a bitpacked ordinal, `(64 - num_bits_dict) * num_values` are the bits saved per value across all values, and `num_unique_values * 64` is the cost of storing the dictionary entries themselves.
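A back-of-the-envelope version of that estimate in code (illustrative only; it deduplicates on the raw bit pattern and is not how tantivy decides anything today):

```rust
use std::collections::HashSet;

/// Rough estimate of the bits saved by a global dictionary compared to
/// storing every value as a raw 8-byte f64, following the formula above.
fn estimate_bits_saved(values: &[f64]) -> i64 {
    let num_values = values.len() as i64;
    // Deduplicate on the bit pattern so NaN / -0.0 don't cause surprises.
    let num_unique = values
        .iter()
        .map(|v| v.to_bits())
        .collect::<HashSet<u64>>()
        .len() as u64;
    // Bits per bitpacked ordinal: ceil(log2(num_unique)).
    let num_bits_dict = (u64::BITS - num_unique.saturating_sub(1).leading_zeros()) as i64;
    let ordinal_bits = num_bits_dict * num_values;
    let dictionary_bits = 64 * num_unique as i64;
    64 * num_values - ordinal_bits - dictionary_bits
}

fn main() {
    // 1 million values drawn from only 838 distinct temperatures.
    let values: Vec<f64> = (0..1_000_000).map(|i| (i % 838) as f64 / 10.0).collect();
    println!("estimated bits saved: {}", estimate_bits_saved(&values));
}
```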
Dictionary per Block Compression
Alternatively, the dictionary-per-block compression approach maintains a separate dictionary for each block of values. There are distributions where per-block compression would hurt the compression ratio by repeating the same dictionaries.
Global vs Block-based
To know whether a global or a block-based dictionary would be better, we would ideally know the distribution. Otherwise, some heuristic could check whether the unique values of different blocks are mostly the same.
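A hedged sketch of one such check, assuming fixed-size blocks and an arbitrary 50% overlap threshold (nothing here reflects an existing tantivy heuristic):

```rust
use std::collections::HashSet;

/// Heuristic: compare each block's unique values with those of earlier blocks.
/// If blocks mostly repeat values that earlier blocks already contain, a single
/// global dictionary avoids storing the same entries over and over; if each
/// block brings its own values, per-block dictionaries stay small.
fn prefer_global_dictionary(values: &[f64], block_size: usize) -> bool {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut shared = 0usize;
    let mut total_block_unique = 0usize;
    for block in values.chunks(block_size) {
        let block_unique: HashSet<u64> = block.iter().map(|v| v.to_bits()).collect();
        shared += block_unique.iter().filter(|b| seen.contains(*b)).count();
        total_block_unique += block_unique.len();
        seen.extend(block_unique);
    }
    // If, on average, more than half of a block's unique values were already
    // present in earlier blocks, per-block dictionaries would mostly repeat
    // the same entries. (The 50% threshold is arbitrary.)
    shared * 2 > total_block_unique
}
```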
Delayed Conversion
We could delay the lookup for global dictionaries and return the ordinal instead. This is useful in aggregations, for example, where we aggregate on terms and then fetch the top 10. For per-block dictionary compression this would not be possible, since the ordinals from different blocks are not comparable.
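A small sketch of what that delayed conversion buys, assuming a hypothetical terms aggregation that counts on ordinals and only resolves the final top-k through the global dictionary:

```rust
use std::collections::HashMap;

/// Count term frequencies on ordinals and resolve only the top-k ordinals
/// to their f64 values via the (hypothetical) global dictionary at the end.
fn top_k_terms(ordinals: &[u32], dictionary: &[f64], k: usize) -> Vec<(f64, usize)> {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &ord in ordinals {
        *counts.entry(ord).or_insert(0) += 1;
    }
    let mut sorted: Vec<(u32, usize)> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    // Only the top-k ordinals are ever looked up in the dictionary.
    sorted
        .into_iter()
        .take(k)
        .map(|(ord, count)| (dictionary[ord as usize], count))
        .collect()
}
```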
Cardinality Detection
To detect if a stream of numbers is a good candidate for dictionary compression, we need to estimate the number of unique values efficiently.
One common approach is HyperLogLog. It may require a lot of samples on Zipf distributions to be accurate and can be relatively expensive during estimation (~100 ms for 1 million elements).
Alternatively we could try to fit some distributions (on the first N elements) and extrapolate to the total unique count if one matches.
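A very rough sketch of the sampling idea, assuming we look only at a prefix and check whether the unique count is still growing (sample size, threshold, and the saturation criterion are all made up; skewed Zipf-like data can still fool it):

```rust
use std::collections::HashSet;

/// Very rough cardinality check on a prefix of the column: count how many new
/// unique values the second half of the sample adds compared to the first half.
/// If the unique count has largely stopped growing, the full column is likely
/// a good dictionary candidate.
fn looks_low_cardinality(values: &[f64], sample_size: usize) -> bool {
    let sample = &values[..sample_size.min(values.len())];
    let half = sample.len() / 2;
    let mut unique: HashSet<u64> = sample[..half].iter().map(|v| v.to_bits()).collect();
    let unique_first_half = unique.len();
    unique.extend(sample[half..].iter().map(|v| v.to_bits()));
    let new_in_second_half = unique.len() - unique_first_half;
    // Fewer than 10% new values in the second half => unique count is saturating.
    new_in_second_half * 10 < unique_first_half.max(1)
}
```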
Alternatives
XOR
XOR compression does not provide random access to the data, which makes it unsuitable. Additionally, the savings in the "Gorilla" paper are on highly repetitive data patterns (50% are the same value), which are not that common.
f32
We can save 4 bytes by storing the value as f32, at the cost of precision for values that are not exactly representable. We can likely do better.
Decomposed Bounded Floats
Decomposed bounded floats could be an alternative that allows random access. There are some parts I don't understand, like the loss of precision in the 3.14 example, from `3.1400001` to `3.13671875`, even though the algorithm is stated to be lossless.
Paper: http://vldb.org/pvldb/vol14/p2586-liu.pdf
Rust Impl: https://github.com/Tranway1/buff
Open Questions
On the analyzed datasets, dictionary compression should be a viable option to achieve a good compression ratio. In principle, all numbers with few digits after the decimal point are likely good candidates for dictionary compression.
It's unclear if we should have a global or a block-based dictionary.
If a block-based dictionary is preferred, how would we detect the size of the blocks? Where would the threshold be?