mldbai / mldb

MLDB is the Machine Learning Database
http://mldb.ai
Apache License 2.0

ID transducer #945

Closed jeremybarnes closed 2 years ago

jeremybarnes commented 2 years ago

Many fields in datasets combine a fixed format containing some non-varying characters (e.g., the "-" characters in UUIDs) with a reduced range of characters (e.g., hexadecimal). This patch detects fields whose values all have the same length and uses modulo-encoding to pack them into fixed-width integers that retain all of the entropy. For example, it reduces the storage required for hexadecimal IDs by a factor of two.
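To make the idea concrete, here is a minimal Python sketch of modulo (mixed-radix) encoding for fixed-length IDs. It is not the code in this patch: the function names (`build_tables`, `uuid_tables`, `encode`, `decode`, `bits_needed`) and the per-position table layout are assumptions made for illustration. Each character position gets its own alphabet; a position that only ever holds one character (such as "-") has radix 1 and contributes no bits.

```python
from math import prod

HEX = list("0123456789abcdef")

def build_tables(samples):
    """Hypothetical detection step: the sorted set of characters seen at each position."""
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all samples must have the same length")
    return [sorted({s[i] for s in samples}) for i in range(length)]

def uuid_tables():
    """Hand-written tables for lowercase UUIDs: hex digits, with '-' at four fixed positions."""
    return [["-"] if i in (8, 13, 18, 23) else HEX for i in range(36)]

def encode(text, tables):
    """Pack text into one integer, using each position's alphabet size as its radix."""
    if len(text) != len(tables):
        raise ValueError("text length does not match the tables")
    value = 0
    for ch, alphabet in zip(text, tables):
        # alphabet.index raises ValueError for a character not in the table
        value = value * len(alphabet) + alphabet.index(ch)
    return value

def decode(value, tables):
    """Invert encode by peeling digits off from the last position to the first."""
    chars = []
    for alphabet in reversed(tables):
        value, digit = divmod(value, len(alphabet))
        chars.append(alphabet[digit])
    return "".join(reversed(chars))

def bits_needed(tables):
    """Width of the smallest integer that can hold every encodable value."""
    return max(1, (prod(len(a) for a in tables) - 1).bit_length())
```

With `uuid_tables()`, a 36-character UUID string packs into 128 bits (16 bytes), and `decode(encode(u, tables), tables)` returns the original string. A real implementation would also need a fallback for values that do not match the detected format, which this sketch handles only by raising `ValueError`.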

It's a first step; there is much more that can be done.