tzaeschke / zoodb

ZooDB Object Database
Apache License 2.0
57 stars 9 forks source link

Change String magic number to be case insensitive #95

Open tzaeschke opened 7 years ago

tzaeschke commented 7 years ago

The String magic number (used for indexes) currently use 8 bits each and they are (thus) case sensitive.

Cutting off the first 3 bits would allow to use 9.6 instead of 6 character and would make the numbers case-insensitive. Cutting of another bit would allow 12 characters to to stored.

Unfortunately, this would also affect ordering of results...

tzaeschke commented 7 years ago

Problem: We need to combine SORTING with CASE insensitivity

Approaches: a) Two separate indexes b) Double entries in the same index (?) c) Define behavior during index creation, i.e. provide an option for index behaviour d) Combine!

Solutions: C) No bad at all... D) Sorting means ALPHABETICALLY sorted, so only 0..9 / A..Z need to be sorted, the rest can be arbitrarily sorted. We need to preserve ordering only for data in the ranges (char code) 48..57, 65..90 (and 97..122). Everything else can be mapped to, for example, 32..47.

Algorithm: a) if (x <32 || x >=128) { map to 0..15 // x &= 0x3F add 32 } else if (x >= 96) { sub 32 } b) sub 32 //move everything into 0..64 c) x <<= 2 //compress to 6 bit

This is case-insensitive, preserves the ordering and compresses the data to 6 bits, allowing 8 characters in the magic number. Obviously this approach assumes that Strings consist mostly of letters and numbers.