mndrix / golog

Prolog interpreter in Go
MIT License
374 stars 39 forks source link

optimize atom representation #14

Open mndrix opened 11 years ago

mndrix commented 11 years ago

Atoms (especially those used as functor names), are really just fancy integers with a human-readable form. I suspect that about 85% of all atoms can be mapped to a 64-bit integer. For those that can't be, fall back to the current string representation.

The simplest coding that could possibly work is:

Optimize that by extending the base-27 alphabet to a base-64 alphabet where the extra 37 symbols are the 37 most common bigrams in English text. A base-32 alphabet with only 5 bigrams might work better.

Optimize that with Huffman coding or arithmetic encoding to fit more symbols into a 64 bit integer (or float).

The real goal here is not to achieve the highest possible compression ratio (encoding long atoms into small integers). Rather, we want to encode the highest percentage of atoms into a fixed space.

If I've correctly understood Go's internal string representation, a single-character string occupies 17 bytes on a 64-bit machine.