njtierney / syn

syn - the thesaurus
http://syn.njtierney.com/

overhaul the synonym data structure to take up less space #16

Closed coolbutuseless closed 5 years ago

coolbutuseless commented 5 years ago

This is a monster overhaul of the main "words" data structure.

Instead of storing the raw words, we split the data into two parts and store:

  1. A sorted list of all unique words
  2. An integer vector in place of each character vector of synonyms (each integer is an index into the list of all words)
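
The two-part structure above can be sketched roughly as follows. This is a minimal illustration, assuming the synonyms start out as a named list of character vectors; the toy data and the names `synonyms`, `all_words`, `encoded`, and `decode` are hypothetical and not taken from the package.

```r
# Hypothetical toy data: a named list of synonym vectors (not the real syn data).
synonyms <- list(
  big   = c("huge", "large"),
  huge  = c("big", "large"),
  small = c("little", "tiny")
)

# 1. A sorted character vector of every unique word (headwords and synonyms).
all_words <- sort(unique(c(names(synonyms), unlist(synonyms, use.names = FALSE))))

# 2. Replace each character vector with integer positions in all_words.
encoded <- lapply(synonyms, function(words) match(words, all_words))

# Decoding is plain integer subsetting of the word list.
decode <- function(word) all_words[encoded[[word]]]

decode("big")  # returns c("huge", "large")
```

Each word is stored once in `all_words`, and every synonym entry costs one 4-byte integer, while lookup stays a single indexing step.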

Storing integer vectors rather than character strings reduces memory usage by about 50%, and the compressed data is now under 5MB.

The downside is that creating the integer vectors from the word lists is slow, so you wouldn't want to do it dynamically.

The upsides: