Decreasing export size / memory usage

First of all, markovify is great and thanks for sharing it. I have written my own Markov text generator, but there were still plenty of new things to learn here. So big kudos!

The issue I have is that I am trying to use text generation in a memory-constrained environment (serverless functions), and depending on the corpus size the export can get fairly big. Saving the chains to pickle won't help much, as most of the exported data is just text. I was thinking of two ways to decrease the amount of memory used:

Tokenisation: instead of repeating the words in multiple places (in the "begins with" tuples and in the "followed by" words), you could use integer tokens.
Frequency threshold: e.g. a given combination of words would only get into the chain if it occurs more than once in the corpus. This is more brutal. I wonder if I can just throw away entries from the chain with a "weight" of 1 (I would have to calculate it from the cumulative weight value you are including in the export), or whether I should also check for broken chains and possibly swap some removed entry references with "END".
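
To make the tokenisation idea concrete, here is a minimal sketch. It assumes the chain can be viewed as a dict mapping state tuples to `{next_word: count}` dicts. That shape, the sentinel strings, and the names `tokenise` and `detokenise` are my own illustrative assumptions, not markovify's API.

```python
from typing import Dict, List, Tuple

# Assumed shape of the chain: {state tuple: {next word: count}}.
WordModel = Dict[Tuple[str, ...], Dict[str, int]]
TokenModel = Dict[Tuple[int, ...], Dict[int, int]]


def tokenise(model: WordModel) -> Tuple[List[str], TokenModel]:
    """Store each distinct word once and refer to it by its list index."""
    vocab: List[str] = []
    ids: Dict[str, int] = {}

    def token(word: str) -> int:
        # Assign the next free integer the first time a word is seen.
        if word not in ids:
            ids[word] = len(vocab)
            vocab.append(word)
        return ids[word]

    encoded: TokenModel = {}
    for state, followers in model.items():
        key = tuple(token(w) for w in state)
        encoded[key] = {token(w): n for w, n in followers.items()}
    return vocab, encoded


def detokenise(vocab: List[str], encoded: TokenModel) -> WordModel:
    """Inverse of tokenise: map integer tokens back to words."""
    return {
        tuple(vocab[i] for i in state): {vocab[i]: n for i, n in followers.items()}
        for state, followers in encoded.items()
    }


# Tiny hand-written example; the sentinel strings are assumptions.
sample: WordModel = {
    ("___BEGIN__", "___BEGIN__"): {"the": 2},
    ("___BEGIN__", "the"): {"cat": 1, "dog": 1},
    ("the", "cat"): {"sat": 1},
    ("the", "dog"): {"sat": 1},
    ("cat", "sat"): {"___END__": 1},
    ("dog", "sat"): {"___END__": 1},
}
vocab, encoded = tokenise(sample)
```

Each word then appears once in `vocab`, however many states and follower lists mention it. How much this saves would depend on the serialisation format and on how repetitive the corpus is, so it would need measuring on a real export.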

I'll try to write a post-processor for the markovify export, but I am curious how you would approach this topic or what you would recommend.
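
A post-processor for the frequency threshold could look roughly like the sketch below. It assumes the chain is a dict mapping state tuples to `{next_word: count}` dicts holding plain counts (cumulative weights would have to be converted to per-entry counts first). The sentinel names and the `prune` function are hypothetical, chosen for illustration only.

```python
from typing import Dict, Tuple

BEGIN = "___BEGIN__"  # assumed sentinel for sentence start
END = "___END__"      # assumed sentinel for sentence end

Model = Dict[Tuple[str, ...], Dict[str, int]]


def prune(model: Model, min_count: int = 2) -> Model:
    """Drop rare transitions, then repair transitions left dangling."""
    # Pass 1: keep only transitions seen at least min_count times.
    # States that lose all of their followers disappear entirely.
    kept: Model = {}
    for state, followers in model.items():
        frequent = {w: n for w, n in followers.items() if n >= min_count}
        if frequent:
            kept[state] = frequent

    # Pass 2: a surviving transition may lead to a state removed in pass 1.
    # Redirect it to END so that generation can always terminate.
    repaired: Model = {}
    for state, followers in kept.items():
        fixed: Dict[str, int] = {}
        for word, count in followers.items():
            next_state = state[1:] + (word,)
            if word != END and next_state not in kept:
                word = END
            fixed[word] = fixed.get(word, 0) + count
        repaired[state] = fixed
    return repaired


# Tiny hand-written example with a state size of 1.
sample: Model = {
    (BEGIN,): {"the": 3},
    ("the",): {"cat": 2, "dog": 1},
    ("cat",): {"sat": 2},
    ("dog",): {"sat": 1},
    ("sat",): {END: 1, "down": 2},
    ("down",): {END: 1},
}
pruned = prune(sample)
```

In this example the state `("down",)` is removed because its only transition has a count of 1, so the transition from `("sat",)` to `"down"` is redirected to `END`. Two caveats: rare `END` transitions are dropped as well, and if every transition out of the begin state is rare, the begin state itself vanishes and nothing can be generated, so a real post-processor would probably need to treat those cases specially.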
