Closed: greg76 closed this issue 4 years ago
Hi @greg76, and thanks for your interest in the library. I think this is an interesting topic. Given that it's not a bug or feature request, I'm going to close the issue, but I'd still be happy to continue the discussion here.
Tokenization does seem like a promising approach. You might even be able to handle this entirely outside of markovify: pre-convert every word other than your sentence-ending tokens (periods, newlines, etc., depending on what you're using to split sentences) to an integer token before handing the text off to markovify, then convert markovify's output back to the original strings using a conversion dictionary produced in that first step.
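A minimal sketch of that round-trip, kept entirely outside markovify. The helper names (`build_vocab`, `encode`, `decode`) are illustrative, and the stand-in for markovify's output is a placeholder; in practice you would hand the encoded stream to markovify and decode whatever it generates.

```python
def build_vocab(words):
    """Assign each distinct word a compact integer id."""
    vocab = {}
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab)
    return vocab

def encode(words, vocab):
    """Replace words with their integer ids, kept as strings so a
    text-oriented chain builder can still consume them. Sentence
    terminators (periods, newlines) would be left untouched."""
    return [str(vocab[w]) for w in words]

def decode(tokens, vocab):
    """Map integer tokens back to the original words."""
    reverse = {str(i): w for w, i in vocab.items()}
    return [reverse[t] for t in tokens]

corpus = "the quick fox jumps over the lazy dog".split()
vocab = build_vocab(corpus)
encoded = encode(corpus, vocab)
# ... hand `encoded` to markovify here and collect its output ...
generated = encoded  # stand-in for markovify's generated token stream
print(" ".join(decode(generated, vocab)))
```

Since the conversion dictionary is built in the first step, the decode step is a straight reverse lookup and the chain itself only ever stores short integer strings.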
First of all, markovify is great, and thanks for sharing it. I have written my own Markov text generator before, but there were still plenty of new things to learn here. So big kudos!
The issue I have is that I'm trying to use text generation in a memory-constrained environment (serverless functions), and depending on the corpus size the export can get fairly big. Saving the chains to a pickle won't help much, since most of the exported data is just text. I was thinking of two ways to decrease the amount of memory used:
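One cheap complement to any remapping scheme: since the export is mostly repeated text, it compresses very well on disk. The sketch below uses a hand-built dictionary as a stand-in for a chain export (in practice it would be the JSON string from markovify's `to_json()`), and simply gzips it before storage.

```python
import gzip
import json

# Stand-in for a markovify export: a repetitive mapping of word-pair
# states to follower lists, which is what makes chain exports large
# but also highly compressible.
export_dict = {f"word{i} word{i+1}": [f"word{i+2}"] for i in range(200)}
export = json.dumps(export_dict)

raw = export.encode("utf-8")
compressed = gzip.compress(raw)
print(len(raw), len(compressed))

# Round-trip: decompress before rebuilding the model
# (e.g. via markovify.Text.from_json(restored)).
restored = gzip.decompress(compressed).decode("utf-8")
assert restored == export
```

This only shrinks the stored artifact; the in-memory footprint after loading is unchanged, which is why the tokenization idea is still worth pursuing alongside it.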
I'll try to write a post-processor for the markovify export, but I'm curious how you would approach this, or what you would recommend.