pteichman / cobe

A Markov chain based text generation library and MegaHAL style chatbot
http://teichman.org/blog/
MIT License

Increase n-gram count #31

Closed CrazyPython closed 7 years ago

CrazyPython commented 7 years ago

"cobe creates a graph of directed graph of n-grams (default n=3)"

Is there any way to increase this limit?

pteichman commented 7 years ago

This ngram order is stored in cobe's database and can't be changed on an existing database, so you'll need to create a new model and retrain from scratch.

To do that, you have to manually init a new database. On the command line:

$ cobe init --order 5

Or in code:

from cobe.brain import Brain

Brain.init("cobe.brain", order=5)

If you aren't already storing your training corpus on its own, now might be a good time to start. It's always nice to be able to retrain from scratch. Cobe's successor (fate) takes this to an extreme: it's fast enough to always retrain on start, so it works directly with a one-sentence-per-line training corpus and stores nothing on disk.
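For example, a retrain-from-scratch pass over a one-sentence-per-line corpus could look roughly like this (a sketch: corpus.txt is a made-up filename, and it uses the Brain class from cobe.brain as above):

from cobe.brain import Brain

# Create a fresh database with a higher n-gram order, then feed it the
# corpus one sentence at a time. corpus.txt is a hypothetical file with
# one sentence per line.
Brain.init("cobe.brain", order=5)

brain = Brain("cobe.brain")
with open("corpus.txt") as corpus:
    for line in corpus:
        line = line.strip()
        if line:
            brain.learn(line)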

CrazyPython commented 7 years ago

It's always nice to be able to retrain from scratch. Cobe's successor (fate) takes this to an extreme: it's fast enough to always retrain on start, so it works directly with a one-sentence-per-line training corpus and stores nothing on disk.

That's a good idea. The database is much larger on disk than the corpus text fed in, and downloading it would be slower than retraining.

Is there any way to do basic extraction of the input text from a database?

CrazyPython commented 7 years ago

Is there a Python API for fate?

CrazyPython commented 7 years ago

The current chatbot I'm creating uses this algorithm to generate sentences (a rough sketch follows the list):

  1. Analyze the sentiment of the incoming sentence (nltk.sentiment)
  2. Add/subtract its pos and neg scores to/from a running feelings variable
  3. Generate several replies and select the one closest to the current feelings
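
A rough sketch of that loop using NLTK's VADER analyzer; pick_reply() expects a list of candidate replies from whatever model (cobe or fate) produces them:

from nltk.sentiment import SentimentIntensityAnalyzer

# Requires the VADER lexicon: nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()
feelings = {"pos": 0.0, "neg": 0.0}

def update_feelings(sentence):
    # Steps 1-2: score the incoming sentence and accumulate its pos/neg.
    scores = analyzer.polarity_scores(sentence)
    feelings["pos"] += scores["pos"]
    feelings["neg"] += scores["neg"]

def pick_reply(candidates):
    # Step 3: pick the candidate whose sentiment is closest to the
    # current feelings (Euclidean distance over pos/neg).
    def distance(reply):
        scores = analyzer.polarity_scores(reply)
        return ((scores["pos"] - feelings["pos"]) ** 2 +
                (scores["neg"] - feelings["neg"]) ** 2) ** 0.5
    return min(candidates, key=distance)
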
pteichman commented 7 years ago

There's no way to extract the training text from the database: by that point it has passed through the tokenizer (which may discard information) and is locked to the ngram order of the original database.

fate doesn't have a Python API, but it does include a web server: you can query the model with a request like http://host/reply?q=input. Fate doesn't have a scorer, so you have full control over timing, but it's locked to a trigram model.
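
For example, a minimal client for that endpoint might look like this (a sketch: the /reply?q= path comes from the description above, but the host, port, and plain-text response body are assumptions):

from urllib.parse import urlencode
from urllib.request import urlopen

def fate_reply(text, host="http://localhost:8080"):
    # Ask the fate web server for a reply to `text`. The port and the
    # plain-text response format are assumptions, not documented behavior.
    url = "%s/reply?%s" % (host, urlencode({"q": text}))
    with urlopen(url) as resp:
        return resp.read().decode("utf-8").strip()

print(fate_reply("hello there"))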

CrazyPython commented 7 years ago

@pteichman That's fine - how much better is fate?

pteichman commented 7 years ago

In terms of reply quality it's not much different: still a random walk on a trigram model.

On my end, the code is a lot simpler. I didn't bother with a complex tokenizer function or scoring because I found that results are good without either. Of course this still leaves you the flexibility to put your own scorer on top.
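
As a sketch of what a scorer on top could look like, assuming the hypothetical fate_reply() helper from earlier and any scoring function you like:

def best_reply(text, attempts=10, score=len):
    # Ask fate for several candidate replies and keep the highest-scoring
    # one. score=len is only a placeholder; swap in sentiment distance or
    # whatever fits your bot.
    candidates = {fate_reply(text) for _ in range(attempts)}
    return max(candidates, key=score)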

Operationally it's a lot simpler since it doesn't maintain its own database. I've liked being able to focus on the training corpus as text files rather than thinking about the state of the database, and it allows you to use standard unix tools to extract a corpus from other data.

Performance-wise, there's no contest. Cobe was designed to minimize memory usage, almost to a fault: it was originally running on a machine with 64 MB of RAM. It makes an on-disk database query for every word in a candidate reply. Fate serves its model directly from memory.

Looking at my current benchmarks, where cobe may take 100ms to generate a candidate reply, fate takes 90μs. It trains my IRC corpus (~731k lines) in 8 seconds (4 MB/s), where cobe takes 13m22s (43 kB/s).

CrazyPython commented 7 years ago

@pteichman It seems fate should be my choice.

CrazyPython commented 7 years ago

@pteichman There aren't very many docs on fate's HTTP API, or on how to start a server.

CrazyPython commented 7 years ago

Reopening this issue: how does one start the server?