mureni opened this issue 3 years ago
First and foremost, this is a very old format and not well thought out. There are many more efficient and reasonable ways to accomplish these tasks, and future versions are likely to drop support for this schema.
However, until further changes are made, a trainer JSON schema is something along the lines of the following, as expressed in TypeScript:
```ts
// Word separator expected to be Box Drawings Light Vertical '│' (U+2502), NOT the regular bar character '|'
const WordSeparator = String.fromCharCode(9474);

interface FrequencyJSON {
  [word: string]: number; // Map of words to the number of times each word should be represented in weighted algorithms
}

interface nGramJSON {
  [hash: string]: { // Hash is all the tokens joined by the word separator character shown above
    t: string[],      // String array of this ngram's tokens, used to generate its hash
    s: boolean,       // Can this ngram start a sentence?
    e: boolean,       // Can this ngram end a sentence?
    n: FrequencyJSON, // Map of possible next words and their associated frequencies
    p: FrequencyJSON  // Map of possible previous words and their associated frequencies
  }
}

interface LexiconJSON {
  [word: string]: string[]; // Map of unique words to a string array of all the ngram hashes each word appears in
}
```
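For illustration, the hash described in `nGramJSON` can be computed by joining an ngram's tokens with the separator. This is a minimal sketch; the helper name `hashOf` is an assumption, not part of the schema:

```typescript
// Word separator: Box Drawings Light Vertical '│' (U+2502), decimal 9474
const WordSeparator = String.fromCharCode(9474);

// Join an ngram's tokens into the hash used as its key in nGramJSON
// (hypothetical helper name, shown for illustration only)
const hashOf = (tokens: string[]): string => tokens.join(WordSeparator);
```

So `hashOf(["an", "example", "trigram"])` produces `"an│example│trigram"`, the key format used in the example below.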
If a trainer file were generated from the text "an example trigram with another trigram", it would be represented in JSON as follows:
```jsonc
{
  "Lexicon": {
    "an": [ "an│example│trigram" ],
    "example": [ "an│example│trigram", "example│trigram│with" ],
    "trigram": [ "an│example│trigram", "example│trigram│with", "trigram│with│another", "with│another│trigram" ],
    "with": [ "example│trigram│with", "trigram│with│another", "with│another│trigram" ],
    "another": [ "trigram│with│another", "with│another│trigram" ]
  },
  "nGrams": {
    "an│example│trigram": {
      "t": ["an", "example", "trigram"],
      "s": true,
      "e": false,
      "n": { "with": 1 },
      "p": {}
    },
    /* other trigrams omitted for brevity... */
    "with│another│trigram": {
      "t": ["with", "another", "trigram"],
      "s": false,
      "e": true,
      "n": {},
      "p": { "trigram": 1 }
    }
  }
}
```

Note that `Lexicon`, `nGrams`, `n`, and `p` are all objects (maps), not arrays, matching the interfaces above.
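As a rough sketch (not the project's actual training code), the structures above could be produced from a line of text like this, assuming trigrams and whitespace tokenization:

```typescript
const WordSeparator = String.fromCharCode(9474); // '│' (U+2502)

interface FrequencyJSON { [word: string]: number; }
interface NGramData { t: string[]; s: boolean; e: boolean; n: FrequencyJSON; p: FrequencyJSON; }

// Build Lexicon and nGrams maps from one line of text (hypothetical helper)
function train(text: string, order = 3) {
  const words = text.trim().split(/\s+/);
  const lexicon: { [word: string]: string[] } = {};
  const nGrams: { [hash: string]: NGramData } = {};

  for (let i = 0; i + order <= words.length; i++) {
    const tokens = words.slice(i, i + order);
    const hash = tokens.join(WordSeparator);
    const gram = nGrams[hash] ?? (nGrams[hash] = { t: tokens, s: false, e: false, n: {}, p: {} });

    // Mark sentence boundaries: first and last ngram of the line
    if (i === 0) gram.s = true;
    if (i + order === words.length) gram.e = true;

    // Record next/previous word frequencies
    const next = words[i + order];
    if (next !== undefined) gram.n[next] = (gram.n[next] ?? 0) + 1;
    const prev = words[i - 1];
    if (prev !== undefined) gram.p[prev] = (gram.p[prev] ?? 0) + 1;

    // Lexicon: each token maps to every ngram hash it appears in
    for (const word of tokens) {
      const hashes = lexicon[word] ?? (lexicon[word] = []);
      if (!hashes.includes(hash)) hashes.push(hash);
    }
  }
  return { Lexicon: lexicon, nGrams };
}
```

Running `train("an example trigram with another trigram")` reproduces the example above: the first trigram gets `s: true` with `n: { "with": 1 }`, the last gets `e: true` with `p: { "trigram": 1 }`, and `"trigram"` appears in four hashes in the Lexicon.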
As for how the SQLite database is formatted: it is similar, but replaces `t` with `tokens`, `s` with `canStart`, `e` with `canEnd`, `n` with `nextTokens`, and `p` with `previousTokens`, and adds a property `__ctor` with a value of either `Map` or `Set`, depending on whether the field is an array of keyed objects (`Map`) or an array of strings (`Set`).
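The renaming described above amounts to a simple key map. This is a sketch under the assumptions of the JSON shape shown earlier; the helper name `toSQLiteShape` is hypothetical:

```typescript
// Mapping from the terse JSON keys to the SQLite property names described above
const KeyMap: { [short: string]: string } = {
  t: "tokens",
  s: "canStart",
  e: "canEnd",
  n: "nextTokens",
  p: "previousTokens",
};

// Rename an ngram record's keys for SQLite storage (hypothetical helper);
// keys not present in KeyMap pass through unchanged
function toSQLiteShape(gram: { [key: string]: unknown }): { [key: string]: unknown } {
  const out: { [key: string]: unknown } = {};
  for (const [key, value] of Object.entries(gram)) {
    out[KeyMap[key] ?? key] = value;
  }
  return out;
}
```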
Update: the bot can now also learn from a plain-text file in which each line represents a line that would otherwise have been learned through the normal message process. It is very slow, but it works for now.
Currently there are only two ways to pre-train the bot:

- a `data` directory with the file naming format `./data/[bot name].sql`
- a `resources` directory with the file naming format `./resources/[bot name]-trainer.json`
Without these, it will start with an empty brain until it learns more.
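A startup check for those two locations might look like the following sketch, using Node's `fs` module; the function name is an assumption, not the bot's actual code:

```typescript
import { existsSync } from "fs";

// Return whichever of the two supported pre-training files exist for a bot
// (hypothetical helper; an empty result means the bot starts with an empty brain)
function findPretrainingFiles(botName: string): string[] {
  const candidates = [
    `./data/${botName}.sql`,               // SQLite dump
    `./resources/${botName}-trainer.json`, // trainer JSON
  ];
  return candidates.filter((path) => existsSync(path));
}
```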