mureni / charlies2

stochastic terrorist revived

Create schema for trainer JSON and provide sample trainer file #1

Open mureni opened 3 years ago

mureni commented 3 years ago

Currently there are only two ways to pre-train the bot:

  1. Copy an existing SQLite database in the expected format into the (manually created) data directory, using the file naming format ./data/[bot name].sql
  2. Create a trainer file that meets the (as-yet unspecified) schema in the resources directory, using the file naming format ./resources/[bot name]-trainer.json

Without one of these, the bot starts with an empty brain and only improves as it learns from incoming messages.
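
For context, a rough sketch of what that startup lookup could look like. The function name and file checks here are illustrative assumptions, not the actual loader code:

import { existsSync } from "fs";
import { join } from "path";

// Hypothetical startup check; the real loader may behave differently.
function findPretrainSource(botName: string): string | undefined {
   const dbPath = join("data", `${botName}.sql`);
   const trainerPath = join("resources", `${botName}-trainer.json`);
   if (existsSync(dbPath)) return dbPath;           // option 1: existing SQLite database
   if (existsSync(trainerPath)) return trainerPath; // option 2: trainer JSON file
   return undefined; // neither found: the bot starts with an empty brain
}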

mureni commented 3 years ago

First and foremost, this is a very old format and not well thought out. There are many more efficient and reasonable ways to accomplish these tasks, and future versions are likely to drop support for this schema.

However, until further changes are made, the trainer JSON schema is roughly the following, expressed in TypeScript:

// The word separator is expected to be the box-drawing character '│' (U+2502), NOT the regular pipe character '|'
const WordSeparator = String.fromCharCode(9474); 

interface FrequencyJSON {
   [word: string]: number; // Map of words to the number of times the word should be represented in weighted algorithms
}
interface nGramJSON {
   [hash: string]: {       // Hash is all the tokens joined by the word separator token as shown above
      t: string[],         // String array of this ngram's tokens, used to generate its hash 
      s: boolean,          // Can this ngram start a sentence?
      e: boolean,          // Can this ngram end a sentence?
      n: FrequencyJSON,    // Map of possible next words and their associated frequencies
      p: FrequencyJSON     // Map of possible previous words and their associated frequencies
   }
}
interface LexiconJSON {
   [word: string]: string[];  // Map of unique words to a string array containing all the ngram hashes the word is found in
}

If a trainer file were generated from the text "an example trigram with another trigram", it would be represented in JSON as follows:

{
   "Lexicon": {
      "an": [ "an│example│trigram" ],
      "example": [ "an│example│trigram", "example│trigram│with" ],
      "trigram": [ "an│example│trigram", "example│trigram│with", "trigram│with│another", "with│another│trigram" ],
      "with": [ "example│trigram│with", "trigram│with│another", "with│another│trigram" ],
      "another": [ "trigram│with│another", "with│another│trigram" ]
   },
   "nGrams": {
      "an│example│trigram": {
         "t": [ "an", "example", "trigram" ],
         "s": true,
         "e": false,
         "n": { "with": 1 },
         "p": {}
      },
      /* other trigrams omitted for brevity... */
      "with│another│trigram": {
         "t": [ "with", "another", "trigram" ],
         "s": false,
         "e": true,
         "n": {},
         "p": { "trigram": 1 }
      }
   }
}
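
To make the mapping concrete, here is a minimal sketch of how structures like the above could be built from a line of text, reusing the WordSeparator constant and the interfaces defined earlier. The function name buildTrainerData and the details here are illustrative assumptions, not the project's actual training code:

// Illustrative sketch only; the real trainer generation may differ.
function buildTrainerData(text: string, order = 3): { Lexicon: LexiconJSON, nGrams: nGramJSON } {
   const words = text.trim().split(/\s+/);
   const lexicon: LexiconJSON = {};
   const ngrams: nGramJSON = {};
   for (let i = 0; i + order <= words.length; i++) {
      const tokens = words.slice(i, i + order);
      const hash = tokens.join(WordSeparator);
      const ngram = ngrams[hash] ?? (ngrams[hash] = { t: tokens, s: false, e: false, n: {}, p: {} });
      ngram.s = ngram.s || i === 0;                    // can start a sentence
      ngram.e = ngram.e || i + order === words.length; // can end a sentence
      const next = words[i + order];
      if (next !== undefined) ngram.n[next] = (ngram.n[next] ?? 0) + 1;
      const prev = words[i - 1];
      if (prev !== undefined) ngram.p[prev] = (ngram.p[prev] ?? 0) + 1;
      for (const word of tokens) {
         const hashes = lexicon[word] ?? (lexicon[word] = []);
         if (!hashes.includes(hash)) hashes.push(hash);
      }
   }
   return { Lexicon: lexicon, nGrams: ngrams };
}

Calling buildTrainerData("an example trigram with another trigram") yields exactly the structure serialized above.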

As for how the SQLite database is formatted: it's similar, but replaces t with tokens, s with canStart, e with canEnd, n with nextTokens, and p with previousTokens, and it adds a property __ctor whose value is either Map or Set, depending on whether the entry is an array of keyed objects (Map) or an array of strings (Set).
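
As a sketch of what deserialization might look like under that convention (the exact serialized shape isn't spelled out here, so the entries property below and its layout are assumptions):

// Assumed shape: { __ctor: "Map" | "Set", entries: ... }; the real format may differ.
function reviveStored(raw: string): unknown {
   return JSON.parse(raw, (_key, value) => {
      if (value && typeof value === "object" && "__ctor" in value) {
         if (value.__ctor === "Map") return new Map(value.entries ?? []); // assumes an array of [key, value] pairs
         if (value.__ctor === "Set") return new Set(value.entries ?? []); // assumes an array of strings
      }
      return value;
   });
}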

mureni commented 2 years ago

Update: the bot can now also learn from a plain text file, where each line is treated as if it had arrived through the normal message-learning process. It is very slow, but it works for now.
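
A minimal sketch of that plain-text path, where learnLine is a placeholder for whatever the real message-learning entry point is:

import { createReadStream } from "fs";
import { createInterface } from "readline";

// Hypothetical helper: feeds each non-empty line of a text file through the
// same learning path a live chat message would take.
async function trainFromTextFile(path: string, learnLine: (line: string) => Promise<void>): Promise<void> {
   const reader = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
   for await (const line of reader) {
      if (line.trim().length > 0) await learnLine(line); // one line = one learned "message"
   }
}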