jliszka / probability-monad

Markov Chain Text Generator #13

Closed jeffusan closed 7 years ago

jeffusan commented 7 years ago

Hi, I really like this project. I used it to generate test data on a project recently.

I created a simple Markov chain text generator this morning.

I considered using probability-monad's markov function to build my text generator, but couldn't think of a way to do it. I'm just curious whether you might suggest a way to implement it.

It looks like I should provide markov with a function of type String => Distribution[String]: the String is the current word, and the result is a distribution over the possible next words.
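For example, I imagine something like this (a hypothetical sketch: frequencyTable and the sample words are my own, built from some corpus, and I'm assuming a discrete constructor that takes (value, weight) pairs; if there isn't one, something like discreteUniform over the observed next words would do):

  import probability_monad.Distribution
  import probability_monad.Distribution._

  // Hypothetical frequency table built from a corpus: each word maps to a
  // distribution over the words observed to follow it.
  val frequencyTable: Map[String, Distribution[String]] = Map(
    "the" -> discrete("cat" -> 0.6, "dog" -> 0.4),
    "cat" -> discrete("sat" -> 1.0),
    "dog" -> discrete("ran" -> 1.0)
  )

  // The transition function markov expects: current word => distribution
  // over possible next words.
  def nextWord(current: String): Distribution[String] = frequencyTable(current)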

But what would the resulting Distribution[A] represent in this case? How would I take the result and generate text? distribution.sample(X)?

Thanks for providing this fascinating library.

jliszka commented 7 years ago

Thanks for the kind words! Glad you found it useful.

The current markov function only keeps track of the current state, and it sounds like you also want to keep track of the sequence of states that led you to the current state. So I might approach this in one of two ways:

  1. Make your distribution a distribution over Seq[String]. Your transition function f: Seq[String] => Distribution[Seq[String]] would take the last element of the list, look that word up in the frequency map, and append the sampled next word to the end of the list. Then your final "sentence" is a single sample from that distribution (see the usage sketch after option 2):
always(Vector("the")).markov(words => !frequencyTable.contains(words.last))(words => {
  frequencyTable(words.last).map(newWord => words :+ newWord)
})
  2. Modify markov to keep track of the history of states, similar to what you're doing:
  final def markov(pred: A => Boolean)(f: A => Distribution[A]): Distribution[Seq[A]] = new Distribution[Seq[A]] {
    override def get = {
      @tailrec
      def helper(a: A, history: Seq[A]): Seq[A] = {
        if (pred(a)) history
        else helper(f(a).get, history :+ a)
      }
      helper(self.get, Vector())
    }
  }
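Either way, generating text is then just taking a single sample. A rough usage sketch, assuming a frequencyTable: Map[String, Distribution[String]] as described above; the first snippet uses the library's existing markov, the second assumes you've modified it as in option 2:

  // Approach 1: the distribution is over whole sentences already.
  val sentenceDist: Distribution[Vector[String]] =
    always(Vector("the"))
      .markov(words => !frequencyTable.contains(words.last))(words =>
        frequencyTable(words.last).map(newWord => words :+ newWord))

  // One sample is one sentence.
  val sentence1: String = sentenceDist.sample(1).head.mkString(" ")

  // Approach 2: with the modified markov, start from a single word and
  // get back a Distribution[Seq[String]] directly.
  val sentence2: String =
    always("the")
      .markov(word => !frequencyTable.contains(word))(word => frequencyTable(word))
      .sample(1).head.mkString(" ")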

I might prefer the second approach because it keeps you "honest" in preserving the Markov property: the next state depends only on the current state. However, you're also free to define the current state as "all the words generated so far" instead of just the last word, in which case the first approach is fine. It's also perfectly valid, and often desirable, to look at the last 2 words in the sentence to determine the next word (a sketch of that follows below).
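For the last-2-words variant, something along these lines with approach 1 (bigramTable is hypothetical, a Map[(String, String), Distribution[String]] you'd build from your corpus):

  // Hypothetical: distributions over next words, keyed by the previous two words.
  val bigramTable: Map[(String, String), Distribution[String]] = ???

  def lastTwo(words: Vector[String]): (String, String) =
    (words(words.size - 2), words.last)

  val bigramSentence: Distribution[Vector[String]] =
    always(Vector("the", "quick"))  // seed with two words
      .markov(words => !bigramTable.contains(lastTwo(words)))(words =>
        bigramTable(lastTwo(words)).map(next => words :+ next))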

Hope that helps!

jeffusan commented 7 years ago

That's a very valid point about the first approach; I'm inclined to implement the second because I could easily add functionality through the helper function. That helps a lot, thank you!