jsvine / markovify

A simple, extensible Markov chain generator.
MIT License
3.31k stars, 350 forks

Brackets and Speechmarks #186

Closed keab42 closed 7 months ago

keab42 commented 7 months ago

I'm experimenting with importing a variety of different texts for a pet project.

I've been working on consuming and sanitising some of the text, but with NLTK tagging activated, I've run into what appears to be an issue parsing text that contains brackets, square brackets, or double or single quotes.

I can easily remove these from the text, but it would be nice to be able to preserve this punctuation if possible.

I did try escaping them with something like str.replace("[", "\["), but that did not seem to help.

For example:

Input string: "[Babbles back] Sixty seconds."

Error stack trace:

  File "Test.py", line 42, in generate_model
    model = POSifiedText(model_text, state_size = self.state_size)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "%APPDATA%\Roaming\Python\Python312\site-packages\markovify\text.py", line 65, in __init__
    self.chain = chain or Chain(self.parsed_sentences, state_size)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "%APPDATA%\Roaming\Python\Python312\site-packages\markovify\chain.py", line 53, in __init__
    self.precompute_begin_state()
  File "%APPDATA%\Roaming\Python\Python312\site-packages\markovify\chain.py", line 102, in precompute_begin_state
    choices, cumdist = compile_next(self.model[begin_state])
                                    ~~~~~~~~~~^^^^^^^^^^^^^
KeyError: ('___BEGIN__', '___BEGIN__')

My POSifiedText class looks like this:

import markovify
import nltk
import re

# Download the tagger data that nltk.pos_tag needs
nltk.download('averaged_perceptron_tagger')

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        # Split into words, then tag each one as "word::TAG"
        words = re.split(self.word_split_pattern, sentence)
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        # Drop the "::TAG" suffix before rejoining into a sentence
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

keab42 commented 7 months ago

Ah. I thought I'd checked all the closed issues thoroughly, but this is a duplicate of https://github.com/jsvine/markovify/issues/84

Should be easy enough to adjust my input texts to exclude ones with only one sentence.
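
For anyone landing here with the same KeyError, a minimal sketch of that pre-filter might look like the following. The helper names (count_sentences, usable_texts) and the minimum of two sentences are assumptions for illustration and are not part of markovify. The splitter is deliberately naive (it breaks after ., ! or ? followed by whitespace) and will not match markovify's own sentence splitting exactly, so treat it as a rough guard rather than a guarantee.

```python
import re

# Hypothetical helpers; nothing here is part of markovify itself.
# Naive splitter: break after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def count_sentences(text):
    """Return a rough count of the sentences in `text`."""
    parts = _SENTENCE_END.split(text.strip())
    return sum(1 for part in parts if part)


def usable_texts(texts, minimum=2):
    """Keep only the texts with at least `minimum` sentences (assumed threshold)."""
    return [text for text in texts if count_sentences(text) >= minimum]


samples = [
    "[Babbles back] Sixty seconds.",
    "It was late. Nobody answered the door.",
]

# Only the two-sentence sample survives the filter.
kept = usable_texts(samples)
```

The surviving texts can then be passed to POSifiedText as before. Abbreviations such as "Dr. Smith" will be over-counted by this splitter, so a stricter check may be needed for messy input.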