sloria / TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
https://textblob.readthedocs.io/
MIT License

Tokenizing sentences in quotations #109

Open dagrha opened 8 years ago

dagrha commented 8 years ago

I love TextBlob, thank you so much for making this awesome Python tool :+1:

I am wondering if there is a solution to a tokenization issue I'm seeing. Here's some example code with an excerpt from Game of Thrones to demonstrate the issue:

In [1]: from textblob import TextBlob
In [2]: text = TextBlob('“We should start back,” Gared urged as the woods began to grow dark around them. “The wildlings are dead.” “Do the dead frighten you?” Ser Waymar Royce asked with just the hint of a smile. Gared did not rise to the bait. He was an old man, past fifty, and he had seen the lordlings come and go. “Dead is dead,” he said. “We have no business with the dead.” “Are they dead?” Royce asked softly.')
In [3]: text.sentences

And here's the output when I call the sentences attribute:

Out[3]: 
[Sentence("“We should start back,” Gared urged as the woods began to grow dark around them."),
 Sentence("“The wildlings are dead.” “Do the dead frighten you?” Ser Waymar Royce asked with just the hint of a smile."),
 Sentence("Gared did not rise to the bait."),
 Sentence("He was an old man, past fifty, and he had seen the lordlings come and go."),
 Sentence("“Dead is dead,” he said."),
 Sentence("“We have no business with the dead.” “Are they dead?” Royce asked softly.")]

The issue here is that TextBlob treats consecutive quoted sentences as a single sentence. The second "sentence" above demonstrates this:

Sentence("“The wildlings are dead.” “Do the dead frighten you?” Ser Waymar Royce asked with just the hint of a smile.")

should instead be:

Sentence("“The wildlings are dead.”")
Sentence("“Do the dead frighten you?” Ser Waymar Royce asked with just the hint of a smile.")

The same is the case for the last example sentence I've shown:

Sentence("“We have no business with the dead.” “Are they dead?” Royce asked softly.")

It seems that TextBlob does not end a sentence at a closing quotation mark. In other words, “We have no business with the dead.” is its own sentence, but TextBlob tokenizes it together with the text that follows: “Are they dead?” Royce asked softly.

Is there a way to avoid this, that is, to force TextBlob to treat an occurrence of .” as the end of a sentence?

ghost commented 7 years ago

@dagrha You might consider splitting the text with .split first, running the sentence tokenizer over each resulting string (a single quotation obviously comes back as a single sentence), and then merging the returned lists.
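
A minimal sketch of that split-then-tokenize idea, using only the standard library so it runs on its own. The names `split_quoted_sentences` and `naive_sentences` are hypothetical helpers, not part of TextBlob, and the boundary rule (sentence-final punctuation, a closing quote, whitespace, then an opening quote) is an assumption that fits the excerpt above and may not suit other texts.

```python
import re

# Assumed boundary: ., ! or ? followed by a closing quote, then whitespace,
# then an opening quote. The whitespace between the two quotes is consumed.
_QUOTE_BOUNDARY = re.compile(r'(?<=[.!?][”"])\s+(?=[“"])')


def naive_sentences(text):
    """Stand-in tokenizer: split on whitespace that directly follows ., ! or ?."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]


def split_quoted_sentences(text, tokenize=naive_sentences):
    """Pre-split text between adjacent quotations, tokenize each chunk, merge."""
    sentences = []
    for chunk in _QUOTE_BOUNDARY.split(text):
        sentences.extend(tokenize(chunk))
    return sentences


text = '“The wildlings are dead.” “Do the dead frighten you?” Ser Waymar Royce asked with just the hint of a smile.'
for sentence in split_quoted_sentences(text):
    print(sentence)
```

The `tokenize` argument accepts any callable that returns a list of strings, so a thin wrapper around `TextBlob(chunk).sentences` could presumably be passed in place of the stand-in; that wiring is untested here.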