sloria / TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
https://textblob.readthedocs.io/
MIT License

Advanced usage of tokenizer for sentence tokenization #90

Open nmstoker opened 9 years ago

nmstoker commented 9 years ago

I may have misunderstood the intent of the section under Advanced Usage / Tokenizers (https://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced), but I cannot get the tokenizer I pass in to work with blob.sentences. Is that the intended behaviour? It would be helpful if it did work, because once you've passed in a different tokenizer, none of the rest of your code would need to change. I can live without it if it's not practical or possible.

My ultimate intent is to make sentence splitting aware of some abbreviations that it currently mistakes for sentence ends ("min." and "max."). I found how to do this within NLTK itself (here: http://stackoverflow.com/questions/14095971/how-to-tweak-the-nltk-sentence-tokenizer ) and was able to combine it with TextBlob by following the advanced usage section, but .sentences still doesn't pick up the new tokenizer. Here's the code:

from textblob import TextBlob
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Tell Punkt that "min." and "max." are abbreviations, not sentence ends
punkt_param = PunktParameters()
punkt_param.abbrev_types = {'min', 'max'}
tokenizer = PunktSentenceTokenizer(punkt_param)
blob = TextBlob("""This is a normal sentence with but with an abbreviation of min. in it. This one has max. in it too. This has none.""", tokenizer=tokenizer)
# .sentences: splits after "min." and "max." (custom tokenizer not picked up)
for s1 in blob.sentences:
    print(s1)
# .tokens: uses the custom tokenizer
for s2 in blob.tokens:
    print(s2)

This produces the output shown in the screenshot. Is there anything obvious I'm doing wrong or overlooking?

Many thanks, Neil

[screenshot of the output]

nmstoker commented 9 years ago

One additional point: it doesn't seem to be as simple as using .tokens instead of .sentences, since the items returned by .tokens no longer support several of the properties and methods that the sentence objects have.