I may have misunderstood the intent of the section under Advanced Usage / Tokenizers (https://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced), but I cannot get my passed-in tokenizer to work with blob.sentences. Is that the intended behaviour? (It would be helpful, because once you've passed in a different tokenizer, none of the rest of your code would need to change — but I can live without it if it's not practical or possible.)
My ultimate intent is to make sentence splitting aware of some abbreviations it currently mistakes for sentence ends ("min." and "max."). I found how to do this within NLTK itself (here: http://stackoverflow.com/questions/14095971/how-to-tweak-the-nltk-sentence-tokenizer ) and was able to combine that with the advanced-usage instructions, but .sentences still doesn't pick up the new tokenizer. Here's the code:
from textblob import TextBlob
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Register "min" and "max" as abbreviations so Punkt won't split on them
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['min', 'max'])
tokenizer = PunktSentenceTokenizer(punkt_param)

blob = TextBlob("""This is a normal sentence but with an abbreviation of min. in it. This one has max. in it too. This has none.""", tokenizer=tokenizer)
print(blob.tokens)

for s1 in blob.sentences:  # still splits on "min." / "max."
    print(s1)
for s2 in blob.tokens:     # the custom tokenizer is applied here
    print(s2)
which results in the output shown in the screenshot. Is there anything obvious I'm doing wrong or overlooking?
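For reference, the custom tokenizer does behave as expected when called directly through NLTK, so the problem seems confined to how the passed-in tokenizer is (or isn't) wired into .sentences. A minimal NLTK-only check, using the same sample text:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Same setup as above: register "min" and "max" as abbreviations
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['min', 'max'])
tokenizer = PunktSentenceTokenizer(punkt_param)

text = ("This is a normal sentence but with an abbreviation of min. in it. "
        "This one has max. in it too. This has none.")

# Calling the tokenizer directly respects the abbreviations,
# so this yields three sentences rather than five.
for sentence in tokenizer.tokenize(text):
    print(sentence)
```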
One additional point: it seems it wouldn't be as simple as just using .tokens instead of .sentences, since several of the properties and methods no longer work in that case.
Many thanks, Neil