alvations opened this issue 6 years ago
If we make the following changes to word_tokenize at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py, it would achieve behavior similar to Stanford CoreNLP's:
```python
import re

from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

# See discussion on
# - https://github.com/nltk/nltk/pull/1437
# - https://github.com/nltk/nltk/issues/1995
# Adding to TreebankWordTokenizer, the splits on
# - chevron quotes u'\xab' and u'\xbb'
# - unicode quotes u'\u2018', u'\u2019', u'\u201c' and u'\u201d'
# - opening single quotes if the token that follows isn't a clitic
improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+)', re.U)
improved_open_single_quote_regex = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)
improved_close_quote_regex = re.compile(u'([»”’])', re.U)
improved_punct_regex = re.compile(r'([^\.])(\.)([\]\)}>"\'' u'»”’ ' r']*)\s*$', re.U)

_treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
_treebank_word_tokenizer.STARTING_QUOTES.append((improved_open_single_quote_regex, r'\1 \2'))
_treebank_word_tokenizer.ENDING_QUOTES.insert(0, (improved_close_quote_regex, r' \1 '))
_treebank_word_tokenizer.PUNCTUATION.insert(0, (improved_punct_regex, r'\1 \2 \3 '))


def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: an option to keep the sentence as-is and not sentence-tokenize it
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
```
[out]:

```python
>>> print(word_tokenize("The 'v', I've been fooled but I'll seek revenge."))
['The', "'", 'v', "'", ',', 'I', "'ve", 'been', 'fooled', 'but', 'I', "'ll", 'seek', 'revenge', '.']
>>> word_tokenize("'v' 're'")
["'", 'v', "'", "'re", "'"]
```
The above regex hack will cover the following clitics:
're
've
'll
'd
't
's
'm
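As a quick sanity check, the lookahead in the proposed `improved_open_single_quote_regex` can be exercised directly; this is a minimal sketch using the regex exactly as written above:

```python
import re

# The proposed opening-single-quote regex: it matches (and therefore pads)
# a quote only when the token that follows is NOT one of the known clitics.
pattern = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)

# Clitics are left attached to the quote (no match, so no padding)...
for clitic in ["'re", "'ve", "'ll", "'d", "'t", "'s", "'m"]:
    assert pattern.match(clitic) is None

# ...while a non-clitic like 'v triggers a match, so the quote gets split off.
assert pattern.match("'v'") is not None
```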
Are there more clitics that should be added?
What about the ending single quotes that appear at the end of possessive plurals, like "providers'"?
Do "readers' " need to be tokenized to "readers" and "'"? Also, what is the status of this bug so far? If the change mentioned above has not been implemented, I'd like to take up this issue
This issue is about the opening quotes; the clitic fix for that can be done easily, and it will make word_tokenize behave like Stanford's. IMHO, it's a good feature to have.
Feel free to contribute and open a pull-request on it =)
But handling the possessive plural is hard, because we would need to distinguish between sentences like:
The providers' CEO went on a holiday.
He said, 'The CEO has fired the providers'.
Breaking news: 'Down to all providers'
The 'internet providers' have gone on a holiday.
There are too many instances where a possessive plural can be confused with a closing single quote. For this use of single quotes, I don't think it's worth fixing.
```python
import nltk
import codecs

file = codecs.open('new2.txt', 'r', 'utf8')
fh = file.readlines()  # ['సతతహరిత', 'సమశీతోష్ణ', ' అడవి-ఇల్లు*అడవి '] - the line is stored in new1.txt
for line in fh:
    l = nltk.tokenize.word_tokenize(line)
    print(l)   # ['[', "'సతతహరిత", "'", ',', "'సమశీతోష్ణ", "'", ',', "'", 'అడవి-ఇల్లు', '*', 'అడవి', "'", ']']
    ll = []    # to store the new, updated token list
    for i in l:
        if i[0] == '\'':
            ix = i.replace('\'', '')
            ll.append(ix)
        else:
            ll.append(i)
    print(ll)  # ['[', 'సతతహరిత', '', ',', 'సమశీతోష్ణ', '', ',', '', 'అడవి-ఇల్లు', '*', 'అడవి', '', ']']
```
I just post-processed the result of the NLTK word tokenizer again; it solved my problem.
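The same post-processing can be written more compactly; here is a minimal sketch (the function name `strip_single_quotes` is mine, and unlike the loop above it also drops the empty strings that the quote removal leaves behind):

```python
def strip_single_quotes(tokens):
    """Remove stray single quotes from each token and drop tokens that become empty."""
    cleaned = [t.replace("'", "") for t in tokens]
    return [t for t in cleaned if t]

# A small ASCII stand-in for the tokenizer output above (hypothetical input):
tokens = ['[', "'foo", "'", ',', "'bar", ']']
print(strip_single_quotes(tokens))  # ['[', 'foo', ',', 'bar', ']']
```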
Was there a regression from https://github.com/nltk/nltk/pull/2018, or did that PR not fix the issue?
nltk version: 3.8.1
python version: 3.10.12
```python
from nltk.tokenize import word_tokenize

sentence = "I've said many times, 'We'll make it through!'"
word_tokenize(sentence)
```
Expected: ['I', "'ve", 'said', 'many', 'times', ',', "'", 'We', "'ll", 'make', 'it', 'through', '!', "'"]
Actual: ['I', "'ve", 'said', 'many', 'times', ',', "'We", "'ll", 'make', 'it', 'through', '!', "'"]
It looks like that "fix" was not much of a fix, and the test is not a good test of what was actually requested on this issue. The test only checks this input:
"The 'v', I've been fooled but I'll seek revenge."
It takes `'v'` and makes it into three tokens: `'`, `v`, and another `'`. That part seems okay. But other simple test strings don't tokenize properly:

"The 'blah', I've been fooled but I'll seek revenge."

Here `'blah'` is tokenized into `'blah` and `'`. So the issue is still not resolved.
`word_tokenize` keeps the opening single quote and doesn't pad it with a space; this is to make sure that clitics like `'ll`, `'ve`, etc. get tokenized properly. The original Treebank tokenizer has the same behavior, but Stanford CoreNLP doesn't. It looks like some additional regex was put in to make sure that the opening single quote gets padded with spaces when it isn't followed by a clitic.
There should be a regex with a (non-capturing) negative lookahead for the clitics, so that the non-clitic cases are caught and the quote is padded with spaces.
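A minimal sketch of such a padding rule, as a standalone substitution (the name `pad_open_single_quotes` and the exact lookbehind/lookahead are my assumptions, not NLTK's actual patch):

```python
import re

# Pad a single quote with spaces when it does not open a clitic: the quote
# must not be preceded by a word character (so we don't touch the one inside
# "We'll"), and must not be followed by one of the known clitics.
OPEN_QUOTE = re.compile(r"(?i)(?<!\w)'(?!(?:re|ve|ll|m|t|s|d)\b)")

def pad_open_single_quotes(text):
    return OPEN_QUOTE.sub(" ' ", text)

print(pad_open_single_quotes("'We'll make it through!'").split())
# → ["'", "We'll", 'make', 'it', 'through!', "'"]
```

With the quotes padded, a downstream whitespace or Treebank-style split separates them into their own tokens, while `We'll` is left for the clitic rules.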
Details on https://stackoverflow.com/questions/49499770/nltk-word-tokenizer-treats-ending-single-quote-as-a-separate-word/49506436#49506436