nltk / nltk

NLTK Source
https://www.nltk.org
Apache License 2.0

word_tokenize keeps the opening single quotes and doesn't pad it with space #1995

Open alvations opened 6 years ago

alvations commented 6 years ago

word_tokenize keeps the opening single quote attached and doesn't pad it with a space; this is to make sure that the clitics get tokenized as 'll, 've, etc.

The original treebank tokenizer has the same behavior, but Stanford CoreNLP doesn't: it looks like some additional regex was put in there to make sure that opening single quotes get padded with spaces when they aren't followed by clitics.

There should be a (non-capturing) lookahead regex to catch the non-clitic cases and pad the quote with a space; a rough sketch follows the link below.

Details on https://stackoverflow.com/questions/49499770/nltk-word-tokenizer-treats-ending-single-quote-as-a-separate-word/49506436#49506436
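A minimal sketch of the idea (simplified; a fuller patch follows in the next comment): a negative lookahead lets the clitics through and pads the quote otherwise:

import re

# Simplified sketch of the proposed rule: pad "'" with a space unless one
# of the known clitics ('re, 've, 'll, 'm, 't, 's, 'd) follows it.
pad_open_quote = re.compile(r"(?i)(')(?!re|ve|ll|m|t|s|d)(\w)")
print(pad_open_quote.sub(r'\1 \2', "'v' but I've been fooled"))
# ' v' but I've been fooled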

alvations commented 6 years ago

If we make the following changes to word_tokenize at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py, it would achieve behavior similar to Stanford CoreNLP's:

import re

# sent_tokenize is defined in nltk/tokenize/__init__.py itself; this import
# is only needed to run the snippet standalone.
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

# See discussion on 
#     - https://github.com/nltk/nltk/pull/1437
#     - https://github.com/nltk/nltk/issues/1995
# Adding to TreebankWordTokenizer, the splits on
#     - chevron quotes u'\xab' and u'\xbb'.
#     - unicode quotes u'\u2018', u'\u2019', u'\u201c' and u'\u201d'
#     - opening single quotes if the token that follows isn't a clitic

improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+)', re.U)
improved_open_single_quote_regex = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)
improved_close_quote_regex = re.compile(u'([»”’])', re.U)
improved_punct_regex = re.compile(r'([^\.])(\.)([\]\)}>"\'' u'»”’ ' r']*)\s*$', re.U)
_treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
_treebank_word_tokenizer.STARTING_QUOTES.append((improved_open_single_quote_regex, r'\1 \2'))
_treebank_word_tokenizer.ENDING_QUOTES.insert(0, (improved_close_quote_regex, r' \1 '))
_treebank_word_tokenizer.PUNCTUATION.insert(0, (improved_punct_regex, r'\1 \2 \3 '))

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to keep the text as a single sentence instead of sentence-tokenizing it.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]

[out]:

>>> print(word_tokenize("The 'v', I've been fooled but I'll seek revenge."))
['The', "'", 'v', "'", ',', 'I', "'ve", 'been', 'fooled', 'but', 'I', "'ll", 'seek', 'revenge', '.']
>>> word_tokenize("'v' 're'")
["'", 'v', "'", "'re", "'"]

The above regex hack will cover the following clitics:

're
've
'll
'd
't
's
'm

Are there more clitics that should be added?
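For what it's worth, extra clitics could be slotted into the lookahead's alternation like this (a sketch only; 'em, as in "give 'em", is my own assumption, not part of the patch above):

import re

# Hypothetical sketch: build the negative lookahead from a clitic list;
# 'em' here is an assumed extra clitic, not something from the patch above.
CLITICS = ['re', 've', 'll', 'm', 't', 's', 'd', 'em']
open_single_quote_regex = re.compile(r"(?i)(')(?!%s)(\w)\b" % '|'.join(CLITICS), re.U)
print(open_single_quote_regex.sub(r'\1 \2', "give 'em the 'v'"))
# give 'em the ' v'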

Lingviston commented 6 years ago

What about the ending single quotes, which appear at the end of the possessive form of plurals? Like "providers'".

djinn-anthrope commented 6 years ago

Do "readers' " need to be tokenized to "readers" and "'"? Also, what is the status of this bug so far? If the change mentioned above has not been implemented, I'd like to take up this issue

alvations commented 6 years ago

This issue is about the opening quotes; the clitic fix for that can easily be done, and it'll make word_tokenize behave like Stanford's. IMHO, I think it's a good feature to have.

Feel free to contribute and open a pull-request on it =)


But handling the possessive plural is hard, because we would need to define the difference between sentences like:

The providers' CEO went on a holiday.
He said, 'The CEO has fired the providers'.
Breaking news: 'Down to all providers'
The 'internet providers' have gone on a holiday.

There are too many instances where a possessive plural can be confused with a closing single quote. For this use of the single quote in the plural possessive, I don't think it's worth fixing.
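To make the ambiguity concrete: a naive closing-quote rule that pads any word-final single quote (my own sketch, not NLTK code) splits the possessive and the closing quote identically:

import re

# Naive sketch: pad any "'" that ends a word. The pattern cannot tell the
# plural possessive from a closing quote, so both get split the same way.
naive_close_quote = re.compile(r"(\w)(')(?=[\s.,!?]|$)")
print(naive_close_quote.sub(r'\1 \2', "The providers' CEO went on a holiday."))
# The providers ' CEO went on a holiday.          <- wrong: possessive split
print(naive_close_quote.sub(r'\1 \2', "He said, 'The CEO has fired the providers'."))
# He said, 'The CEO has fired the providers '.    <- right: closing quote split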

durgaprasad-palanati-AI commented 4 years ago

import nltk
import codecs

# new1.txt contains the line: ['సతతహరిత', 'సమశీతోష్ణ', ' అడవి-ఇల్లు*అడవి ']
# new2.txt contains the line: the king's cat is caught with kit's.
file = codecs.open('new2.txt', 'r', 'utf8')
fh = file.readlines()

for line in fh:
    l = nltk.tokenize.word_tokenize(line)
    print(l)
    # new1.txt: ['[', "'సతతహరిత", "'", ',', "'సమశీతోష్ణ", "'", ',', "'", 'అడవి-ఇల్లు', '*', 'అడవి', "'", ']']
    # new2.txt: ['\ufeffthe', 'king', "'s", 'cat', 'is', 'caught', 'with', 'kit', "'s", '.']

    ll = []  # to store the new, updated token list
    for i in l:
        if i[0] == '\'':
            ix = i.replace('\'', '')
            ll.append(ix)
        else:
            ll.append(i)

    # the updated, correct ones:
    print(ll)
    # new1.txt: ['[', 'సతతహరిత', '', ',', 'సమశీతోష్ణ', '', ',', '', 'అడవి-ఇల్లు', '*', 'అడవి', '', ']']
    # new2.txt: ['\ufeffthe', 'king', 's', 'cat', 'is', 'caught', 'with', 'kit', 's', '.']

durgaprasad-palanati-AI commented 4 years ago

I just post-processed the result of the NLTK word tokenizer, and it solved my problem.

But it may not be an optimal solution; the NLTK library itself needs to be updated.
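For what it's worth, the same post-processing can be written as a one-line helper (a sketch with the same behavior as the loop above, including keeping the empty strings left over from quote-only tokens):

def strip_single_quotes(tokens):
    # mirrors the loop above: strip every "'" from tokens that start with
    # one; all other tokens pass through unchanged
    return [t.replace("'", "") if t.startswith("'") else t for t in tokens]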

th0rntwig commented 5 months ago

Was there a regression from https://github.com/nltk/nltk/pull/2018, or did that PR not fix the issue?

nltk version: 3.8.1
python version: 3.10.12

from nltk.tokenize import word_tokenize

sentence = "I've said many times, 'We'll make it through!'"
word_tokenize(sentence)

Expected: ['I', "'ve", 'said', 'many', 'times', ',', "'", 'We', "'ll", 'make', 'it', 'through', '!', "'"]
Actual: ['I', "'ve", 'said', 'many', 'times', ',', "'We", "'ll", 'make', 'it', 'through', '!', "'"]

BrenBarn commented 2 days ago

It looks like that "fix" was not much of a fix, and the test is not a good test of what was actually requested on this issue. The test only checks this input:

"The 'v', I've been fooled but I'll seek revenge."

It takes 'v' and makes it into three tokens: ', v, and another '. That part seems okay. But other simple test strings don't tokenize properly:

"The 'blah', I've been fooled but I'll seek revenge."

Here 'blah' is tokenized into 'blah and '. So the issue is still not resolved.
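The behavior is consistent with the trailing \b in improved_open_single_quote_regex from the patch earlier in this thread: (\w)\b only matches when exactly one word character follows the quote, so 'v' gets padded but 'blah (and the 'We in the previous comment) does not. Checking that pattern in isolation:

import re

# The opening-single-quote rule from the patch above; note the trailing \b
# after (\w), which limits the match to one-character words like 'v'.
rx = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)
print(rx.sub(r'\1 \2', "The 'v', I've been fooled."))     # The ' v', I've been fooled.
print(rx.sub(r'\1 \2', "The 'blah', I've been fooled."))  # unchanged
print(rx.sub(r'\1 \2', "'We'll make it through!'"))       # unchanged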