Closed tradingaddict closed 1 year ago
Hello!
I think I've narrowed this down a bit: With NLTK version 3.5:
>>> detokenize(["He", "said", "''", "heya", "!", "''", "yesterday", "."])
"He said ''heya!'' yesterday."
With NLTK version 3.6.7:
>>> detokenize(["He", "said", "''", "heya", "!", "''", "yesterday", "."])
'He said"heya!" yesterday.'
I'll work on a quick fix to improve the detokenizer's output.
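I've got this quick testing script:
# Assumption: tokenize/detokenize are NLTK's Treebank tokenizer and detokenizer.
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer

tokenize = TreebankWordTokenizer().tokenize
detokenize = TreebankWordDetokenizer().detokenize

for sentence in [
    "Hello, you're Tom!",
    'He said "heya!" yesterday.',
    "He said 'heya!' yesterday.",
    "He said ''heya!'' yesterday.",
    "He's doing well, I think.",
]:
    tokens = tokenize(sentence)       # sentence -> list of tokens
    detokenized = detokenize(tokens)  # tokens -> sentence (round trip)
    print(detokenized)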
The new output is:
Hello, you're Tom!
He said "heya!" yesterday.
He said 'heya! 'yesterday.
He said "heya!" yesterday.
He's doing well, I think.
versus the old output:
Hello, you're Tom!
He said"heya!" yesterday.
He said 'heya! 'yesterday.
He said"heya!" yesterday.
He's doing well, I think.
(Note: using NLTK 3.6.7)
Thank you for reporting this!
Closed via f994465f0f5c9304ebb9830926f3df130cf9643a
The detokenizer isn't prepending a space before opening quotes, as the examples in Tokenizer.py say it should. If I pass one of those examples to the detokenizer:
["He", "said", "''", "heya", "!", "''", "yesterday", "."]
it returns:
He said"heya!" yesterday.
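For reference, here is a minimal sketch of the spacing that example seems to expect. This is not NLTK's implementation; `toy_detokenize` is a hypothetical helper that only assumes quote tokens alternate between opening and closing, and that closing punctuation attaches to the preceding token.

```python
def toy_detokenize(tokens):
    """Join tokens, treating quote tokens as alternating open/close marks.

    Hypothetical illustration only; NLTK's detokenizer works differently.
    """
    quote_tokens = {"``", "''", '"'}
    closing_punct = {".", ",", "!", "?", ";", ":"}
    out = []
    inside_quote = False
    glue_next = False  # True right after an opening quote
    for tok in tokens:
        if tok in quote_tokens:
            if inside_quote:
                # Closing quote: attach directly to the previous token.
                out.append('"')
            else:
                # Opening quote: space before it (unless at the start),
                # and no space between it and the next token.
                out.append(' "' if out else '"')
                glue_next = True
            inside_quote = not inside_quote
            continue
        if tok in closing_punct or glue_next or not out:
            out.append(tok)
        else:
            out.append(" " + tok)
        glue_next = False
    return "".join(out)


print(toy_detokenize(["He", "said", "''", "heya", "!", "''", "yesterday", "."]))
```

The point is just that the opening quote keeps a space before it, while the closing quote stays attached to the text it closes.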