nltk / nltk

NLTK Source
https://www.nltk.org
Apache License 2.0

Constructing Original Text from Sentences #3138

Open pascal-mueller opened 1 year ago

pascal-mueller commented 1 year ago

Hello,

so I'm using the English tokenizer and noticed that the sentences I get back are trimmed. I was wondering whether I can somehow reliably reconstruct the original text from the tokens.

If not, would it be possible to write my own "pickle" file that doesn't do the trimming, or to override the default tokenizer? Or maybe there are hooks or a plugin system I didn't see?

Thanks in advance.

tomaarsen commented 1 year ago

You can use the NLTK TreebankWordDetokenizer for this; see also this SO answer. However, an exact reconstruction is not guaranteed, as the tokenization process is not bijective.
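To illustrate the lossiness, here is a minimal sketch (the example text is my own): the doubled space and the contraction are normalized during tokenization, so the round trip does not reproduce the input exactly.

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

# The doubled space is normalized away during tokenization,
# and "doesn't" is split into "does" + "n't".
text = "James'  book doesn't cost much."
tokens = TreebankWordTokenizer().tokenize(text)

# The detokenizer reattaches contractions and possessives by rule,
# but it cannot recover the original whitespace.
restored = TreebankWordDetokenizer().detokenize(tokens)

print(tokens)
print(restored)
```

The detokenizer's rules get close, but any whitespace variation in the input is gone from the token list and cannot come back.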

pascal-mueller commented 1 year ago

Couldn't I enable a tracking feature that basically acts as a bijective map, e.g. the start and end index of each token in the original string? If not, could I implement it myself rather easily? If so, where would that go?
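NLTK's TreebankWordTokenizer already exposes span_tokenize, which yields (start, end) character offsets into the original string — effectively the map described above. A minimal sketch (example text is my own): slicing the original string with the spans recovers each token's exact surface form, and the gaps between spans are exactly the original whitespace, so the full text can be rebuilt losslessly.

```python
from nltk.tokenize import TreebankWordTokenizer

text = "James'  book doesn't cost much."
tokenizer = TreebankWordTokenizer()

# Each span is a (start, end) pair indexing into the original string.
spans = list(tokenizer.span_tokenize(text))
surface = [text[start:end] for start, end in spans]

# Rebuild the text from the spans: copy each gap (original whitespace)
# followed by each token slice.
rebuilt = ""
prev_end = 0
for start, end in spans:
    rebuilt += text[prev_end:start] + text[start:end]
    prev_end = end
rebuilt += text[prev_end:]

assert rebuilt == text  # lossless reconstruction from offsets
```

Note that the spans track the tokenizer's splits (e.g. "doesn't" still yields two spans, "does" and "n't"), but because they index into the untouched original string, no whitespace or spelling information is lost.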

tomaarsen commented 1 year ago

That would be possible, yes. The current approach is not bijective because it uses regular expressions to normalize some of the whitespace. Once tokenized, we only have e.g. ["James", "'"], and no way to know for certain whether this was originally James', James ', or even James  ' with extra whitespace. This results in some discrepancies between the original text and the tokenized-then-detokenized text, which is mitigated somewhat by various rules. For example, 's or 'll will always be appended directly to the previous token: https://github.com/nltk/nltk/blob/56bc4af35906fb636c11d0cbc3c8ea54447def24/nltk/tokenize/treebank.py#L289-L290

Note that this Detokenizer was designed to be used on already tokenized text, from which we can't extract any more information about the whitespace. Also consider that depending on your use case, there may already be an existing approach that you can use to extract tokens and still preserve the original text. spaCy jumps to mind.
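For instance, spaCy's tokenization is non-destructive: every Token keeps its character offset (token.idx) and its trailing whitespace (token.text_with_ws), so the original text is always recoverable. A minimal sketch using the bare English tokenizer (spacy.blank needs no trained model; the example text is my own):

```python
import spacy

# spacy.blank("en") builds a tokenizer-only pipeline; no model download.
nlp = spacy.blank("en")

text = "James'  book doesn't cost much."
doc = nlp(text)

# Each token carries its offset into the original string...
offsets = [(token.text, token.idx) for token in doc]

# ...and text_with_ws preserves trailing whitespace, so concatenating
# the tokens reproduces the input exactly.
assert "".join(token.text_with_ws for token in doc) == text
```

So if preserving the original text is a hard requirement, this kind of offset-carrying token model may save you from retrofitting span tracking yourself.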