pascal-mueller opened 1 year ago
You can use the NLTK TreebankWordDetokenizer for this; see also this SO answer. However, an exact round trip is not guaranteed, because the tokenization process is not bijective.
Couldn't I enable a tracking feature that essentially acts as a bijective map, e.g. recording the start and end index of each token in the original string? If not, could I implement it myself fairly easily, and if so, where would that go?
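As a minimal sketch of the idea (hypothetical helper names, plain whitespace splitting rather than NLTK's actual rules): if every token carries its `(start, end)` offsets into the original string, the mapping is trivially invertible, because the gaps between slices still hold the original whitespace.

```python
import re

def tokenize_with_spans(text):
    # Hypothetical helper: whitespace tokenization that records each
    # token's (start, end) offsets in the original string.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def reconstruct(text, spans):
    # Rebuild the input from the spans: each token is a verbatim slice,
    # and the gaps between slices carry the original whitespace.
    out, pos = [], 0
    for tok, start, end in spans:
        out.append(text[pos:start])  # untouched whitespace gap
        out.append(tok)
        pos = end
    out.append(text[pos:])
    return "".join(out)

text = "James  '   book"
spans = tokenize_with_spans(text)
assert reconstruct(text, spans) == text  # lossless round trip
```

The point is only that offset tracking makes the round trip lossless; a real implementation would have to record offsets through all of the Treebank tokenizer's regex substitutions, which is where the work lies.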
That would be possible, yes. The current approach is not bijective because it uses regular expressions to normalize some of the whitespace. Once tokenized, we only have e.g. `["James", "'"]`, and no way to know for certain whether this was originally `James'`, `James '`, or even `James  '` with extra whitespace. This results in some discrepancies between the original text and the tokenized-then-detokenized text, which is mitigated somewhat by various rules. For example, `'s` or `'ll` will always be appended directly to the previous token:
https://github.com/nltk/nltk/blob/56bc4af35906fb636c11d0cbc3c8ea54447def24/nltk/tokenize/treebank.py#L289-L290
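As a toy illustration of that kind of rule (this is not NLTK's actual implementation, and the clitic list here is made up for the example): tokens recognized as clitics are glued to the previous token instead of being separated by a space.

```python
# Illustrative only: a tiny detokenizer that joins tokens with spaces,
# except for clitics, which attach directly to the preceding token.
CLITICS = {"'s", "'ll", "'re", "'ve", "'d", "'m", "n't"}

def toy_detokenize(tokens):
    out = ""
    for tok in tokens:
        if out and tok not in CLITICS:
            out += " "
        out += tok
    return out

toy_detokenize(["She", "'ll", "go"])   # -> "She'll go"
toy_detokenize(["It", "is", "fine"])   # -> "It is fine"
```

Note how the rule resolves the ambiguity in one direction only: `["She", "'ll"]` always becomes `She'll`, even if the original text happened to contain a space.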
Note that this detokenizer was designed to be used on already-tokenized text, from which we can't extract any more information about the whitespace. Also consider that, depending on your use case, there may already be an existing approach that extracts tokens while preserving the original text. spaCy jumps to mind.
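For what it's worth, NLTK's own `TreebankWordTokenizer.span_tokenize` already returns character offsets, which gives you exactly this kind of lossless mapping for simple inputs (alignment can be trickier for text containing straight double quotes, which the tokenizer normalizes):

```python
from nltk.tokenize import TreebankWordTokenizer

text = "Good muffins cost $3.88 in New York."
tok = TreebankWordTokenizer()
tokens = tok.tokenize(text)
spans = list(tok.span_tokenize(text))

# Each span indexes directly into the original string, so the exact
# surface form (whitespace included) is never lost.
assert all(text[s:e] == t for (s, e), t in zip(spans, tokens))
```

With the spans in hand, "detokenization" is just keeping a reference to the original string, as sketched above.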
Hello,
so I'm using the English tokenizer and noticed that the sentences I get back are trimmed. I was wondering if I can somehow reconstruct the text from the tokens reliably.
If not, would it be possible to write my own "pickle" file that doesn't do the trimming, or to override the default tokenizer? Or maybe there are hooks or a plugin system I didn't see?
Thanks in advance.