pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text

Problem with data.Field preprocessing #388

Open davidalbertonogueira opened 5 years ago

davidalbertonogueira commented 5 years ago

As can be seen in the code sample below, we get different results depending on whether the preprocessor is applied through the Field preprocessing pipeline or called directly on the raw sentence.

When the Field preprocessing pipeline is used, the _text_processor_ is called at the token level rather than at the sentence level, as had been assumed.

The docstring for the argument does note this: "The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing."

However, this behavior prevents preprocessing tools such as ekphrasis (https://github.com/cbaziotis/ekphrasis) from converting expressions like "October 10th" to <date>, among other things.

I would therefore suggest adding another argument that receives a Pipeline to be applied to examples before tokenizing, as sketched below.
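
For illustration only, a minimal sketch of how such an argument might be used; the pre_tokenizing keyword is hypothetical and does not exist in torchtext:

# Hypothetical API: 'pre_tokenizing' is an invented name for the requested argument,
# i.e. a Pipeline applied to the raw string before tokenization, so that sentence-level
# tools such as ekphrasis see the whole text at once.
text_field = data.Field(
    pre_tokenizing=data.Pipeline(lambda x: " ".join(text_processor.pre_process_doc(x))),
    tokenize=lambda x: x.split(),
)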

Example demonstrating the current behavior:

from torchtext import data, vocab
import torch.optim as optim
import torch.nn.functional as F
import torch.nn as nn
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
            # terms that will be normalized
            normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
                'time', 'date', 'number'],
            # terms that will be annotated
            annotate={"hashtag", "allcaps", "elongated", "repeated",
                'emphasis', 'censored'},
            fix_html=True,  # fix HTML tokens

            # corpus from which the word statistics are going to be used
            # for word segmentation
            segmenter="twitter",

            # corpus from which the word statistics are going to be used
            # for spell correction
            corrector="twitter",

            unpack_hashtags=True,  # perform word segmentation on hashtags
            unpack_contractions=True,  # Unpack contractions (can't -> can not)
            spell_correct_elong=False,  # spell correction for elongated words

            # select a tokenizer. You can use SocialTokenizer, or pass your own;
            # the tokenizer should take a string as input and return a list of tokens
            tokenizer=SocialTokenizer(lowercase=True).tokenize,

            # list of dictionaries for replacing tokens extracted from the text
            # with other expressions. You can pass more than one dictionary.
            dicts=[emoticons]
        )

Reading twitter - 1grams ... Reading twitter - 2grams ... Reading twitter - 1grams ...

>>> def custom_processing(x, text_processor):
...    text = " ".join(text_processor.pre_process_doc(x))
...    return text

>>> text = "That Mexico vs USA commercial with trump gets your blood boiling. Race war October 10th. Imagine that parking lot. Gaddamnnnnnn VIOLENCE!!!"

>>> text_to_process = data.Field(preprocessing=data.Pipeline(lambda x: custom_processing(x, text_processor)))
>>> Dataset_input = [data.Example.fromlist(data=[text], fields=[('text', text_to_process)])]
>>> Dataset_input[0]

<torchtext.data.example.Example object at 0x000001ACFA71A748>

>>> Dataset_input[0].text

['that', 'mexico', 'vs', '<allcaps> usa </allcaps>', 'commercial', 'with', 'trump', 'gets', 'your', 'blood', 'boiling .', 'race', 'war', 'october', '1 0 th .', 'imagine', 'that', 'parking', 'lot .', 'gaddamn <elongated>', '<allcaps> violence </allcaps> ! <repeated>']

>>> processed_text = " ".join(text_processor.pre_process_doc(text))
>>> processed_text

'that mexico vs <allcaps> usa </allcaps> commercial with trump gets your blood boiling . race war <date> . imagine that parking lot . gaddamn <elongated> <allcaps> violence </allcaps> ! <repeated>'

>>> print( " ".join(text_processor.pre_process_doc("10th")))

1 0 th
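
To isolate the root cause without torchtext, here is a minimal sketch using the text_processor defined above; the expected outputs in the comments are inferred from the results shown earlier:

# Token-level vs. sentence-level application of the same preprocessor:
# multi-token expressions such as "October 10th" can only be recognized
# when the preprocessor sees the whole sentence at once.
sentence = "Race war October 10th"
per_token = [" ".join(text_processor.pre_process_doc(tok)) for tok in sentence.split()]
per_sentence = " ".join(text_processor.pre_process_doc(sentence))
print(per_token)     # ['race', 'war', 'october', '1 0 th'] -- no <date>
print(per_sentence)  # 'race war <date>'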
mttk commented 5 years ago

I see. A workaround for the time being would be to define your own tokenizer, which you could create as follows:

custom_processing = ...
spacy_tokenizer = torchtext.data.utils.get_tokenizer('spacy') # just an example

def my_tokenizer(example):
  preprocessed_example = custom_processing(example)
  tokenized_example = spacy_tokenizer(preprocessed_example.rstrip('\n'))
  return tokenized_example

And you can simply pass tokenize=my_tokenizer to the Field constructor. I haven't tried this, but it should work.
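
Putting the pieces together, a self-contained sketch of that workaround (untested, as noted above; it assumes the text and text_processor objects defined in the original post, and that spaCy is installed for the 'spacy' tokenizer):

import torchtext
from torchtext import data

spacy_tokenizer = torchtext.data.utils.get_tokenizer('spacy')

def my_tokenizer(example):
    # Apply the sentence-level ekphrasis preprocessing first, then tokenize the result.
    preprocessed_example = " ".join(text_processor.pre_process_doc(example))
    return spacy_tokenizer(preprocessed_example.rstrip('\n'))

text_to_process = data.Field(tokenize=my_tokenizer)
example = data.Example.fromlist(data=[text], fields=[('text', text_to_process)])
print(example.text)  # the whole sentence is preprocessed before tokenization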