**davidalbertonogueira** opened this issue 5 years ago
I see. A workaround for the time being would be to define your own tokenizer, which you could create as follows:
```python
import torchtext

custom_processing = ...  # your sentence-level preprocessing function

spacy_tokenizer = torchtext.data.utils.get_tokenizer('spacy')  # just an example

def my_tokenizer(example):
    # Preprocess the whole sentence first, then tokenize the result.
    preprocessed_example = custom_processing(example)
    tokenized_example = spacy_tokenizer(preprocessed_example.rstrip('\n'))
    return tokenized_example
```
You can then simply pass `tokenize=my_tokenizer` to the `Field` constructor. I haven't tried this, but it should work.
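For concreteness, a minimal usage sketch, assuming the legacy `torchtext.data.Field` API (the `TEXT` field name is just illustrative):

```python
from torchtext import data

# Because my_tokenizer runs custom_processing on the raw string before
# splitting, the preprocessing effectively happens at the sentence level.
TEXT = data.Field(sequential=True, tokenize=my_tokenizer)
```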
As can be seen in the code sample below, we get different results when using the `Field` preprocessing pipeline: the _text_processor_ is called at the token level, instead of at the sentence level as was assumed.
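To illustrate why the two levels give different results, here is a minimal, self-contained sketch (the `normalize_dates` function is hypothetical, standing in for any sentence-level processor such as ekphrasis):

```python
import re

def normalize_dates(text):
    # Hypothetical stand-in for a sentence-level text processor:
    # replaces expressions like "October 10th" with a <date> placeholder.
    return re.sub(r'\b[A-Z][a-z]+ \d{1,2}(st|nd|rd|th)\b', '<date>', text)

sentence = "See you on October 10th"

# Sentence level (what was assumed): the multi-token date is matched.
print(normalize_dates(sentence))  # -> See you on <date>

# Token level (what Field's preprocessing pipeline actually does):
# each token is processed in isolation, so the pattern never matches.
print([normalize_dates(tok) for tok in sentence.split()])
# -> ['See', 'you', 'on', 'October', '10th']
```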
The docstring for the argument does warn about this: "The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing."
However, this behavior prevents preprocessing tools like ekphrasis (https://github.com/cbaziotis/ekphrasis) from converting multi-token expressions like "October 10th" to `<date>`, among others.
I would, therefore, suggest adding another argument to receive a Pipeline to be applied to examples before tokenizing.
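A minimal sketch of what this could look like, written as a `Field` subclass rather than an upstream change (the `pre_tokenize` argument and the `PreTokenizeField` name are hypothetical, not part of torchtext):

```python
from torchtext import data

class PreTokenizeField(data.Field):
    """Hypothetical Field with a hook applied to raw strings before tokenizing."""

    def __init__(self, pre_tokenize=None, **kwargs):
        self.pre_tokenize = pre_tokenize
        super().__init__(**kwargs)

    def preprocess(self, x):
        # Run the sentence-level processor on the raw string first,
        # then fall back to Field's normal tokenize/preprocess path.
        if self.pre_tokenize is not None and isinstance(x, str):
            x = self.pre_tokenize(x)
        return super().preprocess(x)
```

This would keep the tokenizer itself unchanged and make the sentence-level step an explicit, reusable part of the field definition instead of being hidden inside a custom tokenizer.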
Example:

```
Reading twitter - 1grams ...
Reading twitter - 2grams ...
Reading twitter - 1grams ...
```