Preprocessing: How to Deal with Text Truncation

tarrade / proj_multilingual_text_classification

Explore multilingual text classification using embeddings, BERT and deep learning architectures

Apache License 2.0


Preprocessing: How to Deal with Text Truncation #56

Closed vluechinger closed 4 years ago

vluechinger commented 4 years ago

At the moment, we just take the first 510 tokens of each sequence (BERT's 512-token limit minus the two special tokens [CLS] and [SEP]). Other approaches would be to take the last 510 tokens, or to combine both by taking the first 255 and the last 255 tokens. For movie reviews, where the verdict often comes at the end of the text, this could improve the model and is one aspect we could try out.
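The three truncation strategies can be sketched as follows. This is only a minimal illustration working on a plain list of token IDs: the function name `truncate_tokens` and its parameters are hypothetical and not taken from the project code, and the special tokens are assumed to be added afterwards.

```python
from typing import List


def truncate_tokens(
    tokens: List[int],
    max_len: int = 510,
    strategy: str = "head",
    head_len: int = 255,
) -> List[int]:
    """Cut a token sequence down to at most max_len tokens.

    strategy:
      "head"      keep the first max_len tokens
      "tail"      keep the last max_len tokens
      "head_tail" keep the first head_len tokens plus the last
                  (max_len - head_len) tokens
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    # Short sequences are returned unchanged (as a copy).
    if len(tokens) <= max_len:
        return list(tokens)

    if strategy == "head":
        return list(tokens[:max_len])
    if strategy == "tail":
        return list(tokens[-max_len:])
    if strategy == "head_tail":
        if not 0 < head_len < max_len:
            raise ValueError("head_len must be between 1 and max_len - 1")
        tail_len = max_len - head_len
        return list(tokens[:head_len]) + list(tokens[-tail_len:])

    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical example: a 1000-token review reduced with each strategy.
example = list(range(1000))
head = truncate_tokens(example, strategy="head")
tail = truncate_tokens(example, strategy="tail")
mixed = truncate_tokens(example, strategy="head_tail")
```

With the defaults, `mixed` contains the first 255 and the last 255 token IDs, so the total stays at 510 and leaves room for [CLS] and [SEP].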

tarrade commented 4 years ago

https://arxiv.org/abs/2004.05150

tarrade commented 4 years ago

Important for long texts. Here, 95% of the texts fit within the token limit. Longformer (the paper linked above) can be an option.
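To check how many texts are actually affected by truncation, one could measure the share of tokenized sequences that fit within the limit. A minimal sketch, assuming the token counts per text are already available; `fraction_within_limit` and the sample lengths are hypothetical and not taken from the project data.

```python
from typing import Iterable


def fraction_within_limit(lengths: Iterable[int], max_len: int = 510) -> float:
    """Return the share of sequences whose token count is <= max_len."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Hypothetical token counts for four texts; two of them exceed the limit.
sample_lengths = [100, 510, 511, 2000]
coverage = fraction_within_limit(sample_lengths)
```

If the coverage is high, plain truncation affects only a few texts; if it is low, a long-sequence model is more likely to pay off.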