Hashtags removal

Hello,

First, thanks for sharing your work, and congratulations on winning the competition. I was wondering why, when pre-processing the tweets, you used what appears to be a custom library to, for instance, replace every hashtag with <hashtag>.

In the original XLM-RoBERTa Twitter model, it appears that only usernames and URLs are replaced with placeholder tokens, not hashtags. At least, that is what their preprocess function does:

def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

When using their tokenizer, we can see that hashtags are indeed split correctly, so they can carry information about the meaning of the tweet and, maybe, its emotion.
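To make the difference concrete, here is a minimal sketch comparing the two pre-processing strategies on a made-up tweet. The `replace_hashtags` helper and its regular expression are my own assumptions, standing in for whatever the custom library actually does; only `preprocess` comes from the function quoted above.

```python
import re

def preprocess(text):
    # Same rule as the preprocess function quoted above:
    # mask usernames and URLs, leave hashtags untouched.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical stand-in for the hashtag replacement step (an assumption,
# not the repository's actual implementation).
HASHTAG_RE = re.compile(r"#\w+")

def replace_hashtags(text, token="<hashtag>"):
    # Replace each hashtag with a single placeholder token.
    return HASHTAG_RE.sub(token, text)

tweet = "@maria que alegria #felicidad https://t.co/abc"  # invented example
kept = preprocess(tweet)         # hashtag text survives
masked = replace_hashtags(kept)  # hashtag text is discarded

print(kept)
print(masked)
```

In the first output the word inside the hashtag is still available to the tokenizer, while in the second it is collapsed into one placeholder, which is the information loss I am asking about.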

Did you try keeping the hashtags and get worse results?

Thank you!

gsi-upm / emoevales-iberlef2021

Hashtags removal #1