gsi-upm / emoevales-iberlef2021

GSI participation at EmoEvalEs - IberLEF 2021

Hashtags removal #1

Open kinoute opened 2 years ago

kinoute commented 2 years ago

Hello,

First, thanks for sharing your work, and congratulations on winning the competition. I was wondering why, when pre-processing the tweets, you used what appears to be a custom library to, for instance, strip the hashtags and replace each of them with <hashtag>.

In the original XLM-RoBERTa Twitter model, it appears that only usernames and URLs are replaced by placeholder tokens, not hashtags, at least according to their preprocess function:

def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
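To make the behaviour concrete, here is a small sketch that runs the function above on a made-up tweet. The sample text is an assumption for illustration only and does not come from the competition data.

```python
def preprocess(text):
    # Same logic as the function quoted above: mask mentions and URLs only.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


# Hypothetical sample tweet, invented for this example.
sample = "@maria mira esto https://t.co/abc #NotreDame"
print(preprocess(sample))  # → @user mira esto http #NotreDame
```

The mention and the URL are masked, while the hashtag reaches the tokenizer unchanged.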

With their tokenizer, we can see that hashtags are indeed split correctly and can carry information about the meaning of the tweet and, possibly, its emotion:

[Screenshot, 2021-12-09 15:53:41]

Did you try keeping the hashtags and get worse results?

Thank you!

dveni commented 2 years ago

Hi there and thanks for reaching out!

Indeed, this is one of those decisions that is only backed experimentally: we tested with and without hashtags and got better results when we filtered them out. This is probably due to overfitting to the competition dataset, so it shouldn't be taken as a rule of thumb.
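A minimal sketch of how such a with/without comparison could be set up, assuming a single switch on the preprocessing function. The `keep_hashtags` and `hashtag_token` parameters are hypothetical names introduced here, and this is not the custom library used in the repository.

```python
def preprocess(text, keep_hashtags=True, hashtag_token="<hashtag>"):
    """Mask mentions and URLs; optionally replace hashtags with a placeholder.

    keep_hashtags and hashtag_token are illustrative assumptions,
    not part of the original model's preprocessing.
    """
    new_text = []
    for t in text.split(" "):
        if t.startswith("@") and len(t) > 1:
            t = "@user"          # mask user mentions
        elif t.startswith("http"):
            t = "http"           # mask URLs
        elif not keep_hashtags and t.startswith("#") and len(t) > 1:
            t = hashtag_token    # filter the hashtag, keep a marker
        new_text.append(t)
    return " ".join(new_text)


# Hypothetical tweet; build both variants of the same input.
tweet = "@a hola #Feliz"
with_tags = preprocess(tweet, keep_hashtags=True)
without_tags = preprocess(tweet, keep_hashtags=False)
```

Training once on each variant and comparing validation scores would reproduce the kind of experiment described, though the outcome may well differ on another dataset.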