Open kinoute opened 2 years ago
Hi there and thanks for reaching out!
Indeed, this is one of those decisions that is backed only by experiments: we tested with and without hashtags and got better results when we filtered them out. This is probably overfitting to the competition dataset, so it shouldn't be taken as a rule of thumb.
Hello,
First, thanks for sharing your work and congratulations on winning the competition. I was wondering why, while pre-processing the tweets, you used an apparently custom library to, for instance, strip the hashtags and replace all of them with `<hashtag>`.

It appears that the original XLM-RoBERTa Twitter model replaces only usernames and URLs with placeholder tokens, not the hashtags, at least according to their `preprocess` function.

When using their tokenizer, we can see that the hashtags are indeed split correctly and can carry information about the meaning of the tweet and, maybe, its emotion.
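To make the comparison concrete, here is a minimal sketch of the two preprocessing variants being discussed. It is an illustration under assumptions, not the code from either project: the placeholder strings `@user` and `http` are what I believe the XLM-RoBERTa Twitter model card uses, and `mask_hashtags` is a hypothetical stand-in for the custom library's hashtag replacement.

```python
import re


def preprocess(text: str) -> str:
    """Replace usernames and URLs with placeholders; hashtags are left untouched.

    The placeholder strings "@user" and "http" are assumed, not confirmed.
    """
    new_text = []
    for t in text.split(" "):
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        t = "http" if t.startswith("http") else t
        new_text.append(t)
    return " ".join(new_text)


def mask_hashtags(text: str) -> str:
    """Hypothetical extra step: replace every hashtag with a <hashtag> token."""
    return re.sub(r"#\w+", "<hashtag>", text)


tweet = "@bob I love this! #happy https://t.co/abc"

kept = preprocess(tweet)       # hashtag survives and reaches the tokenizer
masked = mask_hashtags(kept)   # hashtag content is discarded

print(kept)    # @user I love this! #happy http
print(masked)  # @user I love this! <hashtag> http
```

In the first variant the word "happy" is still visible to the model; in the second it is lost, which is the information my question is about.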
Did you try keeping the hashtags and get worse results?
Thank you!