rewire-online / edos

Public repository for SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS)
Creative Commons Zero v1.0 Universal

Would it be possible to include date information? #1

Open johann-petrak opened 1 year ago

johann-petrak commented 1 year ago

Since topics, named entities (NEs), etc. related to sexist remarks change over time, it would be interesting to have date information associated with the labeled and unlabeled texts. Would that be possible?

paul-rottger commented 1 year ago

Hi Johann! Thanks for raising this and sorry for the late reply.

We cannot easily provide timestamps, but even if we could, I would not expect that information to be super useful because of how we sampled the data. All the comments we collected from Gab and Reddit were originally posted between August 2016 and October 2018. This is the time span of the Gab dump we used, and we chose to match it on Reddit. Beyond that, we did not account for time in our sampling. Therefore, there is likely a strong imbalance across time periods correlated with general activity (e.g. Gab started in 2016 and was much more active in 2018). Also, each individual month will have relatively little data. Based on my experience in other work, you need a fairly large and well-structured dataset to meaningfully investigate language change.

Sorry to not have better news! This is a super interesting problem, but not one we set up to investigate with this dataset.

Cheers, Paul

johann-petrak commented 1 year ago

Thank you! I was just wondering whether it would be technically possible to trace back the dates, because datasets like this one often get used in downstream research with different aims. Even if the time span is not that long or the data is a bit sparse, I have seen data like this get combined with other data, in which case date information can be useful. Also, an equal distribution over time is not necessarily needed for all kinds of research; just knowing which time period the texts are from would be extremely useful. So if there is a technical way to add date information, I still think it could greatly benefit the research community eventually. This would be useful to have for both the labeled and unlabeled data.