mitjat / langid_eval

Language identification golden dataset of tweets
Creative Commons Zero v1.0 Universal
12 stars · 1 fork

Possible dataset expansion? #1

Open DonaldTsang opened 4 years ago

DonaldTsang commented 4 years ago

Would it be possible to expand the established dataset? https://github.com/wooorm/franc unfortunately used the UDHR as its dataset, which can skew the model.

mitjat commented 4 years ago

Hey, I'm not sure I understand the question. What kind of expansion specifically would you like to see?


DonaldTsang commented 4 years ago

It says "120k rows" but I don't know how many tweets exist per language; maybe some languages have fewer than 2000 or so.

P.S. Could you store the tweets in plain text, since users sometimes delete their own tweets?

mitjat commented 4 years ago

You're talking about uniformly_sampled.tsv, right? You can count the number of tweets per language yourself. For example, in bash: cut -f1 uniformly_sampled.tsv | sort | uniq -c

I would love to be able to attach the plain-text version of the tweets, but Twitter's licensing prevents me from doing so. (Deletion is precisely the problem: when a user deletes a tweet, it should disappear from all datasets, even static ones like this one.) I actually no longer have the plain text myself.
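The same per-language tally can be sketched in Python. This is a minimal sketch, assuming (as the bash command implies) that the first tab-separated column of uniformly_sampled.tsv is the language code; the sample data below is made up for illustration:

```python
from collections import Counter
from io import StringIO

# Stand-in for open("uniformly_sampled.tsv"); the row layout here
# (language code in the first tab-separated column) is an assumption.
sample = StringIO(
    "en\t1234567890\n"
    "en\t1234567891\n"
    "fr\t1234567892\n"
    "de\t1234567893\n"
    "fr\t1234567894\n"
)

# Count rows per language code; unlike `uniq -c`, Counter
# does not require the input to be sorted first.
counts = Counter(line.split("\t", 1)[0] for line in sample if line.strip())

for lang, n in counts.most_common():
    print(f"{n}\t{lang}")
```

To run this against the real file, replace the StringIO stand-in with `open("uniformly_sampled.tsv")`.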


DonaldTsang commented 4 years ago

@mitjat That is unfortunate. If that is the case, would it be possible to publish an updated list every year to keep up with the Joneses? Or maybe use another, similar dataset that does not have this issue (maybe news translations or Wikipedia)?