langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

possible duplicate uni_lemmas #210

Closed kachergis closed 2 years ago

kachergis commented 3 years ago

This is open to debate, but in my mind these uni_lemmas are duplicates (violating the goal of the uni_lemma definition):

- cap = hat
- underpants = underwear
- foot = feet (I assume we want singular?)
- nail = nails (also ambiguous: do these all refer to body part?)
- stone = rock
- street = road
- beautiful = pretty
- cup = glass (I feel less strongly about this)

For the below, I'm curious whether others would consider the gloss -> uni_lemma mappings to be similar enough: Portuguese_European_WG has 'Coca-cola': can we map that to 'soda'? (IMO, yes)

Should 'sneakers/trainers' map to 'shoes'? Again, I'd say yes.

A more grammatical one: can 'your' (Port: 'teu') map to 'yours' (nobody has defined a 'your' uni_lemma)? I'd say yes (but we do have 'my' and 'mine' as uni_lemmas: is that duplication?).
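For illustration only, the merges proposed at the top of this comment could be recorded as a small lookup table and applied when resolving a uni_lemma. This is a minimal sketch, not the Wordbank implementation: the `MERGES` table and `canonical_uni_lemma` function are hypothetical names, and which member of each pair survives is an assumption (except foot/feet, where singular is suggested above). The cup/glass pair is left out because it is the least certain.

```python
# Hypothetical merge table: duplicate uni_lemma -> uni_lemma to keep.
# The direction of each merge is an assumption made for this sketch.
MERGES = {
    "cap": "hat",
    "underpants": "underwear",
    "feet": "foot",
    "nails": "nail",
    "stone": "rock",
    "street": "road",
    "beautiful": "pretty",
}


def canonical_uni_lemma(uni_lemma: str) -> str:
    """Follow merge links until reaching a uni_lemma that is not merged away."""
    seen = set()
    while uni_lemma in MERGES:
        if uni_lemma in seen:
            # Guard against accidental loops such as a -> b -> a.
            raise ValueError(f"cycle in merge table at {uni_lemma!r}")
        seen.add(uni_lemma)
        uni_lemma = MERGES[uni_lemma]
    return uni_lemma
```

Keeping the merges in one table makes each decision easy to review and to revert if the group disagrees with a particular pair.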

alvinwmtan commented 3 years ago

On pronouns: it may be worth extracting all the pronouns cross-linguistically to look for the best representation. For example, because uni_lemmas are in English, there's a lost distinction between second-person singular and plural pronouns (e.g. tu vs vous in French), despite the fact that literally all wordbank languages other than English have this distinction.

A potential option is to use a linguistic gloss for these pronouns (e.g. "I" = 1sg.subj, "me" = 1sg.obj, "my" = 1sg.poss, "mine" = 1sg.poss.obj), although a question that arises is how to treat languages that make fewer differentiations (e.g. "you" = 2?? {i.e. 2sg.subj + 2sg.obj + 2pl.subj + 2pl.obj}).

mcfrank commented 3 years ago

I think the linguistic gloss is a great idea Alvin!! thanks.

re the fewer differentiations, we could have an inclusion hierarchy, e.g. 1sg.subj is the top level, so if a language doesn't distinguish, you just list that as the unilemma, but then you can list more otherwise?
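One way to picture the inclusion hierarchy, as a rough sketch only: treat each pronoun gloss as a set of features where an unspecified feature means "this language does not distinguish it". A coarser gloss then includes every finer gloss that agrees on the features it does specify. The `PronounGloss` class and its field names are hypothetical, and compound roles such as `poss.obj` are treated as a single opaque string here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PronounGloss:
    """Hypothetical gloss: person is required, number and role are optional."""
    person: int
    number: Optional[str] = None  # "sg" or "pl"; None if not distinguished
    role: Optional[str] = None    # e.g. "subj", "obj", "poss"; None if not distinguished

    def subsumes(self, other: "PronounGloss") -> bool:
        """True if every feature specified here matches the other gloss."""
        return (
            self.person == other.person
            and self.number in (None, other.number)
            and self.role in (None, other.role)
        )

    def label(self) -> str:
        """Render in the dotted style used in this thread, e.g. '1sg.subj'."""
        head = f"{self.person}{self.number or ''}"
        return head if self.role is None else f"{head}.{self.role}"


# English "you" leaves number and role unspecified; French "tu" is 2sg.subj.
english_you = PronounGloss(person=2)
french_tu = PronounGloss(person=2, number="sg", role="subj")
```

With this representation, `english_you.subsumes(french_tu)` holds while the reverse does not, which is the asymmetry an inclusion hierarchy needs.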


alvinwmtan commented 3 years ago

Other potential uni_lemmas to collapse:

- sauce <- soy sauce
- coat = jacket

kachergis commented 3 years ago

Thanks, Alvin! I do like the idea of the linguistic gloss, although I would also like to somehow not lose readability for non-linguists. And Mike's suggestion of an inclusion hierarchy seems right! Maybe we could effectively do this by just adding a second column for "looser" matches...
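A sketch of what the second-column idea might look like, under stated assumptions: each item keeps its strict uni_lemma and gains a looser one, so cross-linguistic comparisons can group on whichever column suits the analysis. The column names (`uni_lemma`, `uni_lemma_loose`), the `group_items` helper, and the rows themselves are made up for illustration and are not taken from the Wordbank data.

```python
from collections import defaultdict

# Illustrative rows only; not actual Wordbank items.
ITEMS = [
    {"language": "English", "gloss": "you", "uni_lemma": "2", "uni_lemma_loose": "2"},
    {"language": "French", "gloss": "tu", "uni_lemma": "2sg", "uni_lemma_loose": "2"},
    {"language": "French", "gloss": "vous", "uni_lemma": "2pl", "uni_lemma_loose": "2"},
]


def group_items(items, column):
    """Group (language, gloss) pairs by the value in the chosen uni_lemma column."""
    groups = defaultdict(list)
    for item in items:
        groups[item[column]].append((item["language"], item["gloss"]))
    return dict(groups)


strict = group_items(ITEMS, "uni_lemma")       # three separate groups
loose = group_items(ITEMS, "uni_lemma_loose")  # one group covering all three
```

The strict column preserves the distinctions a language actually makes, while the loose column stays readable and lets items from languages with fewer distinctions line up with the rest.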

BrennanNick commented 3 years ago

Just bumping my Slack message here: I completed an updated list of problematic unilemmas in table form, each with the cited problem and, where I could think of one, a proposed solution. You'll also find a few comments I made about certain groups of unilemmas that I found questionable (such as Russian's adverbs and Cantonese's high concentration of unique unilemmas that appear only in that dataset). Hopefully you will find this helpful; do tell me if there are any changes or additional progress you would like made.

https://docs.google.com/document/d/1gyfKCHykQQTtJQTfKOWCY-RL5Ey56R2XS_vgsdNXIAc/edit?pli=1

kachergis commented 3 years ago

Thanks, Brennan! You raise many good points, some of which the group will need to discuss -- probably next Thurs at 9am: you're welcome to join if you can. Meanwhile, we've been working on getting the per-language lists ready to send out for native speakers to update, here: https://docs.google.com/spreadsheets/d/1RcpMgnjSA0nRbym0iDYcBPjL48IMNQxawmil0d4txsA/edit?usp=sharing (The first tab also has the complete bank of uni-lemmas, which is not yet updated although we'd like to finish that before sending out.)


mikabr commented 2 years ago

has all this been sorted out?

kachergis commented 2 years ago

Yes, @alvinwmtan and I completely overhauled them -- can't claim they're perfect, but they're 76% closer to it! ;)