Revise language options for child registration

Transferring this task over from a Slack conversation to make it a bit easier to track. Partial conversation history:

Kiley: Hi everyone, we noticed that Cantonese is not on the list of languages in the demographic form. A large percentage of the West Coast East Asian population speaks Cantonese. Is it possible to add this language? We also noticed a lack of Indian dialects that our parents regularly report speaking - I can send our language form if you guys are up for adding more?

@kimberscott : That sounds like a great idea to clarify the language form. When we set it up I think @rico used the first 2^N most commonly spoken languages (either as L1 or total) but the names don’t necessarily map onto common usage - e.g. Cantonese is apparently a variety of Yue, but I’m not sure whether e.g. American Cantonese speakers generally know that.

@kimberscott: Any chance you’d be up for comparing the language lists and recommending (a) any changes to the labels - e.g., “Yue -> Cantonese (or other Yue)” or just “Yue -> Cantonese” - and (b) any languages that are genuinely not on there even at the wrong “level”? Otherwise if you can make an issue on GitHub about this that’d be great - this is definitely worth clarifying but (correct me if I’m wrong @rico) not entirely trivial.

Francis: Hello! Just a quick update to this thread: this turns out to be a huge task so it's taking longer than I anticipated because I have to cross check many languages!

It occurred to me that a more elegant solution to this problem might be to add an 'other' option to the language exposure question so parents can enter a language manually? Using the example of parents not knowing Cantonese is a variety of Yue, it would make more sense to give parents the option to manually enter Cantonese and then have a researcher categorize it at the back end, rather than have duplicate languages? Also, having an exhaustive list of languages will make the page incredibly messy. What do you think @Kim? Is this a viable feature request?

@kimberscott : It’s actually very much worth the initial investment of time to have checkboxes that reflect the expected set of responses, rather than relying on ongoing manual re-categorization. (To get into the weeds a bit - if we add “other” with the intention of that covering common languages like Cantonese, in addition to the additional field which won’t automatically work with eligibility criteria and eventually translations, we’d also need to set up an interface for researchers (which?) to edit child data and an arrangement with someone to keep on top of that.) I think an “other” option might make sense in addition, but not in place of improving the options.

This perspective brought to you in part by the free-response ‘race’ field on the Lookit prototype, which allowed me the fun of hand-categorizing free responses :) (We got a lot of responses like “American” and “Muslim” that were impossible to categorize, plus the expected 20 ways to spell Caucasian.)

I do think we want to avoid having duplicate languages, though, so rather than adding Cantonese (for example) I’d suggest replacing the current Yue option (if you think that’s appropriate) with something clearer.

Francis: Thank you for the feedback! Right, I hadn't thought about the automatic categorization and eligibility part. I think one thing I should check with you (and other developers of Lookit!) is whether the intent of the child profile was to include languages that are less common, or just the most common languages for eligibility purposes. I am hesitant to propose any additions to the language list based on our Canadian census (seems slightly Canada-centric to do so, haha!) since we may be introducing many languages that are spoken by a very small population (e.g. Indigenous languages), thus expanding the list to an overwhelming length. On the other hand, I also notice that on the existing list there are quite a few uncommon languages already, so maybe it would be worthwhile to propose additions of other languages? I don't want to step on anyone's toes here so please let me know what you and the Lookit team would prefer!

@kimberscott: That’s a great question about the purpose of the child profile language question. It’s essentially data storage we added ahead of the actual use (e.g. we don’t actually have anyone filtering on kids speaking particular language combinations yet) and so the goals aren’t as well-defined as they might ideally be. There are several potential purposes:

- Let researchers set eligibility criteria for studies based on language (e.g. “speaks at least this language so they’ll understand the stimuli” or “speaks one of these six exact combinations because we’re interested in the difference between speaking two very similar vs two very different languages”). Here recording the most common languages is probably roughly appropriate, with some additional verification likely within specific studies.
- Let researchers set eligibility criteria based on the number of languages a child speaks, e.g. for studies of how monolingual language development differs. Here recording any language a child might speak is more important - we don’t want to assume a kid only speaks one language because they could only find one of their languages on the list.
- Let Lookit select an appropriate translation of a study based on which translations are available and which languages the child speaks (this is substantially further off and needs more scoping - rough outline of things to consider here: https://github.com/lookit/lookit-api/issues/181).
- Get a rough estimate of which languages Lookit participants speak, to evaluate representativeness and get a sense of what studies about specific languages would be possible with the existing userbase. This information could also go to researchers running particular studies who may not care about language background specifically but might want to report it along with other demographic information in case it turns out to matter for replication.
- Make sure all families feel seen/heard.
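
To make the first two purposes concrete, here is a minimal sketch of what language-based eligibility checks could look like. Everything in it is an assumption for illustration: the function names, the idea of storing each child's languages as a list of short codes, and the `has_other_language` flag standing in for a possible "other" option. None of it reflects how Lookit actually implements eligibility.

```python
from typing import Iterable


def speaks_at_least(child_languages: Iterable[str], required: Iterable[str]) -> bool:
    """True if the child speaks every required language (possibly among others)."""
    return set(required) <= set(child_languages)


def speaks_exact_combination(
    child_languages: Iterable[str],
    allowed_combinations: Iterable[Iterable[str]],
) -> bool:
    """True if the child's full set of languages matches one allowed combination."""
    spoken = frozenset(child_languages)
    return any(spoken == frozenset(combo) for combo in allowed_combinations)


def is_monolingual(child_languages: Iterable[str], has_other_language: bool = False) -> bool:
    """True only if exactly one language is recorded and no unlisted one was reported.

    The hypothetical has_other_language flag guards against treating a child as
    monolingual just because a second language was missing from the list.
    """
    return len(set(child_languages)) == 1 and not has_other_language


# Hypothetical codes: "en" for English, "yue" for Cantonese/Yue.
print(speaks_at_least(["en", "yue"], ["en"]))
print(is_monolingual(["en"], has_other_language=True))
```

The monolingual check is where an incomplete list hurts most: without some signal that a language was missing from the options, a bilingual child can look monolingual.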

Given that Lookit studies are currently in English only, I’m on board with adding more explicit options for languages that are commonly spoken in conjunction with English - e.g., in Canada. Upon writing those out, it does seem it’d probably be worth also having an “other” option even if we don’t do anything with it right away.

We should also add the first N most used sign languages.

But I would like to do it in a way that doesn’t make the language section too overwhelming. Options include eliminating some of the least commonly spoken languages on that list and focusing more on languages that we see more often, or doing some of this in the UI (e.g., show the most commonly spoken languages along with a “show more” button). We got the initial list from here (and used the first 64) if that’s useful for comparing proposed languages in terms of total number of speakers.
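
As a rough sketch of the “show more” idea, the list could be split into a part shown up front and a part revealed on request. The function name, the `always_visible` parameter, and the example ordering below are hypothetical; the sketch only assumes the languages are already ranked from most to least common.

```python
from typing import AbstractSet, List, Sequence, Tuple


def split_for_display(
    ranked_languages: Sequence[str],
    n_visible: int,
    always_visible: AbstractSet[str] = frozenset(),
) -> Tuple[List[str], List[str]]:
    """Split a ranked language list into (shown up front, behind "show more").

    The first n_visible entries are shown, plus any language in always_visible
    (for example, ones a lab sees often in its own participant population).
    Relative order is preserved in both lists.
    """
    visible: List[str] = []
    hidden: List[str] = []
    for index, language in enumerate(ranked_languages):
        if index < n_visible or language in always_visible:
            visible.append(language)
        else:
            hidden.append(language)
    return visible, hidden


# Illustrative ordering only, not the actual list or ranking.
ranked = ["English", "Mandarin", "Hindi", "Spanish", "Cantonese (or other Yue)"]
shown, behind_button = split_for_display(ranked, 2, {"Cantonese (or other Yue)"})
print(shown)
print(behind_button)
```

This keeps every option available for eligibility purposes while limiting what a parent sees at first.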

I think this is becoming several distinct pieces - (a) clarifying languages that are present but listed under maybe-less-used names, (b) reviewing the list and adding options that are common enough in your participant population to include - including sign languages, and (c) adding an “other” option and making sure the UI isn’t now overwhelming.

lookit / lookit-api

Revise language options for child registration #491