langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

French CAT data #298

Closed by alvinwmtan 5 months ago

alvinwmtan commented 11 months ago

Processed files from Sho Tsuji: [French_French_WG].csv [French_French_WS].csv FrenchFrenchWG_Tsuji_data.csv FrenchFrenchWG_Tsuji_fields.csv FrenchFrenchWG_Tsuji_values.csv FrenchFrenchWS_Tsuji_data.csv FrenchFrenchWS_Tsuji_fields.csv FrenchFrenchWS_Tsuji_values.csv FrenchFrenchWS_TsujiLabvanced_data.csv FrenchFrenchWS_TsujiLabvanced_fields.csv FrenchFrenchWS_TsujiLabvanced_values.csv FrenchFrench_notes.md

@kachergis Can we get citations for these data? (:

alvinwmtan commented 11 months ago

Ah sorry, citation was in #260:

Mireille Babineau (Department of Psychology, University of Toronto (St. George)), Alex de Carvalho (Université Paris Cité, CNRS, LaPsyDÉ, F-75005 Paris, France), Anne-Caroline Fievet (LSCP (ENS, EHESS, CNRS), Département d’Etudes Cognitives, ENS - PSL Research University), Cécile Crimon (LSCP (ENS, EHESS, CNRS), Département d’Etudes Cognitives, ENS - PSL Research University; Université Paris Cité), Sho Tsuji (International Research Center for Neurointelligence, The University of Tokyo)

I think we just list this as:

Babineau, M., de Carvalho, A., Fievet, A., Crimon, C., and Tsuji, S. (unpublished).

HenryMehta commented 7 months ago

@alvinwmtan The [French_French_WG].csv does not match the existing file for this instrument

alvinwmtan commented 7 months ago

@HenryMehta I appended four new items to the existing file for WG (1x first signs, 3x phrases), and one new item for WS (1x combine). The rest of the items should be identical to the existing form definition files.

HenryMehta commented 7 months ago

@alvinwmtan WS looks fine. I am worried about WG (take a look here: https://github.com/langcog/wordbank/blob/master/raw_data/French_French_WG/%5BFrench_French_WG%5D.csv). The file in use is very different from the one you sent. The key data might be the same, and I'll simply implement your version if you like.

alvinwmtan commented 7 months ago

@HenryMehta I think the spreadsheet editor I was using changed all the linebreak characters so it looks like every line is modified, but the actual content should be the same, so let's use the new version from this issue.
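One way to confirm that only the line endings changed is to compare the parsed rows rather than the raw text. The following is a minimal sketch, not the check that was actually run here; the function names and the sample columns are hypothetical.

```python
import csv
import io
from typing import List, Optional


def csv_rows(text: str) -> List[List[str]]:
    """Parse CSV text into rows. newline="" leaves the line endings
    untouched so the csv module handles \\n, \\r\\n, and \\r itself."""
    return list(csv.reader(io.StringIO(text, newline="")))


def appended_rows(old_text: str, new_text: str) -> Optional[List[List[str]]]:
    """Return the rows added at the end of new_text if everything before
    them matches old_text row for row; return None otherwise."""
    old_rows, new_rows = csv_rows(old_text), csv_rows(new_text)
    if new_rows[:len(old_rows)] != old_rows:
        return None
    return new_rows[len(old_rows):]


# Hypothetical sample: same content, different line endings, one new item.
old = "item_id,definition\r\nitem_1,chien\r\n"
new = "item_id,definition\nitem_1,chien\nitem_2,chat\n"
extra = appended_rows(old, new)
```

If `extra` is a list, the existing rows are unchanged and the list holds only the appended items; `None` means some existing row really differs.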

HenryMehta commented 7 months ago

@alvinwmtan I will do that. I am trying something to sort out the size issue at the moment, but it has been running on my PC for 36 hours so far, so I'm not sure it will work. I'm going to leave it running until tomorrow morning before stopping it. I can't work on this while it is running.

HenryMehta commented 7 months ago

@alvinwmtan My idea hasn't worked. The row size problem I was talking about with Arabic Saudi is also an issue for row 5 or 6 of FrenchFrenchWS_TsujiLabvanced_values.csv. The problem is that the 'understands' responses, added together, take up more than 8k. I think we could sort this by using u instead of understands, p instead of produces, etc. But this would then require changes to the Shiny app. I will test my approach tomorrow so we know if it is an option.
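To make the arithmetic concrete, here is a rough sketch of how much space one row of responses takes with full words versus single characters. The 8192-byte limit is an assumed reading of "8k", and the response count of 800 is invented for illustration; neither comes from the actual data.

```python
from typing import Dict, List

# Assumed reading of the "8k" limit mentioned above (illustrative only).
LIMIT_BYTES = 8192

# Abbreviations proposed in this thread.
ABBREVIATIONS: Dict[str, str] = {"understands": "u", "produces": "p"}


def row_bytes(values: List[str]) -> int:
    """Total UTF-8 size of the response values stored in one row."""
    return sum(len(value.encode("utf-8")) for value in values)


def abbreviate(values: List[str]) -> List[str]:
    """Replace known response values with their one-character codes."""
    return [ABBREVIATIONS.get(value, value) for value in values]


# Hypothetical row: 800 items all answered "understands" (11 bytes each).
full = ["understands"] * 800
short = abbreviate(full)

full_size = row_bytes(full)    # 800 * 11 bytes
short_size = row_bytes(short)  # 800 * 1 byte
```

Under these assumptions the full-word row exceeds the limit while the abbreviated row is well below it, which is why the one-character codes avoid the storage error.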

HenryMehta commented 7 months ago

@alvinwmtan

French (French) WG is now available to test.

One French (French) WS file is available to test (FrenchFrenchWS_Tsuji.csv) but the other (FrenchFrenchWS_TsujiLabvanced) is giving storage issues. Much like the Arabic (Saudi), I think we need u and p instead of understands and produces.

alvinwmtan commented 7 months ago

@HenryMehta Let's switch to "u" and "p" (but only in dev for now—when we've got all the dev changes done then we can switch it with prod, along with the Shiny app and wordbankr at the same time).

HenryMehta commented 7 months ago

@alvinwmtan Changing the field to 1 character (this also means never, sometimes, often will be n, s, o respectively) has worked for the French (French) WS issue with (FrenchFrenchWS_Tsuji.csv).

Deployed to dev for testing

alvinwmtan commented 7 months ago

@HenryMehta I know some researchers have sometimes/often as a single collapsed category—I suppose we need to use a different character to indicate this category then (in the future)?

HenryMehta commented 7 months ago

I do not believe this is in any of the current wordbank studies. If it is, let me know where and I'll take a look

alvinwmtan commented 7 months ago

@HenryMehta Sorry, I made a mistake in the WG form definition (accidentally had duplicate item_ids). Fixed here:

[French_French_WG].csv FrenchFrenchWG_Tsuji_fields.csv

Both French WS datasets are good
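A quick check for duplicate item_ids before uploading a form definition could look like the sketch below. It assumes the file has an `item_id` column; the function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import List


def duplicate_item_ids(csv_text: str, column: str = "item_id") -> List[str]:
    """Return, in sorted order, every value of `column` that occurs
    more than once in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    counts = Counter(row[column] for row in reader)
    return sorted(item for item, n in counts.items() if n > 1)


# Hypothetical form definition in which item_1 appears twice.
sample = "item_id,definition\nitem_1,chien\nitem_2,chat\nitem_1,oiseau\n"
dupes = duplicate_item_ids(sample)
```

An empty result means every item_id in the file is unique.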

alvinwmtan commented 7 months ago

@HenryMehta Also, currently both "no" and "not yet/never" are mapped to "n". Can we disambiguate this? (e.g., "not yet" -> "x", "never" -> "v"). Also the same problem for "simple" and "sometimes" -> "s". (e.g., "simple" -> "e")
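The collision can be shown with a small sketch. It assumes the current codes are simply the first letter of each value, which is a guess based on the examples in this thread rather than the actual implementation, and it treats "not yet" and "never" as separate values. The explicit mapping uses the codes suggested above.

```python
from collections import defaultdict
from typing import Dict, List


def find_collisions(code_for: Dict[str, str]) -> Dict[str, List[str]]:
    """Group response values by code and keep only the codes that are
    shared by more than one value."""
    by_code: Dict[str, List[str]] = defaultdict(list)
    for value, code in code_for.items():
        by_code[code].append(value)
    return {code: vals for code, vals in by_code.items() if len(vals) > 1}


values = ["no", "not yet", "never", "sometimes", "often", "simple"]

# Assumed current scheme: first letter of each value.
first_letter = {value: value[0] for value in values}

# Explicit scheme using the codes suggested in this comment.
explicit = {
    "no": "n",
    "not yet": "x",
    "never": "v",
    "sometimes": "s",
    "often": "o",
    "simple": "e",
}

current_collisions = find_collisions(first_letter)
explicit_collisions = find_collisions(explicit)
```

With the first-letter scheme, "n" and "s" each stand for several values; with the explicit mapping, no code is shared.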

HenryMehta commented 7 months ago

@alvinwmtan Over the Christmas/New Year period, I want to have another go at allowing the 11 characters. I'm going to take a copy of the production database, rebuild the various migration files and try again, so I'm going to leave this for now

HenryMehta commented 7 months ago

@alvinwmtan French (French) WG updated

alvinwmtan commented 7 months ago

@HenryMehta Sorry, one more mistake—there were some items that had the same definition and were conflicting

FrenchFrenchWG_Tsuji_data.csv FrenchFrenchWG_Tsuji_fields.csv

HenryMehta commented 7 months ago

@alvinwmtan deployed to dev

alvinwmtan commented 7 months ago

@HenryMehta looks great!

HenryMehta commented 7 months ago

@alvinwmtan I think the only outstanding item now is the work to try to get back to 11 characters. I'm therefore going to start looking at that tomorrow rather than waiting for the Christmas break. I'll also redeploy everything to dev as a practice run for moving to production, which means I will be changing the dev database. I'll let you know how it is going.