langcog / wordbank

open repository of children's vocabulary data
http://wordbank.stanford.edu
GNU General Public License v2.0

arabic data #295

Closed mcfrank closed 5 months ago

mcfrank commented 1 year ago

import arabic data from https://github.com/langcog/ArabicCAT

alvinwmtan commented 11 months ago

Processed files: [ArabicSaudi_WG].csv [ArabicSaudi_WS].csv ArabicSaudiWG_Alroqi_data.csv ArabicSaudiWG_Alroqi_fields.csv ArabicSaudiWG_Alroqi_values.csv ArabicSaudiWS_Alroqi_data.csv ArabicSaudiWS_Alroqi_fields.csv ArabicSaudiWS_Alroqi_values.csv ArabicSaudiWS_JISH_data.csv ArabicSaudiWS_JISH_fields.csv ArabicSaudiWS_JISH_values.csv ArabicSaudi_notes.md

@mcfrank Just checking, this language should be labelled "Arabic (Saudi)"? And also will need contributors / citations for these data (:

mcfrank commented 11 months ago

Thanks! This is Arabic (Saudi), and the citation for the JISH data is the manual listed on the CDI website. For the other dataset, I just forwarded all the info I have.

Mike

alvinwmtan commented 11 months ago

JISH:
Contributor: Jeddah Institute for Speech and Hearing
Citation: Dashash, N., & Safi, S. (2014). JISH Arabic Communicative Development Inventory: Saudi population JACDI: User’s guide and technical manual. Jeddah: Jeddah Institute for Speech and Hearing.

Alroqi:
Contributors:
Haifa Alroqi, King Abdulaziz University
Alaa Almohammadi, King Abdulaziz University
Khadeejah Alaslani, Purdue University
Citation: TBD

HenryMehta commented 7 months ago

@alvinwmtan I've started on Arabic (Saudi).

A couple of problems. WS is too big to create a database row. There are 1079 items. The program creates a 15-character text field for each, and that makes the row too big for MySQL, which is the database we're using. I'm trying to find a solution but have made no progress yet (and I'm not confident).
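A rough sketch of why a row like this can overflow. It assumes the 15-character fields are VARCHAR columns in a 4-byte-per-character charset (utf8mb4); neither detail is stated in the thread, and the real table has other columns too. MySQL's documented row-size limit is 65,535 bytes. The function names are hypothetical.

```python
# Worst-case row size for a table made of many short text columns.
# ASSUMPTIONS (not from the thread): VARCHAR columns, utf8mb4 charset.

MYSQL_MAX_ROW_BYTES = 65_535  # documented MySQL row-size limit


def estimated_row_bytes(n_columns: int, chars_per_column: int,
                        bytes_per_char: int = 4) -> int:
    """Upper bound on the bytes that n VARCHAR(chars) columns contribute to a row."""
    max_data = chars_per_column * bytes_per_char
    # VARCHAR stores a 1-byte length prefix when the value can be at most
    # 255 bytes long, and a 2-byte prefix otherwise.
    length_prefix = 1 if max_data <= 255 else 2
    return n_columns * (max_data + length_prefix)


def fits_in_row(n_columns: int, chars_per_column: int,
                bytes_per_char: int = 4) -> bool:
    """True if the estimate stays within the MySQL row-size limit."""
    estimate = estimated_row_bytes(n_columns, chars_per_column, bytes_per_char)
    return estimate <= MYSQL_MAX_ROW_BYTES
```

Under these assumptions, `estimated_row_bytes(1079, 15)` comes out just above the limit, while one-character fields for the same 1079 items stay far below it. The estimate covers row size only; it says nothing about separate limits on the number of columns.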

WG has a new category (negation_words). I need to add this to the categories.csv file. I need to add it with a lexical_category and a lexical_class. I have used function_words for both for the time being as this seems to be used quite a lot.

Finally, some of the cells have "Understands ONLY, Understands & Says" in them. They should be one or the other. No cells have them reversed so I think this is the actual value. I can link these so that these result in produces BUT I will need to amend the file so these use a semi-colon instead of comma because the comma specifies a different field.
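A minimal sketch of the clean-up described above: swap the comma for a semicolon so the cell is not read as two fields, and treat a combined cell as "produces". The function names are hypothetical and this is not the actual import code.

```python
import re
from typing import Optional


def protect_commas(cell: str) -> str:
    """Replace commas inside a cell with semicolons so it stays one CSV field."""
    return "; ".join(part.strip() for part in cell.split(","))


def normalize_response(cell: str) -> Optional[str]:
    """Map a raw response cell to 'understands', 'produces', or None if blank.

    A cell listing both options (separated by a comma or a semicolon)
    is treated as 'produces', as proposed in the comment above.
    """
    parts = [p.strip().lower() for p in re.split(r"[;,]", cell) if p.strip()]
    if not parts:
        return None
    if any("says" in p for p in parts):
        return "produces"
    if all(p.startswith("understands") for p in parts):
        return "understands"
    raise ValueError(f"unrecognised response: {cell!r}")
```

For example, `normalize_response("Understands ONLY, Understands & Says")` resolves to `"produces"`, and an unexpected value raises an error rather than being guessed at.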

alvinwmtan commented 7 months ago

@HenryMehta

HenryMehta commented 7 months ago

@alvinwmtan

Arabic (Saudi) WG is now available to test.

I cannot load WS until we have a decision about whether we can use "u" instead of "understands" and "p" instead of "produces". This would need to apply across all datasets and would impact the shiny app, as previously mentioned.

alvinwmtan commented 7 months ago

(fixing by switching to "u" and "p", as in #298)
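One way the switch could look on the reading side, as a sketch only: accept both the long and the short spelling so that old and new instruments can coexist. This is illustrative Python with hypothetical names, not the code used by Wordbank or the shiny apps.

```python
# Two-way mapping between long response values and one-letter codes.
SHORT_CODES = {"understands": "u", "produces": "p"}
LONG_FORMS = {short: long for long, short in SHORT_CODES.items()}


def to_short(value: str) -> str:
    """Return 'u' or 'p' for either spelling; blank cells stay blank."""
    v = value.strip().lower()
    if v == "" or v in LONG_FORMS:
        return v
    return SHORT_CODES[v]  # raises KeyError on an unknown value


def to_long(value: str) -> str:
    """Return 'understands' or 'produces' for either spelling; blank stays blank."""
    v = value.strip().lower()
    if v == "" or v in SHORT_CODES:
        return v
    return LONG_FORMS[v]  # raises KeyError on an unknown value
```

Storing the short code and expanding it on read keeps each cell to a single character while downstream code can keep working with the long values.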

mcfrank commented 7 months ago

I endorse this suggestion since it may come up again and will generally save space. But we do need to update the shiny apps as noted. @mikabr may need to update. Will we need to change all instruments or are "understands" and "u" now both options?

HenryMehta commented 7 months ago

@alvinwmtan We still have an issue here. I am now getting a "Too many columns" error message. I've done some reading about this and I cannot increase any parameters to allow more fields. I therefore propose we split the Arabic (Saudi) WS into 2 files and hence 2 tables.

HenryMehta commented 7 months ago

> I endorse this suggestion since it may come up again and will generally save space. But we do need to update the shiny apps as noted. @mikabr may need to update. Will we need to change all instruments or are "understands" and "u" now both options?

@mcfrank @alvinwmtan For now I've applied it to French (French) WS plus all future instruments added

alvinwmtan commented 7 months ago

@HenryMehta Hm okay. Do you know what the column limit is?

HenryMehta commented 7 months ago

@alvinwmtan It's not actually that simple because it also depends on the column names. I could probably work it out but it would take some time. I think we should aim to keep the max to 750.

alvinwmtan commented 7 months ago

@HenryMehta Given that the size of the col names also matters, do you think it might be possible to retain the full table if we converted all the colnames to just numbers? That would reduce the size. If not I'll think about how to split the dataset up.

HenryMehta commented 7 months ago

@alvinwmtan We could try, but I don't know how many columns that would give us, and the names would actually need changing for every study because of the way the application works. We would need to change the code as well because column names are currently called 'item_xx', where xx is the column number. We could shorten the name to 'ixx', since column names must start with a letter.
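To put a number on the naming idea, the sketch below totals the characters used by column names under the two schemes mentioned ('item_xx' and 'ixx'). It assumes columns are numbered from 1, which the thread does not state, and it does not show whether the shorter names would get a table under the limit.

```python
def total_name_length(prefix: str, n_columns: int) -> int:
    """Sum of the lengths of the names prefix1, prefix2, ..., prefixN."""
    return sum(len(f"{prefix}{i}") for i in range(1, n_columns + 1))


def name_savings(old_prefix: str, new_prefix: str, n_columns: int) -> int:
    """Characters saved across all column names by switching prefixes."""
    old_total = total_name_length(old_prefix, n_columns)
    new_total = total_name_length(new_prefix, n_columns)
    return old_total - new_total
```

For 1079 columns, `name_savings("item_", "i", 1079)` is four characters per column, so the saving grows linearly with the number of items.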

alvinwmtan commented 7 months ago

@HenryMehta Here is one attempt: I've separated the words (WS) and all other item types (WSOther); WS still has >800 items but hopefully it will be okay. The WS from Alroqi is unchanged. Let me know if this split is still too large and I will find a different solution.

[ArabicSaudi_WS].csv [ArabicSaudi_WSOther].csv ArabicSaudiWS_JISH_data.csv ArabicSaudiWS_JISH_fields.csv ArabicSaudiWS_JISH_values.csv ArabicSaudiWSOther_JISH_data.csv ArabicSaudiWSOther_JISH_fields.csv ArabicSaudiWSOther_JISH_values.csv
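The split described above can be sketched as a partition of the item definitions into words and everything else. The `item_kind` key and the `"word"` value are assumptions made for illustration; the actual column names in the instrument files may differ.

```python
from typing import Dict, List, Tuple

Item = Dict[str, str]


def split_items(items: List[Item], kind_key: str = "item_kind",
                word_kind: str = "word") -> Tuple[List[Item], List[Item]]:
    """Partition items into (words, other), keeping the original order."""
    words = [item for item in items if item.get(kind_key) == word_kind]
    other = [item for item in items if item.get(kind_key) != word_kind]
    return words, other


def exceeds_target(items: List[Item], max_columns: int = 750) -> bool:
    """True if a table with one column per item would exceed the target size."""
    return len(items) > max_columns
```

The default of 750 is the target suggested earlier in the thread, so a words-only table with more than 800 items would still be flagged by `exceeds_target` even though it loaded successfully here.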

HenryMehta commented 7 months ago

@alvinwmtan You've split the JISH files but not the Alroqi

alvinwmtan commented 7 months ago

@HenryMehta I believe the Alroqi files are all still within "WS" (only the JISH had items that now fall in "WSOther")

HenryMehta commented 7 months ago

OK

HenryMehta commented 7 months ago

@alvinwmtan Deploying to dev now - will need about 40 minutes to load

mikabr commented 7 months ago

I've implemented allowing "u" and "p" values in wordbankr, but none of the Saudi Arabic tables seem to have those values, and the WSOther table seems to have zero rows (I'm connecting to wordbank2-dev-3).

alvinwmtan commented 7 months ago

@HenryMehta WS looks good, don't seem to see any WSOther data

HenryMehta commented 7 months ago

@alvinwmtan try now

alvinwmtan commented 7 months ago

@HenryMehta WS and WSOther look good. I realised I also failed to disambiguate some of the items in the WG; these should be de-conflicted now:

ArabicSaudiWG_Alroqi_data.csv ArabicSaudiWG_Alroqi_fields.csv

HenryMehta commented 7 months ago

@alvinwmtan You've re-introduced the cells with "understands only, understands & says" instead of just one. I have previously changed these to "understands & says". I have reapplied this change

alvinwmtan commented 7 months ago

@HenryMehta thanks for catching that; looks good to me now!