grambank / rgrambank

R package to access and analyse Grambank's CLDF data
Apache License 2.0
3 stars 1 forks source link

refactor language_level_df #17

Closed SimonGreenhill closed 1 year ago

SimonGreenhill commented 1 year ago

Remove the hardcoded table from glottolog. We should decouple this because glottolog evolves, and we don't want to be updating this every single time glottolog releases a new version. The package here has a function to easily load_glottolog() so we can make users use that e.g.

HedvigS commented 1 year ago

Okay, I see.

Do you want language_level_df() to continue living here in rgrambank or move over to rcldf? The function is written so that if there's a ValueTable and a LanguageTable with the ID and Language_ID/Language_level_ID() it'll work, regardless if it's grambank or some other cldf-dataset with ValueTable.

xrotwang commented 1 year ago

AFAIKT, the function also relies on Language_ID in ValueTable being Glottocodes. This is an assumption that's not generally true for CLDF datasets, so I'd say the function cannot move to rcldf.

HedvigS commented 1 year ago

AFAIKT, the function also relies on Language_ID in ValueTable being Glottocodes. This is an assumption that's not generally true for CLDF datasets, so I'd say the function cannot move to rcldf.

Currently, it assumes that Language_ID in ValueTable matches Glottocode in LanguageTable. This would not be true for other cldf-datasets. If @SimonGreenhill would be happy to support moving language_level_df() to rcldf, I'll rejig that. i'll probably rejig that anyway, but it'd be good to know if it'd stay here or move.

It's not difficult to change this, I'll take a stab at it right now.

xrotwang commented 1 year ago

@HedvigS the other reason why this function shouldn't move to rcldf is that there is no notion of "Language_level_ID" in CLDF. rcldf should be concerned only with columns that are specified in https://cldf.clld.org/v1.0/terms.html (And while Language_level_ID can be considered a "languageReference" property, the semantics of this relation are not fixed in CLDF if the reference appears in a non-standard place. So a "languageReference" column in ValueTable relates the value to a language it applies to, but a "languagereference" in LanguageTable may be different things, the direct parent, or the top-level family, ...)

HedvigS commented 1 year ago

@HedvigS the other reason why this function shouldn't move to rcldf is that there is no notion of "Language_level_ID" in CLDF. rcldf should be concerned only with columns that are specified in https://cldf.clld.org/v1.0/terms.html (And while Language_level_ID can be considered a "languageReference" property, the semantics of this relation are not fixed in CLDF if the reference appears in a non-standard place. So a "languageReference" column in ValueTable relates the value to a language it applies to, but a "languagereference" in LanguageTable may be different things, the direct parent, or the top-level family, ...)

Yes, you're right. I was thinking a bit about this last night. I was thinking that the function could be modified so if the LanguageTable of a given CLDF-dataset doesn't contain a Language_level_ID type column, it could first match ID to Glottocode in that LanguageTable and then use another reference table to get Language_level_ID for each Glottocode, for example just glottolog-cldf. That's something I think would improve the function, whether it lives in rgrambank or rcldf. Regardless of where it lives, I know that I in future would appreciate such a function. If @SimonGreenhill decides that it continues to live in rgrambank, that's okay I guess I'll just use rgrambank functions on non-grambank cldf-data. I use functions from psychology and biology on linguistics data already, so even if it's a bit funny it'll be okay :)

SimonGreenhill commented 1 year ago

done in #22