grambank / rgrambank

R package to access and analyse Grambank's CLDF data
Apache License 2.0

language_level_df and drop_duplicate_tips_random #19

Closed. HedvigS closed this issue 4 months ago.

HedvigS commented 1 year ago

drop_duplicate_tips_random() in #15 drops all but one of any tips in a tree that share the same glottocode. It has an argument merge_dialects which can be set to TRUE; if so, it replaces dialect tips with the glottocode of their parent language and then does the duplicate dropping. If FALSE, it takes the glottocodes of the dialects as they are and drops all but one tip if a given dialect glottocode occurs more than once.
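For reference, here's a minimal sketch of the dropping logic (not the actual code in #15; the helper name and the assumption that tip labels are glottocodes are just for illustration):

```r
library(ape)

# Hypothetical helper illustrating "keep one tip per glottocode, chosen at
# random". With merge_dialects = TRUE you would first relabel dialect tips
# with their parent language's glottocode, then run the same dropping step.
drop_one_per_glottocode <- function(tree) {
  dup_codes <- unique(tree$tip.label[duplicated(tree$tip.label)])
  tips_to_drop <- unlist(lapply(dup_codes, function(gc) {
    tips <- which(tree$tip.label == gc)
    setdiff(tips, sample(tips, 1))  # keep one occurrence at random
  }))
  if (length(tips_to_drop) > 0) {
    tree <- ape::drop.tip(tree, tips_to_drop)
  }
  tree
}
```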

language_level_df() currently doesn't have this option; it just merges dialects whether you want it to or not. I think that's probably fine, but I thought I'd check whether anyone has a use case where they want to drop duplicates but NOT merge dialects into one language.
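For comparison, the merging that language_level_df() does amounts to something like this at the data-frame level (the column names here are assumptions for illustration, not necessarily Grambank's CLDF column names):

```r
library(dplyr)

# Hypothetical sketch: replace each dialect's glottocode with its parent
# language's glottocode, then keep one row per language-level glottocode.
# Assumes a "Language_level_ID" column that is NA for language-level entries.
merge_dialects_df <- function(languages_df) {
  languages_df %>%
    mutate(Glottocode = coalesce(Language_level_ID, Glottocode)) %>%
    distinct(Glottocode, .keep_all = TRUE)
}
```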

SimonGreenhill commented 1 year ago

ok, reopen if anyone wants this use case (and can show they need it rather than "it might be nice")

HedvigS commented 1 year ago

Well, it happened. @lLEK85 asked me in a chat for this functionality ^^!

HedvigS commented 1 year ago

Apparently, for Numeralbank, it's nice to trim to unique languages and dialects and not merge dialects.

SimonGreenhill commented 1 year ago

ok, but we need more tests for this function -- especially for different function parameter values.

xrotwang commented 1 year ago

I can sort of see the use case to drop "duplicates", but I can't really see why one would want to do this by keeping one random instance. If the same language shows up in different parts of a tree, and I want to use the tree together with other data, doesn't this require a conscious decision regarding which tree node to identify with other data for the language? And even if that decision is "take one at random", it should be part of the analysis code and not just the default of a utility function.

HedvigS commented 1 year ago

> ok, but we need more tests for this function -- especially for different function parameter values.

Sure thing. If you're in favour of implementing this, I'll get going on modifying the code and writing more tests when I can.

HedvigS commented 1 year ago

> I can sort of see the use case to drop "duplicates", but I can't really see why one would want to do this by keeping one random instance. If the same language shows up in different parts of a tree, and I want to use the tree together with other data, doesn't this require a conscious decision regarding which tree node to identify with other data for the language? And even if that decision is "take one at random", it should be part of the analysis code and not just the default of a utility function.

Well, there are options. For example, with the Austronesian tree from 2009, I went through every case of duplicates in the Oceanic clade and looked at the ABVD supporting data. I manually selected which duplicate tips to drop, based on poor-quality word lists or too few words. With the EDGE-tree, however, I don't really have that insight, so I'd rather pick at random. It's an analytical decision: dropping duplicate tips at random isn't going to be the best choice for all users, and in those cases they shouldn't use the function. That's why it's good to have "random" in the function name, so it signals this to them. I trust users to make the best decisions given their situation.

For the grambank release paper, we used the EDGE-tree and dropped duplicate tips at random, in the manner of the function I'm suggesting here in #15. It's something I foresee us doing for some other projects as well, but, as you point out, not all.

I've seen other analysis code where people use functions that remove all but the first duplicate that's encountered. That is worse than random in most cases, I think. I therefore think providing a handy function for doing this is warranted.
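To illustrate the contrast (made-up glottocodes; neither snippet is the package's code):

```r
labels <- c("abcd1234", "abcd1234", "efgh5678")  # made-up glottocodes

# Keep the first duplicate encountered: silently depends on tip order.
keep_first <- !duplicated(labels)

# Keep one duplicate at random: the arbitrariness is explicit, and
# reproducible if the analysis sets a seed.
set.seed(42)
keep_random <- logical(length(labels))
for (gc in unique(labels)) {
  idx <- which(labels == gc)
  keep_random[idx[sample.int(length(idx), 1)]] <- TRUE
}
```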

xrotwang commented 1 year ago

> I've seen other analysis code where people use functions that remove all but the first duplicate that's encountered. This is worse than random in most cases I think.

That's kind of my point: The procedure used to drop tips should be part of the analysis code so it can easily be spotted and reviewed by a reader - rather than being hidden behind a function in a package, where the justification boils down to "we used function x from y".

HedvigS commented 1 year ago

> > I've seen other analysis code where people use functions that remove all but the first duplicate that's encountered. This is worse than random in most cases I think.

> That's kind of my point: The procedure used to drop tips should be part of the analysis code so it can easily be spotted and reviewed by a reader - rather than being hidden behind a function in a package, where the justification boils down to "we used function x from y".

I disagree. I think that tasks that need to be done often should be streamlined into a few functions, where we trust the package maintainers and reviewers, and that this outweighs any concerns about the decisions being "hidden".

Like I said, not all cases of duplicate tips in language trees call for this function. For the cases where it's relevant, I think it's a handy function to have.

HedvigS commented 1 year ago

In addition, the code for drop_duplicate_tips_random is fairly legible to a medium-level R user and can easily be inspected with `getAnywhere()`. Furthermore, the function name and argument names are pretty semantically transparent.
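For example (assuming the package installs under the repo name rgrambank):

```r
library(rgrambank)
getAnywhere("drop_duplicate_tips_random")  # prints the function's source
```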

I don't see the hindrance to a reviewer.

@SimonGreenhill will decide over in #15

HedvigS commented 9 months ago

If #15 and #25 are merged, I think we can close this issue.