Closed HedvigS closed 4 months ago
ok, reopen if anyone wants this use case (and can show they need it rather than "it might be nice")
Well, it happened. @lLEK85 asked me in a chat for this functionality ^^!
Apparently, for Numeralbank, it's nice to trim to unique languages and dialects and not merge dialects.
ok, but we need more tests for this function -- especially for different function parameter values.
I can sort of see the use case to drop "duplicates", but I can't really see why one would want to do this by keeping one random instance. If the same language shows up in different parts of a tree, and I want to use the tree together with other data, doesn't this require a conscious decision regarding which tree node to identify with other data for the language? And even if that decision is "take one at random", it should be part of the analysis code and not just the default of a utility function.
ok, but we need more tests for this function -- especially for different function parameter values.
Sure thing, if you're for implementing this I'll get going on modifying the code and making more tests when I can.
I can sort of see the use case to drop "duplicates", but I can't really see why one would want to do this by keeping one random instance. If the same language shows up in different parts of a tree, and I want to use the tree together with other data, doesn't this require a conscious decision regarding which tree node to identify with other data for the language? And even if that decision is "take one at random", it should be part of the analysis code and not just the default of a utility function.
Well, there are options. For example, for me with the Austronesian tree from 2009, I went and looked at every case of duplicates in the Oceanic clade and looked at the ABVD supporting data. I manually selected duplicate tips to drop based on not good quality word lists or too few words. However, with the EDGE-tree I don't really have that insight so I'd rather pick at random. It's an analytical decision, it's not going to be best to drop at random duplicate tups for all users and in those cases, they shouldn't use the function. That's why it's good to have "random" in the function name so it signals this to them. I trust users to make the best decisions given their situation.
For the grambank release paper, we used the EDGE-tree and we dropped duplicate tips at random in the manner of the function I'm suggesting here #15 . It's something I forsee us doing for some other projects as well, but as you point out not all.
I've seen other analysis code where people use functions that remove all but the first duplicate that's encountered. This is worse than random in most cases I think. I think providing a handy function for doing this therefore is warranted.
I've seen other analysis code where people use functions that remove all but the first duplicate that's encountered. This is worse than random in most cases I think.
That's kind of my point: The procedure used to drop tips should be part of the analysis code so it can easily be spotted and reviewed by a reader - rather than being hidden behind a function in a package, where the justification boils down to "we used function x from y".
I've seen other analysis code where people use functions that remove all but the first duplicate that's encountered. This is worse than random in most cases I think.
That's kind of my point: The procedure used to drop tips should be part of the analysis code so it can easily be spotted and reviewed by a reader - rather than being hidden behind a function in a package, where the justification boils down to "we used function x from y".
I disagree. I think that things that need to be done often should be effectivized into a few functions where we trust the package maintainers and reviewers and that this outweighs any concerns about the decisions being "hidden".
Like I said, not all cases of duplicate tips in language trees should use this function. For those cases where it's relevant I think this is a handy function to have.
In addition, the particular code for the function drop_duplicate_tips_random
is fairly legible to a medium-level-R-user and can easily be inspected with ´getAnywhere ´. Furthermore, the name of the function and argument names are pretty semantically transparent.
I don't see the hinderance to a reviewer.
@SimonGreenhill will decide over in #15
If #15 #25 are merged, I think we close this issue.
drop_duplicate_tips_random()
in #15 drops all but one tip in trees with the same glottocode. It has an argumentmerge_dialects
which can be set to TRUE, and if so it replaces tips with the glottocode of their parent language and then does the duplicate dropping. If FALSE, it just takes the glottocodes of the dialects straight and keeps all but one if there are duplicates of that dialect.language_level_df()
currently doesn't have this option, it just merges dialects whether you want or not. I think that's probably fine, but I thought I'd check if anyone has a use case where they want to drop duplicates but NOT merge dialects into one language.