grambank / rgrambank

R package to access and analyse Grambank's CLDF data
Apache License 2.0
3 stars 1 forks source link

Function for pruning missing data #37

Closed HedvigS closed 5 months ago

HedvigS commented 9 months ago

I would like to propose a function that removes features and languages in a ValueTable that don't meet a certain cut-off for missing data. The approach is similar but not identical to what we did in grambank-analysed for the release paper. I think this approach is more elegant, easier to understand and overall better. Instead of doing it step-wise, the cropping is done in one sweep in one pipe.

I couldn't get the testthat test to run without using library() at the start and source the relevant script with the function. I welcome help in remedying this. For now I've commented out the library and sourcing in the test script.

xrotwang commented 9 months ago

Removing data based on two orthogonal criteria is problematic, I think. Naively, I'd assume that after pruning with cutoffs x and y, all languages in the dataset will have a coverage >= x% features, and all features will have a coverage >= y% languages. However, that isn't necessarily true, because whatever is done last may also remove datapoints which were necessary for the other threshold to be cleared.

So I'd think, the only way pruning based on low coverage can be supported in a transparent way is by having two functions, one for each criterion, and letting the user deal with issues due to applying both.

HedvigS commented 9 months ago

Removing data based on two orthogonal criteria is problematic, I think. Naively, I'd assume that after pruning with cutoffs x and y, all languages in the dataset will have a coverage >= x% features, and all features will have a coverage >= y% languages. However, that isn't necessarily true, because whatever is done last may also remove datapoints which were necessary for the other threshold to be cleared.

So I'd think, the only way pruning based on low coverage can be supported in a transparent way is by having two functions, one for each criterion, and letting the user deal with issues due to applying both.

The number of languages per feature and features per language are both calculated before any cropping. The number of features per language is not calculated again after features with a lot of missing data have been removed. I think this is the intuitive way of going about this process, and I think splitting this function in twain would cause confusion.

For functions in rgrambank I strive to make them easy to use for our known users and as similar to what we did in the release paper (grambank-analysed scripts) as possible, since many of our colleagues want to write "I followed the same principles as in Skirgård et al 2023".

I think this function suggested in this PR is intuitive, I think splitting would be harder for many users. I will ask in RDM helpdesk what behaviour other users would expect.

@SimonGreenhill is the final arbiter of functions in rgrambank, I will leave it to him ultimately.

HedvigS commented 9 months ago

To illustrate, here's a code snipped visualising the before and after this function that is suggested here

library(tidyverse)
library(Amelia)
library(rcldf) #package for reading in CLDF-datasets
library(rgrambank) #package with functions for common tasks with grambank-data
library(reshape2) # not necessary package for grambank, but used below for making glottolog ValueTable wide before joining to Grambank as an example.

grambank_cldf_object <- rcldf::cldf(mdpath = "https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip", 
                                    load_bib = FALSE) 

before_cropping <- grambank_cldf_object$tables$ValueTable %>% 
    filter(Value != "?") %>% 
    reshape2::dcast(Language_ID ~ Parameter_ID, value.var = "Value") %>% 
    Amelia::missmap()

source("../rgrambank/R/crop_missing_data.R")

grambank_ValueTable_cropped <- grambank_cldf_object$tables$ValueTable %>% crop_missing_data()

after_cropping <- grambank_ValueTable_cropped %>% 
    reshape2::dcast(Language_ID ~ Parameter_ID, value.var = "Value") %>% 
    Amelia::missmap()

Before cropping, the data-set has 25% missing data, 2,467 languages and 195 features. After cropping, the data-set has 4% missing data, 1,526 languages and 108 features.

Before Rplot_before_crop

After

Rplot_after_crop

HedvigS commented 9 months ago

@SimonGreenhill perhaps we should not have this function at all and instead encourage users to crop as they see fit for each function necessary. In the meantime, I'll implement an approach to make_theo_scores() that crops per set of features missing values. I'll make a PR for it in a sec.

HedvigS commented 5 months ago

It seems @SimonGreenhill doesn't want this function from the grambank release paper in rgrambank, I'm closing the PR.