grambank / rgrambank

R package to access and analyse Grambank's CLDF data
Apache License 2.0

new function: make binary #4

Closed HedvigS closed 1 year ago

HedvigS commented 1 year ago

Adaptation of https://github.com/grambank/grambank-analysed/blob/main/R_grambank/make_wide.R that takes the language and correctly binarises the 195 features into 201 features.

HedvigS commented 1 year ago

We have made a change to the Grambank coding workflow so that it is possible to code the binarised version of multistate features directly at the input. This means that in the next release, we will have two kinds of binarised features:

a) values that have been derived from multistate features, through a script like the one linked above (the majority of coding)
b) values that have been coded in a binary fashion already at the start by a coder

(b) values will be better in quality than (a) because they can accommodate uncertainty and absence in a better way.

There are several ways we can go ahead here; the path of least resistance, to me, seems to be:

1) check if raw binarised coding exists, if so use that and ignore the multistate equivalents
2) if (1) isn't available, do the binarisation as in the r-script before.

This would result in a bit of a mixed bag of values, some of higher quality than others. If we prefer them to be the same quality, we should just use the values as derived from the multistate features.
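A minimal sketch of the two options above, written in Python purely for illustration and not taken from the package. The function name `merge_binarised`, the `mix` flag, the dictionary layout and the language IDs are all hypothetical. With `mix=True` a directly coded value wins whenever it exists and derived values fill the rest; with `mix=False` only values derived from multistate features are used.

```python
def merge_binarised(raw_coded, derived, mix=True):
    """Combine directly coded and derived binarised values.

    Both inputs are dicts keyed by (language_id, feature_id).
    mix=True  -> step (1) then step (2): a raw value wins when it exists.
    mix=False -> use only values derived from multistate features.
    Returns {(language_id, feature_id): (value, source)}.
    """
    # Start from the derived values (step 2 of the proposal).
    merged = {key: (value, "derived") for key, value in derived.items()}
    if mix:
        # Step 1: directly coded values override their derived equivalents.
        for key, value in raw_coded.items():
            merged[key] = (value, "raw")
    return merged


# Hypothetical example data; "lang_a" and "lang_b" are made-up IDs.
derived = {("lang_a", "GB024a"): "1", ("lang_a", "GB024b"): "0"}
raw_coded = {("lang_a", "GB024b"): "?", ("lang_b", "GB024a"): "0"}

mixed = merge_binarised(raw_coded, derived)
uniform = merge_binarised(raw_coded, derived, mix=False)
```

Keeping the source label next to each value lets a user filter the mixed result afterwards, for instance to keep only the derived values.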

SimonGreenhill commented 1 year ago

How will these differences be implemented in grambank itself? Will the values of type "b" have different labels (e.g. GBXXbinary vs GBXX)?

Rather than having both systems hanging around it would be better if we can transition to just using the best one (b) and letting the 'binarisation' code die. Can we do this, or do we need to be able to do (a) still?

HedvigS commented 1 year ago

> How will these differences be implemented in grambank itself? Will the values of type "b" have different labels (e.g. GBXXbinary vs GBXX)?

Right now, the way I wrote them in the binarise function #12, they are named the same. It's up to the user to choose whether to mix them or only use the ones derived from multistate.

> Rather than having both systems hanging around it would be better if we can transition to just using the best one (b) and letting the 'binarisation' code die. Can we do this, or do we need to be able to do (a) still?

In Grambank v1.0 we have 12,594 datapoints that are coded in the multistate version of the features. If we want to use these in a binarised way in future, we need a function like the one I suggest here. RG decided against going back and recoding these. From now on, we are only coding binarised features from scratch for new languages. If you want to reverse that decision, please message RG.

HedvigS commented 1 year ago

Please note as well that this function makes the same assumption as the release paper, i.e. that GB024:1 -> GB024a:1 & GB024b:0. Absence (0) is assumed for GB024b, not unknown (?).
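As a rough illustration of that assumption, here is a small Python sketch (not the package's R code). Only the GB024:1 row is stated in this thread; the rows for 2, 3 and ? are assumptions added to make the example complete, and the names `GB024_TO_BINARY` and `binarise_gb024` are hypothetical.

```python
# Mapping from one multistate GB024 value to (GB024a, GB024b).
GB024_TO_BINARY = {
    "1": ("1", "0"),  # from the thread: GB024b is coded absent (0), not "?"
    "2": ("0", "1"),  # ASSUMPTION: mirror image of the "1" case
    "3": ("1", "1"),  # ASSUMPTION: both binary features present
    "?": ("?", "?"),  # ASSUMPTION: unknown stays unknown for both
}


def binarise_gb024(value):
    """Return the two binarised values for one multistate GB024 value."""
    if value not in GB024_TO_BINARY:
        raise ValueError(f"unexpected GB024 value: {value!r}")
    a, b = GB024_TO_BINARY[value]
    return {"GB024a": a, "GB024b": b}


result = binarise_gb024("1")
```

A directly coded GB024b value could instead be "?", which is the information the derived value cannot carry.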