grambank / pygrambank


How to treat ? in the metadata JSON #116

Closed HedvigS closed 11 months ago

HedvigS commented 11 months ago

In Grambank, we code ? when we have tried to code, but the description is not sufficient (either the phenomenon is not described at all, or it is described but not in enough detail for us to make a call). This makes ? different from "pure" missing data. Sometimes it is interesting for us to differentiate the two; this is, for example, relevant for the gramgaps project.

The metadata JSON of the Grambank CLDF dataset currently specifies ? as a type of NULL. I think this may be problematic. If, for example, @SimonGreenhill implements adjustments in rcldf that look at the types in the JSON and convert ? to NULL, we'd lose this distinction.
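
Concretely, the metadata marks ? as a null value in the CSVW column specification, along these lines (an illustrative snippet, not the exact Grambank column definition):

```json
{
  "name": "Value",
  "datatype": "string",
  "null": ["?"]
}
```

A conforming CSVW/CLDF reader is expected to map any cell matching a string in "null" to NULL/NA when parsing the CSV, which is exactly the conversion at issue here.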

What are the alternatives here?

xrotwang commented 11 months ago

See https://github.com/grambank/rgrambank/pull/36#discussion_r1420768884

HedvigS commented 11 months ago

If, for example, @SimonGreenhill implements https://github.com/SimonGreenhill/rcldf/issues/14, we'd lose this distinction.

UPDATE:

If, for example, @SimonGreenhill implements https://github.com/SimonGreenhill/rcldf/issues/14, we'd have the distinction:

- no row = "pure" missing data
- NULL value = ? coding

I personally think this distinction is more confusing than keeping ? as ?. I don't think ? should be turned into NULL/NA.
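
To make the difference concrete, here is a minimal sketch in R (toy IDs and feature codes, purely illustrative):

```r
library(tibble)

# Toy long-format value table. "aaa1234" was coded "?" for GB020 (a coder
# looked at the sources but couldn't make a call); "bbb5678" has no GB020
# row at all ("pure" missing data: nobody has attempted to code it).
values <- tribble(
  ~Language_ID, ~Parameter_ID, ~Value,
  "aaa1234",    "GB020",       "?",
  "aaa1234",    "GB021",       "1",
  "bbb5678",    "GB021",       "0"
)

# If a reader applies the metadata's null spec, "?" becomes NA:
values$Value[values$Value == "?"] <- NA

# Now "attempted but undecidable" (a row with NA) and "never coded"
# (no row at all) can only be told apart by row existence, not by value.
```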

xrotwang commented 11 months ago

See https://github.com/grambank/rgrambank/pull/36#discussion_r1420826064

xrotwang commented 11 months ago

FWIW D-PLACE does include pseudo-categories MISSING DATA FOR VARIABLE X. But they are - well - pseudo-categories, which I find harder to explain and make transparent than null values.

HedvigS commented 11 months ago

The Grambank dataset was defined with ? as a particular category, different from missing. The scripts we wrote in grambank-analysed have this in mind, as does ongoing code in gramgaps. Given the kinds of manipulations that are common, such as making the data wide, I think it is good to make the user explicitly aware of the difference between ? and a missing row in the long ValueTable. It should be a conscious decision the user is confronted with, and turning ? into NA/NULL obfuscates that, I think.

I understand @xrotwang's concern from a CS/database-design and CLDF perspective, I do. However, Grambank is an ongoing project with particular codes that differentiate ? from a missing row, in such a way that it would be confusing for users who are already interacting with Grambank if ? were turned into NA/NULL.

xrotwang commented 11 months ago

@HedvigS you are free to interpret the ? in the CSV however you want - i.e. disregard that the CLDF metadata says it's null. But I'm strongly against removing the designation of ? as null in the metadata. ? is not a well-defined category, and even less some sort of valid value for binary variables.

HedvigS commented 11 months ago

> @HedvigS you are free to interpret the ? in the CSV however you want - i.e. disregard that the CLDF metadata says it's null. But I'm strongly against removing the designation of ? as null in the metadata. ? is not a well-defined category, and even less some sort of valid value for binary variables.

If ? as null in the metadata is still the case for the next Grambank release, then packages like rcldf will (hopefully) respect that when making the tables, and ? will turn into NA. For some Grambank purposes, this will get confusing. We'll have to write silly things like "if NA/NULL in the long table, turn it into ?" after rcldf::cldf has read in the dataset and rendered the tables. To me, that's not very nice - especially considering that ? has been a defined category since the start of Grambank.
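
Something like this sketch, assuming rcldf exposes the value table as ds$tables$ValueTable (the accessor name and the path here are assumptions on my part):

```r
library(rcldf)

# Hypothetical workaround: read a Grambank CLDF release with rcldf and undo
# the null conversion. Any row that exists in ValueTable but has NA as its
# Value must have been a "?" coding, because "pure" missing data has no row.
ds <- cldf("path/to/grambank/cldf/StructureDataset-metadata.json")
values <- ds$tables$ValueTable  # accessor name is an assumption

values$Value[is.na(values$Value)] <- "?"
```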

Ever since the stabilisation of the Grambank questionnaire, we have explicitly, throughout the entire data-gathering process, treated ? as a value for any Grambank feature that means "attempted to answer, but the specific description(s) in the source field didn't contain the information, or enough information, to make a call". I don't know if that is well-defined enough for you, @xrotwang, but that is the way it has worked since the inception of Grambank.

xrotwang commented 11 months ago

It's not well-defined, because it doesn't mean that all languages with a ? for feature X have anything in common. Thus, it is not a category like "Language has Y" or "Language has no Y".

So I'd argue that obfuscation happens when ? looks like just another category and isn't explicitly labeled as "cannot be used for analysis without a conscious decision". The way to explicitly label data as "cannot be used for analysis without a decision" is NA, or null, or None - whatever this concept is called in the computing environment.

Making long data wide is exactly the type of operation where such a decision must be made. So reintroducing ? as a category at this point would seem absolutely appropriate, if that makes sense.
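
Roughly like this, as a sketch using tidyr (toy data; the recoding line is the conscious decision):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long table after the null conversion: the NA row was a "?" coding;
# there is no GB020 row at all for "bbb5678" ("pure" missing data).
values <- tribble(
  ~Language_ID, ~Parameter_ID, ~Value,
  "aaa1234",    "GB020",       NA,
  "aaa1234",    "GB021",       "1",
  "bbb5678",    "GB021",       "0"
)

wide <- values %>%
  mutate(Value = ifelse(is.na(Value), "?", Value)) %>%  # reintroduce "?" deliberately
  pivot_wider(names_from = Parameter_ID, values_from = Value)

# wide:
#   Language_ID GB020 GB021
#   aaa1234     ?     1
#   bbb5678     NA    0
# Cells for pairs that never had a row come out as NA; "?" is explicit again.
```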

HedvigS commented 11 months ago

If that is the criterion for well-defined, then I agree it is not. If the criterion is "has a consistent definition", then it has a definition. I couldn't tell if you knew the definition.

Because ? is not a value like 1, 0, etc., that is exactly why, in the grambank-analysed code, it does get turned into NA/NULL in the many situations where that is appropriate.

What I'm trying to say is that if the situation remains as is, there'll be some changes to already existing behaviour, and that can cause confusion. If we stick with ? as NULL in the metadata and rcldf learns to take this into account, we'll have to tell existing users about it explicitly. It should be fine - I think I know everyone for whom this would matter - but since it was news to me, it seemed that the most conservative thing would be not to change this assumption.

I'll get going on changes over in gramgaps regarding this, so we don't get caught out by rgrambank changes.

xrotwang commented 11 months ago

Yes, that is the criterion for well-defined. That's the whole point of standards like CLDF: assumptions should be made explicit and enforceable. The assumption for categorical variables is that each datapoint with a value for the variable is tied to one category, and that categories partition the objects under study into meaningful groups.

With a pseudo-category represented by the value ?, no CLDF tool would be able to infer that most of the Grambank features are binary.

HedvigS commented 11 months ago

Right, okay. That is good to know; I did not know that. Every time we deconstruct something like this, I learn something more about the assumptions that go into CLDF. I assumed that the codes should be values that occur and that have a consistent definition. To me, that didn't necessitate that the values be of the same kind. I thought we could have a definition in the metadata that made clear what the definition is, and that CLDF tools would be able to understand it.

Relatedly, I've noticed that some new users of Grambank don't treat the multi-value features correctly, thinking that 1, 2 and 3 are all equally different from each other. This isn't the case: 3 is "both" 1 and 2. I don't know if that also violates CLDF assumptions about values.

xrotwang commented 11 months ago

I think, categories for values of the same feature being "of the same kind" isn't a particular CLDF assumption. It's how categorical variables are commonly understood. CLDF just provides the framework to model data to support this assumption - e.g. by having a way to flag datapoints with NA values rather than co-opting categories for an orthogonal aspect.

CLDF tools can only make sense of whatever is machine readable. So they can know which values are listed as meaning null. But they can't read category descriptions and infer whether one category is special - and they don't interpret informal conventions like ?.
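
For instance, the machine-readable null spec can be picked up along these lines (a sketch; the file name is an assumption, the structure follows CSVW):

```r
library(jsonlite)

# Sketch: list which string values each column of a CLDF dataset declares
# as null. "tables", "tableSchema", "columns" and the "null" property are
# standard CSVW; the metadata file name is assumed.
md <- fromJSON("StructureDataset-metadata.json", simplifyVector = FALSE)

for (tbl in md$tables) {
  for (col in tbl$tableSchema$columns) {
    if (!is.null(col[["null"]])) {
      cat(tbl$url, "/", col$name, ": null =", unlist(col[["null"]]), "\n")
    }
  }
}
```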

The "1,2,3" categories may not be easy to use, because of some internal dependency. But if 1 means "1 but not 2" and 2 means "2 but not 1", they seem to be well-defined, i.e. looking at a datapoint, it isn't ambiguous which bin it should go in. I'm not aware of formal specifications for such dependencies, though, or tools that would make use of such specs to determine analysis behaviour.

HedvigS commented 11 months ago

> I think, categories for values of the same feature being "of the same kind" isn't a particular CLDF assumption. It's how categorical variables are commonly understood.

Hm, okay. Since we defined ? from the start as a value somewhat "alongside" 0, 1, 2 etc. in the data-gathering process, that's not something I thought of in that way. For most analysis, it's true, I treat ? as different from the others of course and convert it to NA when necessary, but I also manipulate the data to take into account the special nature of 3 and 0 for multistate features. If all is on the table, then I can make the calls. If you think that it is better to treat ? as NULL in the metadata JSON, and therefore possibly "merge" missing rows and ? in some circumstances, then let's just continue to do that. I'll make the adjustments where I need to, and I hope it'll be easy for other users to comprehend.

> CLDF just provides the framework to model data to support this assumption - e.g. by having a way to flag datapoints with NA values rather than co-opting categories for an orthogonal aspect.

> CLDF tools can only make sense of whatever is machine readable. So they can know which values are listed as meaning null. But they can't read category descriptions and infer whether one category is special - and they don't interpret informal conventions like ?.

> The "1,2,3" categories may not be easy to use, because of some internal dependency. But if 1 means "1 but not 2" and 2 means "2 but not 1", they seem to be well-defined, i.e. looking at a datapoint, it isn't ambiguous which bin it should go in. I'm not aware of formal specifications for such dependencies, though, or tools that would make use of such specs to determine analysis behaviour.

True! Still, I've seen people treat 3 as if it differs from 1 in the same way that 1 differs from 2. That is not the case.

In the grambank-analysed scripts and in rgrambank we have explicit procedures for dealing with multistate features and turning them into sets of binary features. In the grambank-analysed scripts it's done here: https://github.com/grambank/grambank-analysed/blob/main/R_grambank/make_wide_binarized.R, and in rgrambank we've got the function https://github.com/grambank/rgrambank/blob/main/R/binarise.R that does the same job.
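
For illustration, here is a minimal sketch of the kind of recoding involved (not the actual code in those scripts; the feature semantics and column names are made up):

```r
library(dplyr)
library(tibble)

# Sketch: binarise a multistate feature where 1 = "strategy A only",
# 2 = "strategy B only", 3 = "both A and B", 0 = "neither".
multistate <- tribble(
  ~Language_ID, ~Value,
  "aaa1234",    "1",
  "bbb5678",    "3",
  "ccc9012",    "0"
)

binarised <- multistate %>%
  mutate(
    has_A = ifelse(Value %in% c("1", "3"), "1", "0"),
    has_B = ifelse(Value %in% c("2", "3"), "1", "0")
  )

# "bbb5678" (coded 3) gets 1 for both derived binary features - treating
# 1, 2 and 3 as three equally distinct categories would misread this.
```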

I know that similar feature value dependencies also occur in APiCS and WALS. I don't know what happens there with CLDF tools.

xrotwang commented 11 months ago

> I know that similar feature value dependencies also occur in APiCS and WALS. I don't know what happens there with CLDF tools.

As I said, nothing happens there, because CLDF tools rely on what's formally specified in the metadata and I'm not aware of a standard to formally specify dependencies between values of variables.

HedvigS commented 11 months ago

> I know that similar feature value dependencies also occur in APiCS and WALS. I don't know what happens there with CLDF tools.

> As I said, nothing happens there, because CLDF tools rely on what's formally specified in the metadata and I'm not aware of a standard to formally specify dependencies between values of variables.

Okay. That is good to know. I don't use most CLDF tools, so I haven't encountered this issue, but when working on my own with these datasets it's something I consider and make adjustments for.

xrotwang commented 11 months ago

As I said earlier, the point of a specification like CLDF is making common, shared assumptions explicit and enforceable. If assumptions about dependencies between values become more common, and there's a shared understanding of what they are and how tools should treat them, we might have to think about a formal way to specify them, so they can become part of CLDF.

As far as I can tell, none of this has happened yet. So people using CLDF tools will have to use the same out-of-band, informal information that you use now to adjust data after having read it from CLDF datasets.

HedvigS commented 11 months ago

> As I said earlier, the point of a specification like CLDF is making common, shared assumptions explicit and enforceable. If assumptions about dependencies between values become more common, and there's a shared understanding of what they are and how tools should treat them, we might have to think about a formal way to specify them, so they can become part of CLDF.

> As far as I can tell, none of this has happened yet. So people using CLDF tools will have to use the same out-of-band, informal information that you use now to adjust data after having read it from CLDF datasets.

Like I've said before, I don't know who the current or intended user base is for most CLDF tools, so I have limited insight. They seem to me to mainly relate to specific circumstances around lexical data, so I guess this wouldn't occur there. Most typologists I know tend to build their own code, in Python/Perl/R. I don't have insight into who uses datasets with value dependencies together with CLDF tools, beyond my own occasional experience.

Regardless, for the Grambank metadata I'm bending my will to yours, @xrotwang. We continue as we have before, and I'll adjust where I need to, given the information I have recently acquired.