Proposal: create a better object for categorical data

djvanderlaan commented 6 years ago

In statistical offices we often work with various classifications. The default factor variable in R has only very limited capabilities. We could try to create something better:

Classifications often have codes and labels (think business classifications; regional classifications). It would be nice to keep those together and be able to switch between the two.
Classifications are often hierarchical. Factors don't really allow for this. It would be nice if one could, for example, switch between different levels, e.g. when aggregating. Perhaps this would also be of use for methods such as disclosure control.
Translations. Be able to choose between different translations of categories, e.g. for internal and external output.
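To make the codes/labels/translations idea concrete, here is a minimal base R sketch. It is only an illustration: the table `classification`, its column names, the labels and the helper `recode_as` are all made up for this example and are not taken from any existing package or official classification.

```r
# Hypothetical lookup table: one row per category, one column per representation.
# Codes and labels are illustrative only.
classification <- data.frame(
  code     = c("A", "B", "C"),
  label_en = c("Agriculture", "Mining", "Manufacturing"),
  label_nl = c("Landbouw", "Mijnbouw", "Industrie"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate values from one representation to another.
# Values that are not found in the 'from' column become NA.
recode_as <- function(x, classification, from, to) {
  classification[[to]][match(x, classification[[from]])]
}

x <- c("B", "A", "C", "B")
labels_en <- recode_as(x, classification, from = "code", to = "label_en")
labels_nl <- recode_as(x, classification, from = "code", to = "label_nl")
```

Keeping all representations in one table means that switching between codes, internal labels and external labels is a single lookup, which is roughly what a richer factor-like object would have to do internally.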

Concerning the hierarchical classifications: it would also be nice if aggregates could be calculated at different levels as is the case in statistical output tables. With tools such as dplyr this is now quite cumbersome. Perhaps a different group_by.

djvanderlaan commented 6 years ago

I once made a start on this: https://github.com/djvanderlaan/categorical and I know @edwindj also tried something.

markvanderloo commented 6 years ago

I also gave it a shot once. Let's add it to the list!

We can start discussing some ideas here already.

markvanderloo commented 6 years ago

Regarding aggregation. A hierarchical categorical variable is basically a DAG. So for aggregation I think we need convenient ways to specify the nodes that define the aggregation subset. Since hierarchies are often unbalanced trees, this is probably more suitable than just allowing for the definition of a 'level' in the tree (which is closer to a group_by).

We probably want both, but the level-based aggregation can be built on top of the ability to define individual nodes of aggregation (hope I'm still clear).
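A rough base R sketch of what node-based aggregation could look like, assuming the hierarchy is stored as a parent-child table. The names `edges`, `map_to_nodes` and `obs`, and the counts, are invented for illustration only.

```r
# Hypothetical parent-child edge list for an unbalanced tree.
edges <- data.frame(
  child  = c("Native Dutch", "Non-Western", "Western",
             "Turkish", "Moroccan", "Other"),
  parent = c("Dutch", "Dutch", "Dutch",
             "Non-Western", "Non-Western", "Non-Western"),
  stringsAsFactors = FALSE
)

# Walk up from each value until one of the requested nodes is reached.
# Values with no requested ancestor map to NA.
map_to_nodes <- function(x, edges, nodes) {
  parent_of <- setNames(edges$parent, edges$child)
  vapply(x, function(value) {
    while (!is.na(value) && !(value %in% nodes)) {
      value <- unname(parent_of[value])
    }
    value
  }, character(1), USE.NAMES = FALSE)
}

# Made-up observations at the most detailed level.
obs <- data.frame(
  group = c("Turkish", "Moroccan", "Native Dutch", "Western", "Other"),
  count = c(10, 20, 100, 30, 5),
  stringsAsFactors = FALSE
)

# Aggregate to an arbitrary set of nodes; these need not share a 'level'.
obs$node <- map_to_nodes(obs$group, edges,
                         nodes = c("Native Dutch", "Non-Western", "Western"))
totals <- aggregate(count ~ node, data = obs, FUN = sum)
```

Level-based aggregation then becomes a special case: pick all nodes at a given depth and pass them as `nodes`. Mixed sets such as `c("Turkish", "Non-Western", "Dutch")` also work, because the walk stops at the first requested node it meets.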

djvanderlaan commented 6 years ago

I think I understand what you mean. Storing the classification as a DAG seems to be most general, although I can also think of situations where a DAG is a bit weird, e.g. in the case of translations.

The advantage of levels is that they probably make it easier for the user to specify what they want. But that doesn't mean we can't store the internal information as a DAG, as you mentioned.

For reference: the method I used in my attempt is storing it as a data.frame, e.g.:

"Dutch"  "Native Dutch"  "Native Dutch"
"Dutch"  "Non-Western"   "Turkish"
"Dutch"  "Non-Western"   "Moroccan"
"Dutch"  "Non-Western"   "Other"
"Dutch"  "Western"       "Western"

This also makes it easy to have non-hierarchical 'levels':

Municipality  Province  SafetyRegion
A             X         U
B             X         V
C             Y         V
D             Y         W

And custom levels (perhaps someone wants to focus on a specific group):

"Dutch"  "Native Dutch"  "Native Dutch"  "Non-Western"
"Dutch"  "Non-Western"   "Turkish"       "Non-Western"
"Dutch"  "Non-Western"   "Moroccan"      "Non-Western"
"Dutch"  "Non-Western"   "Other"         "Non-Western"
"Dutch"  "Western"       "Western"       "Western"

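As a sketch of how such a level table could drive aggregation, the following base R example joins observations to the table and sums at a chosen level. The column names `level1` to `level3`, the helper `aggregate_at` and the counts are assumptions made for this illustration; the tables above do not name their columns.

```r
# The three-level example from above as a data.frame (column names are hypothetical).
levels_table <- data.frame(
  level1 = rep("Dutch", 5),
  level2 = c("Native Dutch", "Non-Western", "Non-Western", "Non-Western", "Western"),
  level3 = c("Native Dutch", "Turkish", "Moroccan", "Other", "Western"),
  stringsAsFactors = FALSE
)

# Made-up observations recorded at the most detailed level.
obs <- data.frame(
  level3 = c("Turkish", "Moroccan", "Native Dutch", "Western", "Other", "Turkish"),
  count  = c(10, 20, 100, 30, 5, 7),
  stringsAsFactors = FALSE
)

# Join the observations to the level table, then sum at the requested level.
aggregate_at <- function(obs, levels_table, level) {
  joined <- merge(obs, levels_table, by = "level3")
  aggregate(joined["count"], by = joined[level], FUN = sum)
}

by_level2 <- aggregate_at(obs, levels_table, "level2")
by_level1 <- aggregate_at(obs, levels_table, "level1")
```

A custom or non-hierarchical level is just another column in `levels_table`, so the same function would cover those cases without changes.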
djvanderlaan commented 6 years ago

I'm not sure 'translations' are strictly DAGs: A is a translation of B, but that makes B also a translation of A.

markvanderloo commented 6 years ago

How would you handle unbalanced hierarchies then? E.g. the NACE classification is sometimes specified up to 2, sometimes up to 3, 4 or (in SBI) 5 digits. A data.frame does not seem a logical way to store this (you could use NA for unused levels, of course).
Also, I am not sure what you mean by translations. Do you mean between natural languages?

markvanderloo commented 6 years ago

Another thing to think about is how to handle multiple hierarchical variables. I haven't given that any thought yet (just using this issue as a notebook really).

djvanderlaan commented 6 years ago

I am not saying that this is the most suitable data format. It is probably easiest for users to supply (and read) the data in this format. But I can imagine that graphs are better for internal storage. One way of handling unbalanced hierarchies is indeed NAs, or, as in my example with ethnicity, repeating the labels.
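The two options (NA for unused levels versus repeating labels) can be converted into each other. A minimal base R sketch, assuming the level columns are ordered from coarse to fine; the table `unbalanced` and the helper `fill_levels` are hypothetical names.

```r
# Hypothetical unbalanced classification: some branches stop early (NA).
unbalanced <- data.frame(
  level1 = c("Dutch", "Dutch", "Dutch"),
  level2 = c("Native Dutch", "Non-Western", "Western"),
  level3 = c(NA, "Turkish", NA),
  stringsAsFactors = FALSE
)

# Fill unused deeper levels by repeating the label of the level above,
# working from the second column to the last.
fill_levels <- function(tab) {
  for (j in seq_along(tab)[-1]) {
    missing <- is.na(tab[[j]])
    tab[[j]][missing] <- tab[[j - 1]][missing]
  }
  tab
}

balanced <- fill_levels(unbalanced)
```

The repeated-label form is convenient for joins and grouping, while the NA form keeps the information about where a branch really ends.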

Concerning the translations: yes, and also between codes and labels. I run into this when working with hospital data. People used to working with the data work with the codes, especially when selecting etc., as they are more compact and easier to enter. For the output to the hospitals we use the Dutch labels. For the methodology description, which is in English, we use the English labels.

djvanderlaan commented 6 years ago

What do you mean by 'multiple hierarchical variables'?

markvanderloo commented 6 years ago

Ah, that clears it up. And I totally agree that specifying a hierarchy is probably easiest via a data frame (didn't think of the user interface yet). For computing I'm not sure. Maybe it is actually -- handling tables is easier than handling graphs. OTOH I can imagine that there are some fundamentally useful operations that would require a DAG or tree representation. Perhaps we should come up with a list of operations we'd like to support.

By multiple variables I mean multiple columns with hierarchical variables: how to specify frequency counts when crossing them, for instance. It's probably easy, but I just noted it because I haven't even considered whether that's going to make things more difficult or not.

djvanderlaan commented 6 years ago

Something like triplets might work:

A "is the english label of" B
C "is the dutch label of" B
B "is part of" D

A lot of operations can then be written as joins and filters.
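A small base R sketch of the triplet idea, to show what "joins and filters" could mean in practice. The predicate names (`label_en`, `label_nl`, `part_of`) and the codes (`TUR`, `MAR`, `NW`) are invented for this example.

```r
# Hypothetical triple store: subject, predicate, object.
triples <- data.frame(
  subject   = c("Turkish", "Turks", "TUR", "Moroccan", "Marokkaans", "MAR"),
  predicate = c("label_en", "label_nl", "part_of",
                "label_en", "label_nl", "part_of"),
  object    = c("TUR", "TUR", "NW", "MAR", "MAR", "NW"),
  stringsAsFactors = FALSE
)

# Filter: all codes that are part of the group "NW".
members <- triples$subject[triples$predicate == "part_of" & triples$object == "NW"]

# Join: pair each English label with the Dutch label via the shared code.
en <- triples[triples$predicate == "label_en", c("subject", "object")]
nl <- triples[triples$predicate == "label_nl", c("subject", "object")]
names(en) <- c("label_en", "code")
names(nl) <- c("label_nl", "code")
translation <- merge(en, nl, by = "code")
```

Because labels and hierarchy live in the same table, symmetric relations such as translations do not have to be forced into a DAG; they are simply two rows pointing at the same code.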

Concerning the multiple hierarchies: especially when they are interdependent, this can be a pain. One example is 'ethnicity' as defined above and 'generation', e.g. 2nd generation Turkish, but you can't have '2nd generation native Dutch'.

But I agree, perhaps first look at the type of operations you would want, and then decide on how to implement.

edwindj commented 6 years ago

I am in for developing such a package (my first try was https://github.com/edwindj/category).

markvanderloo commented 6 years ago

This may be interesting summer reading: Towards a general theory of classifications

edwindj commented 6 years ago

Regarding your discussion on DAGs and data.frames: I am taking a middle ground. I was involved in the development of a DAG model for classifications (Crystal), which stored the hierarchical structure as a POS and provided means to store time-dependent versions of a hierarchy (by allowing hierarchies to reuse the nodes of the same underlying POS). I like such a model because it captures the concept of hierarchies sharing nodes.

However:

Such a model is reasonably complex (both in interface and in implementation). I would like the implementation if there were also a simplified version for the simple cases:
Most use cases (let's say 80%) are a simple classification tree: a simple tree interface would be nice!

Wish list:

Conversion from/to a data.frame with one column per level (as described by Jan), e.g. "Country", "Municipality"
Conversion from/to a parent-child data.frame (tree-like entry), e.g. "NL", "Amsterdam"
Conversion to/from igraph
Conversion to/from yaml / xml / json?
Hierarchical aggregation functionality
Extracting levels from a classification (as a factor)
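As an illustration of the first two wish-list items, here is a base R sketch that converts a level-per-column data.frame into a parent-child data.frame. The helper `levels_to_edges` and the example table are hypothetical; the sketch assumes at least two columns ordered from coarse to fine, and that a repeated label in adjacent columns means "no deeper level".

```r
# Hypothetical level table: one column per level, coarse to fine.
level_table <- data.frame(
  Country      = c("NL", "NL", "NL"),
  Province     = c("Noord-Holland", "Noord-Holland", "Zuid-Holland"),
  Municipality = c("Amsterdam", "Haarlem", "Rotterdam"),
  stringsAsFactors = FALSE
)

# Pair each column with the column to its left to get parent-child edges.
levels_to_edges <- function(tab) {
  pairs <- lapply(seq_along(tab)[-1], function(j) {
    data.frame(parent = tab[[j - 1]], child = tab[[j]],
               stringsAsFactors = FALSE)
  })
  edges <- unique(do.call(rbind, pairs))
  # Drop edges with missing ends and self-edges from repeated labels.
  keep <- !is.na(edges$parent) & !is.na(edges$child) &
    edges$parent != edges$child
  edges <- edges[keep, ]
  rownames(edges) <- NULL
  edges
}

edges <- levels_to_edges(level_table)
```

One caveat of this simple approach: categories at different levels that legitimately share a name would be treated as a repeated label and lose their edge, so a real implementation would probably need unique codes per node.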
@markvanderloo do you happen to have a paper copy of that paper?

markvanderloo commented 6 years ago

Nope, it's a book. I'll order it.

bogdanoancea commented 6 years ago

Couldn't we hold it in one of the amphitheatres rented from Titulescu? And in the seminar rooms there?

On Fri, May 25, 2018 at 11:29 AM, Mark van der Loo <notifications@github.com> wrote:

nope. its a book. I'll order it.


djvanderlaan commented 6 years ago

For reference: I ran into a package that might be of interest, although it doesn't solve the issues we were talking about above: https://github.com/tidyverse/forcats

RLesur commented 6 years ago

Following the discussion on DAGs, the implementation could be compliant with XKOS (an SKOS extension for modeling statistical classifications), edited by @FranckCo. There are some illustrations with NACE; I think XKOS could be a valuable input.

markvanderloo commented 6 years ago

Thanks for the reference!

It would be a good idea to somehow connect to this standard, e.g. to make it possible to parse it into R. Maybe an XKOS parser would be nice as a separate package? Would you like to join the unconf :-) ?

I've looked at some examples. I was surprised to see it's HTML rather than XML. Good idea actually, that makes it human-readable by just opening it in a browser. Nice!

RLesur commented 6 years ago

There are also some examples expressed in Turtle that can be parsed with the rdflib package, for instance (not tried yet).

I'd like to join the unconf, but my administration has a limited budget. So, I'll only attend the conf. However, I'll follow the unconf output.

I also developed a package that I use to manage classifications (not sure that it is the best input for the unconf): https://github.com/RLesur/casewhen

dpprdan commented 6 years ago

A little late, because I know you have already started today, but the following might be worth a look if you don't know it already:

Re labels there is the sjlabelled package by @strengejacke. In addition, @njtierney just recently added support for special missing values to naniar.

Re translations Stata's label language might also be of interest.

uRosConf / unconfUROS2018
