Closed toddfarmer closed 7 years ago
Note: Comment by Wes McKinney (wesm): A couple more notes on this:
While creating the Feather format, which utilizes Arrow for much of its memory layout, we (Hadley Wickham and I) ran into this limitation in the current draft of Arrow metadata (https://github.com/apache/arrow/blob/master/format/Message.fbs).
It would be great to reconcile this need to make progress toward a canonical metadata.
This data type also has the benefit of reducing memory usage for arrays (e.g. with string logical type) containing many duplicate values.
Note: Comment by Julian Hyde (julianhyde): Since Arrow is a general-purpose data format, this requirement seems to me to be too closely targeted at a particular problem domain.
To illustrate, consider another domain, OLAP, where a dimension has a key, a name, a caption (localized name), a localized description, an order key and perhaps user-defined properties. I'm not claiming that OLAP dimensions are the "right" model either.
I suspect that the "right" model is to allow additional attributes in the dictionary (in addition to the single "value" attribute at present). By convention, there would be one or more attribute names that define a category/factor when Python or R reads the dictionary.
Note: Comment by Mohit Jaggi (mohitjaggi): When I was working on feature engineering earlier I struggled with this question too. My conclusion was to let the semantic of "category" be left for interpretation by a higher layer (typically feature engineering or machine learning). In "raw" data a category might be represented in several ways (string, boolean=one hot encoding, number etc) anyway so supporting this at a lower layer would impose constraints on the "raw" data. And then whose responsibility will it be to "prepare" the data to satisfy this constraint? Moreover, concepts like "ordered" are also fuzzy. A set of categories may be unordered for machine learning code but may be ordered for display in a UI. If Arrow is below both layers then this would be confusing.
Note: Comment by Wes McKinney (wesm): There is no doubt that a Category logical type / metadata is necessary for many use cases (because it is semantically distinct from dictionary-encoded data, even though the physical representation is the same). For example: statistics and machine learning users from many communities would not be able to faithfully round trip data to Arrow metadata without it. I will ask others to give their perspective on this if you would like to hear from others.
The implementation (physical representation) of Category is the open question. I would propose for it to be a dictionary-encoded struct with a single child. For example:
Category[string] -> Struct<levels: String>
The additional metadata requirement is orderedness. This needs to be stored in the schema as it needs to be a part of schema negotiation (rather than only observed in the realization of the data in the dictionary).
By using dictionary encoding for the implementation, one can also easily share dictionaries used by multiple fields (having the same category/factor levels).
Note: Comment by Wes McKinney (wesm): The other question I have is the category / dictionary indices. For small cardinality levels it is not uncommon to store the indices in a smaller integer (like int8 / uint8).
Note: Comment by Micah Kornfield (emkornfield@gmail.com): For the structure would it pay to not have a specific metadata type?
At an abstract level a category type indicates a variable that can take on a fixed number of values. Sometimes these values have mnemonic/semantic meanings to them. I think it is equally valid to have a categorical type, represented by a string vector, and communicate them as fixed types in metadata. this leads me to think that categorical information should be additional metadata on the Field table in Message.fbs. We might want to consider two ways of doing this:
table IntCategoryList { // specify the universe of values as a list of integers (non dictionary encoded)
value_universe: [int]
}
table StringCategoryList { // specify the universe of value via a list of strings (non dictionary encoded)
value_universe: [string]
}
table DictionaryCategory { // the universe of values is provided via a dictionary
}
union CategoryUniverseDescription {
IntCategoryList,
StringCategoryList,
DictionaryCategory
}
table CategoryInfo {
ordered: Boolean
category_universe: CategoryUniverseDescription
}
this could be simplified via to always assume factors are dictionary encoded (names are not well though out either). But in either case we could add a optional category member to the Field table of type CategoryInfo.
Regarding the indexing, I would vote to stick with int32s for V1. Sizing the integer type and other more advanced features like bitweaving can be left for V2.
Note: Comment by Hadley Wickham (hadley): I agree with Wes that factors/categories are a fundamental data type - yes, they share similarities with labelled integers or strings constrained to a set of specified values, but they are an important fundamental data type for data analysis. Yes, factors/categories are some what domain specific, but given that the tagline of Arrow is "Powering Columnar In-Memory Analytics", I think this is the domain that arrow should be concerning itself with.
Note: Comment by Jacques Nadeau (jnadeau): Can you guys provide two small example datasets in JSON format here?
Note: Comment by Wes McKinney (wesm): Here's a couple examples from Python and R (I believe SAS / Stata / SPSS / Julia all have similar concepts)
Python (pandas)
In [1]: import pandas as pd
In [2]: s = pd.Series(pd.Categorical.from_array(['foo', 'bar', 'foo', 'bar']))
In [3]: s
Out[3]:
0 foo
1 bar
2 foo
3 bar
dtype: category
Categories (2, object): [bar, foo]
In [4]: s.dtype
Out[4]: category
In [5]: s.cat.codes
Out[5]:
0 1
1 0
2 1
3 0
dtype: int8
In [6]: s.cat.categories
Out[6]: Index(['bar', 'foo'], dtype='object')
(take note of the int8 storage class...)
In Python, the categories can be any data type that is valid in other pandas contexts:
In [9]: s = pd.Series(pd.Categorical.from_array(pd.date_range('2000-01-01', periods=5).repeat(2
...: )))
In [10]: s
Out[10]:
0 2000-01-01
1 2000-01-01
2 2000-01-02
3 2000-01-02
4 2000-01-03
5 2000-01-03
6 2000-01-04
7 2000-01-04
8 2000-01-05
9 2000-01-05
dtype: category
Categories (5, datetime64[ns]): [2000-01-01, 2000-01-02, 2000-01-03, 2000-01-04, 2000-01-05]
In [11]: s = pd.Series(pd.Categorical.from_array([(100, 1000), (0, 100)] * 4))
In [12]: s
Out[12]:
0 (100, 1000)
1 (0, 100)
2 (100, 1000)
3 (0, 100)
4 (100, 1000)
5 (0, 100)
6 (100, 1000)
7 (0, 100)
dtype: category
Categories (2, object): [(0, 100), (100, 1000)]
In R, the category values are constrained to be strings:
> f1 <- factor(c("foo", "bar", "foo", "bar"))
> f1
[1] foo bar foo bar
Levels: bar foo
> as.integer(f1)
[1] 2 1 2 1
> levels(f1)
[1] "bar" "foo"
> f2 <- factor(c("foo", "bar", "foo", "bar"), levels=c("foo", "bar"), ordered=T)
> f2
[1] foo bar foo bar
Levels: foo < bar
> levels(f2)
[1] "foo" "bar"
If the categories have ordering indicated, these can be used automatically in different modeling contexts (for example, in a multinomial logistic regression: https://en.wikipedia.org/wiki/Multinomial_logistic_regression)
It's hard to estimate exactly, but based on the data we have (maybe Hadley / RStudio has a better estimate) it suggests that these two communities alone represent several million users worldwide.
To [~emkornfield@gmail.com] comment, there's a couple of things that come immediately to mind:
The category "levels" in general need to be able to accommodate any logical type. For example, we could use Struct<lower: Float64, upper: Float64>
to represent numerical intervals (e.g. the result of a histogram operation)
"High cardinality" categories frequently occur in the wild, so if the categories are part of the schema, then the schema could be arbitrarily large (up to the 2GB limit in leaf nodes). This will eventually cause problems in schema negotiation.
Analytics involving categorical data may apply transformations to the categories (combining similar categories, reordering, etc.). It would seem more parsimonious to me to write a new dictionary and change the dictionary ID in the schema versus generating a new schema with the new dictionary
Dictionaries / categories may be shared by multiple columns.
A JSON representation of this type might look like
{codes: [0, 0, 0, 0, 1, 1, 1, 1], levels: ['foo', 'bar'], ordered: true}
Note: Comment by Micah Kornfield (emkornfield@gmail.com): Thanks @wesm, I was going to post that I think dictionary encoding is the way to go for these types of columns everything else just adds complications. I apologize for adding noise in that respect.
On that note I would like to take a step back and ask the question implicit on my last post. Should Category be its own first class type, or should it be communicated via a metadata side channel?
I lean towards the latter because it allows for compliant implementations to still handle Categorical values without having to write additional code to handle them explicitly (i.e. if its a Category[UTF16], I need to know how to convert this to something my implementation supports. However, if its just dictionary encoded Utf16 that happens to be a categorical variable, then the implementation can handle fine, and for systems that care about explicit categorical values, they can inspect the metadata and treat it normally.
One other question in regards to one of your points: I assumed Schemas were immutable, does that match your understanding as well?
Note: Comment by Wes McKinney (wesm): We're running into what "first class type" means again. I'm going to change the JIRA title to "Add a Category logical type" to be more clear (I don't think any changes to Layout.md are necessary).
My preferred representation of Category would be as a dictionary-encoded Struct. This has the benefit of allowing systems that don't know about what a Category is to still manipulate / examine the data normally. In other words, if we had the data:
Category[String]
codes: [0, 0, 0, 0, 1, 1, 1, 1]
categories: ['foo', 'bar']
Then the Arrow representation would be as
Struct<levels: String> dictionary-encoded
dictionary indices: [0, 0, 0, 0, 1, 1, 1, 1]
dictionary_id: i
dictionary i:
type=Struct<levels: String>
fields:
levels (type=[String]) : ['foo', 'bar']
Any other ideas for this?
I could have been more clear about my point about the schema – if the categories are embedded in the metadata, then generating a new Schema after a transformation could be arbitrarily expensive. In theory the size in-memory of the Schema should be small, so that modifications (yielding new schemas, due to schema object immutability) are cheap.
Note: Comment by Micah Kornfield (emkornfield@gmail.com): Yeah, sorry, I should be more consistent with terminology. In this case I meant a logical type (i.e. new specific type in Message.fbs). For the physical layout s the desire to represent this as a Struct to allow extensibility with addition data in the enumeration (per Julien's suggestion) or for some other reason? It seems to be easier for system non aware of categorical types to interact with the type if we don't add the extra level of nesting.
I was imagining something more in the order of the below for a categorical string (like R). The general idea is you can have a categorical variable of an arbitrary type, but don't add the extra nesting in the structure proposed above. A Column's categoricalness becomes and extra piece of metadata on the field.
String: dictionary-encoded
dictionary indices: [0, 0, 0, 0, 1, 1, 1, 1]
dictionary_id: i
// new member
Categorical_type: Ordered|Unordered|None
dictionary i:
type=[String]
values= ['foo', 'bar']
The new member above could either be a specific explicitly modeled flatbuffer element, or we can create a general extension for key-value pairs on fields. Another example, of where a key-value pair field might be useful is to pass along metadata about list types, indicating that they are sorted/and or contain contain unique values.
Note: Comment by Wes McKinney (wesm): [~emkornfield@gmail.com] what you've proposed has the nice property that a system without any Category-specific code could treat it as simple dictionary-encoded data. This seems OK to me, if adding this field does not offend others' sensibilities. We could make it more general as dictionary metadata (to avoid having to add more attributes to the Field table should we want to add more interpretations / metadata about the dictionary)
[~julienledem]
curious what you think on these proposals?
Note: Comment by Julien Le Dem (julienledem):
[~wesmckinn]
and [~emkornfield@gmail.com]: the proposal of adding a categorical_type field to Field sounds good to me.
when it is set then it is required that the field is dictionary encoded.
Note: Comment by Micah Kornfield (emkornfield@gmail.com): Do we want to make an exception for requiring dictionary encoding for Categorical[ints]?
Note: Comment by Julien Le Dem (julienledem):
ints can also be dictionary encoded (benefits compactness or speed if performing aggregation for example as the ids are contiguous).
So the exception would be only if it's ints and all the values between 0 and n ? is this a common use case? [~wesmckinn]
[~hadley]
?
Note: Comment by Wes McKinney (wesm): You could have a case of redundancy if the span of dictionary indices and dictionary values is the same. In practice, dictionary encoding integers with a wide span (for example – if max(values) - min(values) is some large number) can have significant performance benefits as you can do things normally requiring a hash table (e.g. computing a frequency table) with much less effort.
Note: Comment by Wes McKinney (wesm): PR, finally: https://github.com/apache/arrow/pull/297
Note: Comment by Wes McKinney (wesm): Issue resolved by pull request 297 https://github.com/apache/arrow/pull/297
Note: This issue was originally created as ARROW-81. Please see the migration documentation for further details.
Original Issue Description:
A Category (or "factor") is a dictionary-encoded array whose dictionary has semantic meaning. The data consists of
Category/factor types are used in a number of common statistical analyses. See, for example, http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a basic requirement for Python and R, at least, as Arrow C++ consumers, to have this type. Separately, we should consider what is necessary to be able to transmit category data in IPCs – possible an expansion of the Arrow format.