More general dimension labeling functionality

metasoarous commented 9 years ago

The goal is to approach R's dimension naming capabilities. Vectors, matrices, and data frames all support naming. This can be quite useful for non-numerical indexing.

Started this issue on the incanter repo, and now moving here. This was @mikera's response:

we now have protocols in core.matrix that support labelling along any dimension - see clojure.core.matrix.protocols/PDimensionLabels and associated functionality.

So I think this functionality can be built in core.matrix and used directly from Incanter. What is needed, I think:

- Some rationalisation of the protocols (probably PDimensionImplementation can be deprecated in favour of PDimensionLabels, for example)
- One or more "wrapper" classes that can add labels to an arbitrary array
- Label support added to the main Dataset implementation (in clojure.core.matrix.impl.dataset); currently this is quite a primitive implementation and only supports column names

Also see issue #193, equivalent to the third bullet above.

metasoarous commented 9 years ago

OK; starting to get a sense for where this is going, but have a few questions/comments.

- For the dataset implementation, should we add a [rownames colnames data] arity? Or use :rownames, :colnames keyword arguments instead? I prefer the latter (passing nil if you want rownames but not colnames feels weird to me), but realize it would be a breaking change, so I am happy to defer.
- How do we deal with selecting by both label names and positional indices? I think having both is valuable, and there are a few ways I could see this being done:
  - Separate functions for selecting by indices versus labels: a simple solution, but a bit pedantic.
  - R data.frames don't allow you to provide numeric names for your columns/rows; if you try, it turns them into strings. Doing something similar here (casting to strings, or erroring when numerics are passed as names) would simplify things, since you could use the same functions and dispatch on type.
  - Given numeric indices, we could first check whether they appear in the name collection, and treat them as names if they do. If not, try to use them as positional indices.
- The current implementation of select-columns uses .indexOf to find the index. This has linear computational complexity, which can become problematic for large data sets if you do a lot of indexing (this turned out to be a big problem in my application). An alternative would be to use an array-map (of name -> index) instead. This gives quick access both to the ordered rownames and to positional indices given label names. However, it comes at the cost of increased housekeeping when removing columns/rows or creating submatrices. So, what do you think the right approach is here? Should use of an array-map be the default implementation, a secondary implementation, or an optional wrapper?
- As for wrapping arbitrary arrays with dimension labels, I may have a clever idea for how to do this. I'll tinker with it a bit and let you know how it turns out.
- Definitely agree that PDimensionImplementation can be deprecated.
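To make the lookup-cost point concrete, here is a minimal sketch contrasting a linear .indexOf scan with a precomputed name -> index map. The column names and the helper names (index-by-scan, build-label-map) are made up for illustration and are not part of core.matrix. A plain Clojure map is used rather than array-map.

```clojure
;; Hypothetical column names; not taken from core.matrix.
(def colnames ["Country" "Population" "Area"])

;; Linear scan: .indexOf walks the vector on every lookup.
(defn index-by-scan [names label]
  (let [i (.indexOf ^java.util.List names label)]
    (when-not (neg? i) i)))

;; Precomputed map of label -> position. It is built once; lookups then
;; avoid rescanning the whole name vector.
(defn build-label-map [names]
  (zipmap names (range)))

(def label->index (build-label-map colnames))

(index-by-scan colnames "Population") ; scans the vector
(get label->index "Population")       ; map lookup

;; The housekeeping cost: dropping a column shifts later positions,
;; so the map has to be rebuilt.
(build-label-map (remove #{"Population"} colnames))
```

The last form shows the trade-off mentioned above: the ordered vector stays the source of truth, and the map must be regenerated whenever columns are removed or reordered.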

mikera commented 9 years ago

It may be that we want to support higher dimensional datasets, so you can have labels on any dimension (not just rows and columns, which are dimensions 0 and 1 of a 2D matrix). Worth thinking about this option; it could be either the same as or different from the "wrapper" solution.

On the labels vs. positional lookup:

- I think we need to maintain support for positional indices. That is needed to be consistent with how the rest of core.matrix works.
- Agree an indexed collection / map makes sense. Probably a map of name -> positional index. Actually, some sort of fast bidirectional lookup might be helpful.
- I think that "try as name first, if not found then try as positional index (if it is a number), else fail with error" is the right strategy for selecting by name.
- Functions that allow selecting by label need to be separate API functions. There is going to be quite a performance hit on labelled lookup, so we can't really afford that to be in the fast path of the main API functions (mget, for example, needs to be really fast).
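The "name first, then positional index, else error" strategy can be sketched roughly as follows. resolve-index is a hypothetical helper, not an existing core.matrix function, and it rebuilds the lookup map on every call purely to keep the example short.

```clojure
(defn resolve-index
  "Resolve selector `k` against `labels` (a vector of labels for one
  dimension): try it as a label first; if that fails and `k` is an
  integer within bounds, treat it as a positional index; otherwise throw."
  [labels k]
  (let [label->index (zipmap labels (range))] ; real code would cache this
    (cond
      (contains? label->index k)                 (get label->index k)
      (and (integer? k) (< -1 k (count labels))) k
      :else (throw (ex-info "No such label or index" {:selector k})))))

(resolve-index ["Country" "Population" "Area"] "Population") ; matched as a label
(resolve-index ["Country" "Population" "Area"] 2)            ; falls back to position
(resolve-index [10 20 30] 20)                                 ; numeric label takes priority
```

The last call shows the ambiguity this rule has to settle: when labels are themselves numbers, a numeric selector is read as a label before it is ever considered as a position.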

mars0i commented 9 years ago

I think that "try as name first, if not found then try as positional index (if it is a number), else fail with error" is the right strategy for selecting by name

I agree with everything else, but I'm not sure whether I like this behavior. If I know that I have an index, I'd use mget or something that requires indexes. If I use a function that expects labels, why would I expect it to turn a number into an index? Maybe there should just be an error if a label isn't found. This is one of those areas where there's a tradeoff between flexibility and being conducive to bugs. In Clojure, we tend to go for flexibility, and I prefer that, but I'm wondering whether in this case the benefits of flexibility outweigh the costs.

mikera commented 9 years ago

My main rationale for this is the ability to mix indices and labels for different dimensions, e.g. something like (select dataset 20 "Population").
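A rough sketch of what mixing a positional row index with a column label could look like. The data layout and the select-cell and dim-index helpers are assumptions made up for this example; this is not the core.matrix select function, and .indexOf is used only for brevity.

```clojure
;; Hypothetical labelled 2D data: rows are unlabelled (nil), columns are labelled.
(def dataset
  {:labels [nil ["Country" "Population" "Area"]]
   :rows   [["A" 100 1.5]
            ["B" 200 2.5]
            ["C" 300 3.5]]})

(defn- dim-index
  "Label lookup first (when the dimension has labels), then position."
  [labels k]
  (let [i (if labels (.indexOf ^java.util.List labels k) -1)]
    (cond
      (not (neg? i)) i
      (integer? k)   k
      :else (throw (ex-info "No such label or index" {:selector k})))))

(defn select-cell
  "Hypothetical helper: resolve each selector against its own dimension."
  [{:keys [labels rows]} row-sel col-sel]
  (get-in rows [(dim-index (nth labels 0) row-sel)
                (dim-index (nth labels 1) col-sel)]))

(select-cell dataset 1 "Population") ; row by position, column by label
```

Because each dimension is resolved independently, an unlabelled dimension simply falls through to positional indexing.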

mars0i commented 9 years ago

I see the utility of that. OK.

metasoarous commented 9 years ago

Groovy.

A separate API for label/name specific functions sounds good. Perhaps core.matrix.labels for the API and core.matrix.labels.impl for the implementation work?

As for higher dimensional datasets, it seems like there are two ways to go:

- Given the current implementation, the most natural thing to do is have the extra dimensions (cue string theory jokes) be "curled up" inside the column arrays. The 0th dimension of these arrays would correspond to the row, while the ith dimension (i > 0) would correspond to the (i+1)th dimension of the dataset (since columns, which are dimension 1, correspond to the columns vector in DataSet's implementation).
- The simpler alternative would be to implement N dimensional datasets as N dimensional arrays in a labels wrapper (and perhaps some additional API/semantics).

I think that while the first implementation is wonkier, it may ultimately be a better approach, as it provides more flexibility as to the array types being used for the columns (not all array types would allow for column data of varying types). I'm happy to go either way on this though, and welcome your thoughts.
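For illustration, the two layouts might look something like this as plain Clojure data. The keys, the sample values, and the curled-get helper are all hypothetical; nothing here reflects the actual DataSet implementation.

```clojure
;; Option 1 ("curled up"): a vector of columns. Each column is an array whose
;; dimension 0 is the row; any further dimensions hold the extra dataset axes.
;; Columns can hold different element types.
(def curled-up
  {:column-names ["name" "scores"]
   :columns      [["a" "b"]              ; 1D column: one string per row
                  [[1 2 3] [4 5 6]]]})   ; 2D column: three scores per row

(defn curled-get
  "Dataset index [row col & more] maps to index [row & more] inside column `col`."
  [ds row col & more]
  (get-in (nth (:columns ds) col) (into [row] more)))

;; Option 2 (wrapper): one N-dimensional array plus labels per dimension
;; (nil where a dimension is unlabelled). All elements share one array type.
(def wrapped
  {:labels [["a" "b"] nil ["x" "y" "z"]]
   :data   [[[1 2 3]] [[4 5 6]]]})

(curled-get curled-up 1 1 2)      ; dataset index [1 1 2] -> column 1, index [1 2]
(get-in (:data wrapped) [1 0 2])  ; direct N-dimensional index
```

The index shuffle in curled-get is the "wonkier" part of the first option; the mixed string and numeric columns are what it buys.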

mars0i commented 9 years ago

fwiw, R has separate dataframes (2D, with only column names doing any work), matrices (2D, with optional column and row names), and arrays (N-dimensional, again with optional names along any dimension). So there's a precedent for splitting data structures into a standard 2D structure (dataframes, matrices) and another N-dimensional kind of structure. On the other hand, this distinction seems very inelegant, and probably just has to do with the historical development of R. (Annoying, too, sometimes, because there are operations that are only designed for dataframes, and you have to create a new dataframe to use them--merely taking slices or averages of an array won't work.)

mikera commented 9 years ago

I'd really like to avoid having completely different APIs. Most API functions should be the same, and the few specialised API functions can live in the same namespaces just fine.

For the label-aware selecting, it may be sensible to extend the existing API in clojure.core.matrix.select, which offers more general selection functionality and could probably be made label-aware.

As for underlying implementations ("curled up" representation, etc.), I think it is fine to have different kinds of implementations / wrappers. The API should remain the same, however.

metasoarous commented 9 years ago

OK. I misunderstood what you meant up there in your last bullet a few messages back. I get you now, though; sounds good.

When the time comes I may ping you about which things should have separate functions, and which can be the same.

metasoarous commented 9 years ago

I realized that my suggestion of using array-map for fast bidirectional lookup was flawed, which is funny because I came to that conclusion before when developing an application-specific version of this. I forgot that lookup time on that type grows linearly with the number of entries, so that's obviously no good...

Anyway, I've implemented a LabelIndex type to address this problem. It's a little more specific to the particular use case here than a general-purpose fast bidirectional lookup type would be (in particular, it assumes the things being mapped to on the "right" are numerical indices...), but I think it'll fit the bill. Not sure what the final semantics will look like yet; will have to adapt as we build out labeling functionality around it, and figure out what's needed. But I think it's a pretty good, clean start.
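The LabelIndex code itself is not shown in this thread. As a rough idea of the shape such a type could take, here is a minimal sketch that pairs an ordered vector of labels (index -> label) with a map (label -> index). The field names and helper functions are assumptions, not the author's implementation.

```clojure
;; Hypothetical sketch; field and function names are assumptions.
(defrecord LabelIndex [labels positions])

(defn label-index
  "Build an index from an ordered collection of distinct labels."
  [labels]
  (let [labels (vec labels)]
    (->LabelIndex labels (zipmap labels (range)))))

(defn index-of
  "Position of `label`, or nil if absent."
  [li label]
  (get (:positions li) label))

(defn label-at
  "Label at position `i`, or nil if out of range."
  [li i]
  (nth (:labels li) i nil))

(defn drop-label
  "Removing a label shifts later positions, so the index is rebuilt."
  [li label]
  (label-index (remove #{label} (:labels li))))

(def li (label-index ["Country" "Population" "Area"]))

(index-of li "Population")                   ; label -> position
(label-at li 2)                              ; position -> label
(index-of (drop-label li "Population") "Area") ; position after removal
```

Keeping both directions in one value is what makes the lookup bidirectional; the price is that every structural change goes through a rebuild like drop-label.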

Would love your feedback on this, and will start working on other pieces soon.

mikera commented 9 years ago

I think the LabelIndex type looks sensible.

It's important to make sure that it remains an implementation detail, however. We don't really want helper types like this leaking out via APIs (there are many different ways that you can implement labelling, so the API shouldn't preclude different choices).

So I assume the intention is to have a Dataset type with a vector of one LabelIndex for each dimension (or possibly nil if that dimension is unlabelled) - correct?

metasoarous commented 9 years ago

Excellent; agreed. Leaving it as an implementation detail was my intention. And yes, the idea would be to have one LabelIndex per labeled dimension.

It's been a minute since I've worked on this, as I've had some other things going on. Hopefully I'll be able to pick it up again in a few weeks.

mikera / core.matrix

More general dimension labeling functionality #220