mikera / core.matrix

core.matrix : Multi-dimensional array programming API for Clojure

More general dimension labeling functionality #220

Open metasoarous opened 9 years ago

metasoarous commented 9 years ago

The goal is to approach R's dimension naming capabilities. Vectors, matrices, and data frames all support naming. This can be quite useful for non-numerical indexing.

Started this issue on the incanter repo, and now moving here. This was @mikera's response:

we now have protocols in core.matrix that support labelling along any dimension - see clojure.core.matrix.protocols/PDimensionLabels and associated functionality.

So I think this functionality can be built in core.matrix and used directly from Incanter. What is needed, I think:

  • Some rationalisation of the protocols (probably PDimensionImplementation can be deprecated in favour of PDimensionLabels, for example)
  • One or more "wrapper" classes that can add labels to an arbitrary array
  • Label support added to the main Dataset implementation (in clojure.core.matrix.impl.dataset) - currently this is quite a primitive implementation and only supports column names

Also see issue #193, equivalent to the third bullet above.

metasoarous commented 9 years ago

OK; starting to get a sense for where this is going, but have a few questions/comments.

mikera commented 9 years ago

It may be that we want to support higher dimensional datasets, so you can have labels on any dimension (not just rows and columns, which are dimensions 0 and 1 of a 2D matrix). Worth thinking about this option; it could be either the same as or different from the "wrapper" solution.

On the labels vs. positional lookup:

mars0i commented 9 years ago

I think that "try as name first, if not found then try as positional index (if it is a number), else fail with error" is the right strategy for selecting by name

I agree with everything else, but I'm not sure whether I like this behavior. If I know that I have an index, I'd use mget or something that requires indexes. If I use a function that expects labels, why would I expect it to turn a number into an index? Maybe there should just be an error if a label isn't found. This is one of those areas where there's a tradeoff between flexibility and being conducive to bugs. In Clojure, we tend to go for flexibility, and I prefer that, but I'm wondering whether in this case the benefits of flexibility outweigh the costs.

mikera commented 9 years ago

My main rationale for this is the ability to mix indices and labels for different dimensions, e.g. something like (select dataset 20 "Population").
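To make the "label first, then positional index, else error" strategy concrete, here is a minimal sketch in plain Clojure. It does not use core.matrix; `resolve-selector` and `select-cell` are hypothetical names invented for illustration, and the data is a nested vector rather than a real dataset.

```clojure
;; Hypothetical helpers, not part of core.matrix.

(defn resolve-selector
  "Resolves sel against one dimension. labels is a vector of labels for
   that dimension (or nil if unlabelled); size is the dimension length.
   Tries sel as a label first, then as a positional index, else throws."
  [labels size sel]
  (let [by-label (first (keep-indexed (fn [i l] (when (= l sel) i)) labels))]
    (cond
      by-label by-label                          ; found as a label (0 is truthy in Clojure)
      (and (integer? sel) (< -1 sel size)) sel   ; fall back to a position
      :else (throw (ex-info "No matching label or index" {:selector sel})))))

(defn select-cell
  "Looks up one cell of a 2D nested vector, resolving each selector
   independently so labels and indices can be mixed."
  [rows row-labels col-labels row-sel col-sel]
  (let [r (resolve-selector row-labels (count rows) row-sel)
        c (resolve-selector col-labels (count (first rows)) col-sel)]
    (get-in rows [r c])))

(def data [[1 100] [2 200] [3 300]])
(def col-labels ["Id" "Population"])

;; Row by position, column by label.
(select-cell data nil col-labels 2 "Population")
```

The concern raised above shows up when labels are themselves numbers: with labels `[2 0 1]`, the selector `0` resolves to position 1 (where the label `0` lives), not to position 0.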

mars0i commented 9 years ago

I see the utility of that. OK.

metasoarous commented 9 years ago

Groovy.

A separate API for label/name specific functions sounds good. Perhaps core.matrix.labels for the API and core.matrix.labels.impl for the implementation work?

As for higher dimensional datasets, it seems like there are two ways to go:

I think that while the first implementation is wonkier, it may ultimately be a better approach, as it provides more flexibility as to the array types being used for the columns (not all array types would allow for column data of varying types). I'm happy to go either way on this though, and welcome your thoughts.

mars0i commented 9 years ago

fwiw, R has separate dataframes (2D, with only column names doing any work), matrices (2D, with optional column and row names), and arrays (N-dimensional, again with optional names along any dimension). So there's a precedent for splitting data structures into a standard 2D structure (dataframes, matrices) and another N-dimensional kind of structure. On the other hand, this distinction seems very inelegant, and probably just has to do with the historical development of R. (Annoying, too, sometimes, because there are operations that are only designed for dataframes, and you have to create a new dataframe to use them--merely taking slices or averages of an array won't work.)

mikera commented 9 years ago

I'd really like to avoid having completely different APIs. Most API functions should be the same, and the few specialised API functions can live in the same namespaces just fine.

For the label-aware selecting, it may be sensible to extend the existing API in clojure.core.matrix.select, which offers more general selection functionality and could probably be made label-aware.

As for underlying implementations ("curled up" representation, etc.) - I think it is fine to have different kinds of implementations / wrappers. The API should remain the same however.

metasoarous commented 9 years ago

OK. I misunderstood what you meant up there in your last bullet a few messages back. I get you now though; sounds good.

When the time comes I may ping you about which things should have separate functions, and which can be the same.

metasoarous commented 9 years ago

I realized that my suggestion of using array-map for fast bidirectional lookup was flawed, which is funny because I came to that conclusion before when developing towards an application specific version of this. I forgot that lookup on that type grows linearly, so that's obviously no good...

Anyway, I've implemented a LabelIndex type to address this problem. It's a little more specific to the particular use case here than a general purpose fast, bidirectional lookup type would be (in particular, it assumes the things being mapped to on the "right" are numerical indices...), but I think it'll fit the bill. Not sure what the final semantics will look like yet; will have to adapt as we build out labeling functionality around it, and figure out what's needed. But I think it's a pretty good, clean start.
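The LabelIndex code itself is not shown in this thread, so the following is only a sketch of the general idea under stated assumptions: labels are distinct, the "right" side is always a zero-based position, and the structure is a plain map rather than a dedicated type. The names `label-index`, `index-of`, and `label-at` are hypothetical. A vector handles index-to-label lookup, and an ordinary Clojure map (which is promoted to a hash map as it grows) handles label-to-index lookup, avoiding the linear scan of an array-map.

```clojure
;; Hypothetical sketch; not the LabelIndex type mentioned above.

(defn label-index
  "Builds a bidirectional lookup from a sequence of distinct labels."
  [labels]
  (let [labels (vec labels)]
    {:index->label labels                      ; position -> label, via vector lookup
     :label->index (zipmap labels (range))}))  ; label -> position, via map lookup

(defn index-of
  "Returns the position of label, or nil if the label is unknown."
  [li label]
  (get (:label->index li) label))

(defn label-at
  "Returns the label at position i, or nil if i is out of range."
  [li i]
  (nth (:index->label li) i nil))

(def li (label-index ["Id" "Population" "Area"]))

(index-of li "Population")
(label-at li 2)
```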

Would love your feedback on this, and will start working on other pieces soon.

mikera commented 9 years ago

I think the LabelIndex type looks sensible.

Important to make sure that it remains as an implementation detail however - we don't really want helper types like this leaking out via APIs (there are many different ways that you can implement labelling, so the API shouldn't preclude different choices)

So I assume the intention is to have a Dataset type with a vector of one LabelIndex for each dimension (or possibly nil if that dimension is unlabelled) - correct?
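One way to picture that layout, as a sketch only: a wrapper map holding the data plus a vector with one lookup per dimension, where `nil` marks an unlabelled dimension. Nothing here is core.matrix API; `labelled`, `dim-index`, and `lget` are hypothetical names, the data is a nested vector, and a plain label-to-position map stands in for LabelIndex. Unlike the fallback strategy discussed earlier, this version is strict: a labelled dimension accepts only labels, and an unlabelled one accepts only positions.

```clojure
;; Hypothetical sketch; not the Dataset implementation in core.matrix.

(defn labelled
  "Wraps nested-vector data with one label lookup per dimension.
   dim-labels is a vector holding, per dimension, a vector of labels or nil."
  [data dim-labels]
  {:data data
   :labels (mapv #(when % (zipmap % (range))) dim-labels)})

(defn dim-index
  "Translates a selector for dimension dim into a position."
  [arr dim sel]
  (let [lookup (nth (:labels arr) dim)]
    (if lookup
      (or (lookup sel)
          (throw (ex-info "Unknown label" {:dimension dim :label sel})))
      sel)))                                   ; unlabelled dimension: sel is a position

(defn lget
  "Gets a single element, with one selector per dimension."
  [arr & sels]
  (get-in (:data arr)
          (map-indexed (fn [dim sel] (dim-index arr dim sel)) sels)))

;; Dimension 0 (rows) is unlabelled, dimension 1 (columns) is labelled.
(def ds (labelled [[1 100] [2 200]] [nil ["Id" "Population"]]))

(lget ds 1 "Population")
```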

metasoarous commented 9 years ago

Excellent; agreed. Leaving it as an implementation detail was my intention. And yes, the idea would be to have one LabelIndex per labeled dimension.

It's been a minute since I've worked on this, as I've had some other things going on. Hopefully I'll be able to pick it up again in a few weeks.