Open metasoarous opened 9 years ago
OK; starting to get a sense for where this is going, but have a few questions/comments.
- `[rownames colnames data]` arity? Or use `:rownames`, `:colnames` keyword arguments instead? I prefer the latter (passing `nil` if you want rownames but not colnames feels weird to me), but realize it would be a breaking change so am happy to defer.
- `select-columns` uses `.indexOf` to find the index. This has linear computational complexity, which for large data sets can become problematic if you do a lot of indexing (this turned out to be a big problem in my application). An alternative would be to use an array-map (of `name -> index`) instead. This gives quick access to both an ordered `rownames` and positional indices given label names. However, this comes at the cost of increased housekeeping when removing columns/rows or creating submatrices. So, what do you think the right approach is here? Should use of an array-map be the default implementation, a secondary implementation, or an optional wrapper?
- `PDimensionImplementation` can be deprecated.
It may be that we want to support higher dimensional datasets - so you can have labels on any dimension (not just rows and columns, which are dimensions 0 and 1 of a 2D matrix). Worth thinking about this option; it could be either the same as or different from the "wrapper" solution.
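To make the lookup-cost concern about `select-columns` concrete, here is a minimal sketch contrasting a linear `.indexOf` scan with a precomputed label-to-index map. The names `colnames`, `index-by-scan`, `build-index` and `col-index` are hypothetical and are not part of core.matrix or Incanter.

```clojure
;; Hypothetical column labels, for illustration only.
(def colnames ["Country" "Year" "Population"])

;; Linear scan: .indexOf walks the vector until it finds a match.
;; Returns nil when the label is absent.
(defn index-by-scan [labels label]
  (let [i (.indexOf ^java.util.List labels label)]
    (when-not (neg? i) i)))

;; Precomputed label -> index map: one pass to build, map lookups afterwards.
(defn build-index [labels]
  (zipmap labels (range)))

(def col-index (build-index colnames))

(index-by-scan colnames "Population") ; cost grows with the number of labels
(get col-index "Population")          ; a single map lookup
```

The housekeeping cost mentioned above shows up here too: removing or reordering a column means `col-index` has to be rebuilt or adjusted so the stored positions stay correct.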
On the labels vs. positional lookup:
I think that "try as name first, if not found then try as positional index (if it is a number), else fail with error" is the right strategy for selecting by name
I agree with everything else, but I'm not sure whether I like this behavior. If I know that I have an index, I'd use `mget` or something that requires indexes. If I use a function that expects labels, why would I expect it to turn a number into an index? Maybe there should just be an error if a label isn't found. This is one of those areas where there's a tradeoff between flexibility and being conducive to bugs. In Clojure, we tend to go for flexibility, and I prefer that, but I'm wondering whether in this case the benefits of flexibility outweigh the costs.
My main rationale for this is the ability to mix indices and labels for different dimensions. Something like `(select dataset 20 "Population")` or something like that.
I see the utility of that. OK.
Groovy.
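The "name first, then positional index, else error" strategy discussed above could be sketched roughly as follows. `resolve-selector` is a hypothetical helper, not an existing core.matrix function, and it assumes the labels for a dimension are held in a plain vector.

```clojure
;; Hypothetical helper, not part of core.matrix: resolves one selector
;; against the labels of a single dimension.
(defn resolve-selector [labels selector]
  (let [i (.indexOf ^java.util.List labels selector)]
    (cond
      ;; 1. The selector matches a label: use that label's position.
      (not (neg? i)) i
      ;; 2. Otherwise, an in-range integer is treated as a position.
      (and (integer? selector)
           (< -1 selector (count labels))) selector
      ;; 3. Anything else is an error.
      :else (throw (ex-info "No such label or index"
                            {:selector selector :labels labels})))))

(resolve-selector ["Country" "Year" "Population"] "Population") ; => 2 (by label)
(resolve-selector ["Country" "Year" "Population"] 1)            ; => 1 (by position)
```

The bug risk raised in the thread is visible in this sketch: if a dimension has numeric labels, a number that happens to match a label wins over the same number read as a position.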
A separate API for label/name specific functions sounds good. Perhaps `core.matrix.labels` for the API and `core.matrix.labels.impl` for the implementation work?
As for higher dimensional datasets, it seems like there are two ways to go:
`columns` vector in `DataSet`'s implementation).
I think that while the first implementation is wonkier, it may ultimately be a better approach, as it provides more flexibility as to the array types being used for the columns (not all array types would allow for column data of varying types). I'm happy to go either way on this though, and welcome your thoughts.
fwiw, R has separate dataframes (2D, with only column names doing any work), matrices (2D, with optional column and row names), and arrays (N-dimensional, again with optional names along any dimension). So there's a precedent for splitting data structures into a standard 2D structure (dataframes, matrices) and another N-dimensional kind of structure. On the other hand, this distinction seems very inelegant, and probably just has to do with the historical development of R. (Annoying, too, sometimes, because there are operations that are only designed for dataframes, and you have to create a new dataframe to use them--merely taking slices or averages of an array won't work.)
I'd really like to avoid having completely different APIs. Most API functions should be the same, and the few specialised API functions can live in the same namespaces just fine.
For the label-aware selecting, it may be sensible to extend the existing API in `clojure.core.matrix.select`, which offers more general selection functionality and could probably be made label-aware.
As for underlying implementations ("curled up" representation, etc.) - I think it is fine to have different kinds of implementations / wrappers. The API should remain the same however.
OK. I misunderstood what you meant up there in your last bullet a few messages back. I get you now though; Sounds good.
When the time comes I may ping you about which things should have separate functions, and which can be the same.
I realized that my suggestion of using `array-map` for fast bidirectional lookup was flawed, which is funny because I came to that conclusion before when developing towards an application-specific version of this. I forgot that lookup on that type grows linearly, so that's obviously no good...
Anyway, I've implemented a `LabelIndex` type to address this problem. It's a little more specific to the particular use case here than a general purpose fast, bidirectional lookup type would be (in particular, it assumes the things being mapped to on the "right" are numerical indices...), but I think it'll fit the bill. Not sure what the final semantics will look like yet; will have to adapt as we build out labeling functionality around it, and figure out what's needed. But I think it's a pretty good, clean start.
Would love your feedback on this, and will start working on other pieces soon.
I think the LabelIndex type looks sensible.
Important to make sure that it remains as an implementation detail however - we don't really want helper types like this leaking out via APIs (there are many different ways that you can implement labelling, so the API shouldn't preclude different choices)
So I assume the intention is to have a Dataset type with a vector of one LabelIndex for each dimension (or possibly nil if that dimension is unlabelled) - correct?
Excellent; Agreed, leaving it an implementation detail was my intention. And yes, the idea would be to have one LabelIndex per labeled dimension.
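As a rough illustration of "one label index per labeled dimension, `nil` for unlabelled ones", here is a minimal sketch. It is not the `LabelIndex` type mentioned above, whose code is not shown in this thread; `label-index`, `label->position`, `position->label` and `dim-labels` are hypothetical names, and the sketch assumes labels within a dimension are unique.

```clojure
;; Hypothetical stand-in for a bidirectional label index. The actual
;; LabelIndex type discussed in this thread may look quite different.
(defn label-index [labels]
  {:labels    (vec labels)               ; position -> label (vector lookup)
   :positions (zipmap labels (range))})  ; label -> position (map lookup)

(defn label->position [idx label]
  (get-in idx [:positions label]))

(defn position->label [idx i]
  (get-in idx [:labels i]))

;; One entry per dimension; nil marks an unlabelled dimension.
;; Here dimension 0 (rows) is unlabelled and dimension 1 (columns) is labelled.
(def dim-labels
  [nil
   (label-index ["Country" "Year" "Population"])])

(label->position (nth dim-labels 1) "Population") ; => 2
(position->label (nth dim-labels 1) 0)            ; => "Country"
```

Keeping this behind a couple of accessor functions is one way to let the representation change later without it leaking out through the API.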
It's been a minute since I've worked on this, as I've had some other things going on. Hopefully I'll be able to pick it up again in a few weeks.
The goal is to approach R's dimension naming capabilities. Vectors, matrices, and data frames all support naming. This can be quite useful for non-numerical indexing.
Started this issue on the incanter repo, and now moving here. This was @mikera's response:
Also see issue #193, equivalent to the third bullet above.