square / crossfilter

Fast n-dimensional filtering and grouping of records.
https://square.github.com/crossfilter/

Filtering options? #13

Closed homerlex closed 8 years ago

homerlex commented 12 years ago

I'm looking at the example on http://square.github.com/tesseract/ and thinking of different ways I'd like to be able to filter the data. Is there any possible way to filter on the flight cities? Let's say I want to just see data for flights from PHX to ONT. Or perhaps all flights that have a destination of SAN.

Of course we could filter the data that is returned from the server, but since I already have all the data loaded client side, it would be nice to be able to do this type of filtering client side.

Any thoughts/ideas on this?

mbostock commented 12 years ago

Sure, just set up dimensions on origin, destination, or route. For example, for flights from PHX:

var origin = flight.dimension(function(d) { return d.origin; });
origin.filterExact("PHX");

Or for flights from PHX to ONT:

var route = flight.dimension(function(d) { return d.origin + "-" + d.destination; });
route.filterExact("PHX-ONT");

cqcallaw commented 12 years ago

@mbostock do you have a recommendation for filtering on a non-continuous range (e.g. PHX and SMF but not SAN)? It seems like it'd be possible to do by assigning a dimension to each airport code, but I'm wondering if there's a Better Way.

mbostock commented 12 years ago

Crossfilter only supports filtering contiguous ranges at the moment. For categorical dimensions (such as airport codes) I think it would make sense to implement a different type of filter that can toggle arbitrary values rather than recording a contiguous range. So, I would fix that by adding a new feature. :)
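To make the idea of a toggle-style filter concrete, here is a minimal sketch in plain JavaScript of the state such a filter might track. It is independent of Crossfilter's internals: `createToggleFilter`, `toggle`, and `test` are hypothetical names, and the `flights` array is made-up sample data.

```javascript
// Hypothetical helper (not part of Crossfilter): tracks a set of selected
// categorical values and exposes a predicate over dimension values.
function createToggleFilter() {
  var selected = new Set();
  return {
    // Flip one value on or off; returns true if the value is now selected.
    toggle: function(value) {
      if (selected.has(value)) {
        selected.delete(value);
        return false;
      }
      selected.add(value);
      return true;
    },
    // With nothing selected, every value passes (no filter applied).
    test: function(value) {
      return selected.size === 0 || selected.has(value);
    }
  };
}

// Made-up stand-in for the flight records; only `origin` matters here.
var flights = [
  {origin: "PHX"}, {origin: "SMF"}, {origin: "SAN"}, {origin: "PHX"}
];

var originFilter = createToggleFilter();
originFilter.toggle("PHX");
originFilter.toggle("SMF");

// Keeps the PHX and SMF flights and drops SAN.
var kept = flights.filter(function(d) { return originFilter.test(d.origin); });
```

The state is a set of discrete values rather than a `[lo, hi)` pair, which is the difference from a contiguous range filter described above.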

cqcallaw commented 12 years ago

Noted. In the meantime, this has worked well for me so far:

// `flight` is the crossfilter instance, as in the earlier examples.
var originPHX = flight.dimension(function(d) { return d.origin === 'PHX'; });
//...
originPHX.filter(true); // use false to get all flights where the origin isn't PHX

This snippet seems quite susceptible to generalization, assuming there aren't performance concerns...
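One way the snippet above might generalize, as a rough sketch: build the boolean accessor from a field name and a list of wanted values. `memberOf` is a hypothetical helper, not a Crossfilter function, and the sample records are invented.

```javascript
// Hypothetical helper: builds an accessor that returns true when the
// record's field is one of the wanted values.
function memberOf(field, wanted) {
  return function(d) { return wanted.indexOf(d[field]) !== -1; };
}

var isPhxOrSmf = memberOf("origin", ["PHX", "SMF"]);

// Made-up records to exercise the accessor without Crossfilter.
var sample = [{origin: "PHX"}, {origin: "SAN"}, {origin: "SMF"}];
var flagged = sample.map(isPhxOrSmf); // one boolean per record

// With Crossfilter, the accessor would be used like the snippet above:
//   var originSet = flight.dimension(isPhxOrSmf);
//   originSet.filter(true);
```

Each distinct set of values needs its own dimension under this approach, so it only suits cases where the set of selections is small and changes rarely.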

Sigfried commented 12 years ago

But I thought there were performance concerns. There are limits to the number of dimensions, and making dimensions is expensive according to the documentation.

I was building a sizable piece of code on D3 and when I saw that crossfilter did a lot of stuff that I was building in my own data management code, I switched over, but this issue is really hitting me now. I've been trying to figure out how to get this functionality, but it does look impossible without adding the feature to the software as Mike suggested, and that looks too hard for me to try.

In my code there's going to be lots of turning on and off of various values of various categorical dimensions, so I probably made a mistake trying to use crossfilter, though I've learned a bunch by playing with it.

mbostock commented 12 years ago

The first part to better support categorical dimensions is deciding on an API, so you might consider that even if you don't feel comfortable tackling the implementation.

I think the first decision is whether we want to support this as "dimensions can have multiple filters" (perhaps that can be intersected or unioned), or as "dimensions can be either quantitative or ordinal", in which case the filters on an ordinal dimension are tracked as a set of discrete values, rather than a contiguous range.

Sigfried commented 12 years ago

Just to check that I'm not missing something important: you're using the word ordinal now rather than categorical. I guess all categorical dimensions can be considered ordinal by putting them in, e.g., alphabetical order. If there are implications beyond that for the word choice, I'm not catching them.

It may be uncommon but certainly not impossible that someone will want multiple filters on a quantitative dimension, so the idea of allowing those to be intersected or unioned is nice. But your suggestion above for filters that allow toggling of individual values is very appealing. I don't have a clear sense of the performance implications.

One of my use cases is that I'd like to perform some calculation on the values of dimension X for all combinations of specific values for dimensions A, B and C, and allowing this to happen quickly as the user sets filters on dimensions B, C and D. Right now it looks to me like I have to remember the filters on B, C and D (where each filter can have multiple values) while temporarily setting single-value filters on all the combinations of A, B and C.

Is that kind of use case something that you'd like crossfilter to be able to support? I'll think more about what might be a nice API and report back later.

mbostock commented 12 years ago

> I guess all categorical dimensions can be considered ordinal by putting them in, e.g., alphabetical order.

Yep, that's all I meant.

Sigfried commented 12 years ago

Did that use case make sense? Is it something you'd want to support?

cqcallaw commented 12 years ago

As an API consumer, I'm inclined to vote for Option B (ordinal dimensions) because I don't see a clean way to represent the union and intersection operations for Option A (multi-filter) without some sort of domain-specific query language or array messiness, particularly when the operations are mixed. It does seem possible, albeit awkward, to synthesize Option B's behavior with Option A by taking the union of a set of ranges that match the discrete boundaries plus or minus some tolerance.
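The "union of ranges plus or minus some tolerance" idea could look roughly like this for numeric values. This is a sketch only: `valuesToRanges` and `inAnyRange` are hypothetical helpers, the zip codes are invented, and the half-open `[lo, hi)` convention is chosen to mirror `filterRange`.

```javascript
// Hypothetical helpers, not Crossfilter API.

// Turn each discrete value v into a small half-open range [v - tol, v + tol).
function valuesToRanges(values, tolerance) {
  return values.map(function(v) { return [v - tolerance, v + tolerance]; });
}

// Predicate for the union of ranges: true if x falls in any [lo, hi).
function inAnyRange(ranges) {
  return function(x) {
    return ranges.some(function(r) { return r[0] <= x && x < r[1]; });
  };
}

var wantedZips = [94103, 94107];               // invented discrete values
var ranges = valuesToRanges(wantedZips, 0.5);  // [[94102.5, 94103.5], ...]
var matches = inAnyRange(ranges);

// Only the two wanted zip codes survive.
var result = [94102, 94103, 94105, 94107].filter(matches);
```

The tolerance has to be smaller than the gap between neighboring values, which is part of what makes this route awkward for string-valued categories.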

I must confess that I don't follow the description of @Sigfried's use case, so I can't say which option would fit that case better.

kpascual commented 12 years ago

I think option B (dimensions as ordinal/categorical) would have more practical usage than option A (multiple filters on dimension).

While I'm sure there are cases where you'd want to apply multiple filters on a dimension (e.g. lunch hour and dinner hour on a time of day dimension), I think filtering by particular categorical values is a much more common use case. Using the payments metaphor in the tutorial, I'd imagine a very common use case for a merchant would be to filter payments by multiple zip codes, states, or credit card types.

Just throwing this out there, but I was imagining an API of an optional is_categorical flag on the dimension, while using the existing filter() APIs.

mbostock commented 12 years ago

The existing API already supports categorical dimensions, provided you only need a single exact match (use filterExact). This issue is about allowing multiple values to be selected. We could enable that specifically for categorical dimensions, in which case the filter API would allow you to get or set multiple selected values. Or, we could figure out how to do it more generally for both categorical and quantitative dimensions. My guess is that enabling a different filter API for categorical dimensions would be less work and more convenient for the common use case. But, the general solution might be more powerful.

zackham commented 12 years ago

I'm taking a stab at this right now. Leaning toward the generic solution, but we'll see. I don't think we need support for intersections? The result of an intersection is going to be something that can be passed directly to filterRange today. Also, if we could enable a way to clone a dimension, you can perform intersections by applying the different filters to each dimension copy.

For the union of multiple filters, I'm just going to work on letting filter() take multiple arguments.
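The point about intersections can be shown with a small sketch: intersecting two contiguous ranges yields at most one contiguous range, which `filterRange` can already express. `intersectRanges` is a hypothetical helper, and the hour-of-day ranges are invented.

```javascript
// Hypothetical helper: intersect two half-open ranges [lo, hi).
// Returns null when the ranges do not overlap.
function intersectRanges(a, b) {
  var lo = Math.max(a[0], b[0]);
  var hi = Math.min(a[1], b[1]);
  return lo < hi ? [lo, hi] : null;
}

// Invented hour-of-day ranges.
var overlap = intersectRanges([11, 14], [12, 18]);  // [12, 14]
var disjoint = intersectRanges([11, 12], [17, 20]); // null

// The overlap is itself a single range, so it could be passed along as
//   hour.filterRange(overlap);
// where `hour` is an assumed dimension on the hour of day.
```

A union, by contrast, can produce several disjoint ranges, which is why it needs new support in the filter itself.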

ghost commented 12 years ago

Zackham,

I've been testing your branch and noticed something that seems strange to me, though perhaps it is the intended behavior. In your test you filter by the "total" dim and then get the data through the "date" dim, which gives the right answer. But if you check through the "total" dim you get the wrong answer: it is still filtered by the first variable only and not the union. Is this by design?

zackham commented 12 years ago

beefsoup,

Thanks for the second set of eyes. I was not expanding the hi0/lo0 range to include the additional ranges. This is fixed now and I also modified the test.

Sigfried commented 12 years ago

Hi Mike,

I'm not sure the best way to email you, but trying this.

I was wondering if you'd be interested in/willing to have a brief phone conversation about the future of visualization frameworks built on top of D3? My perspective is that I'm at a firm that does a lot of contracting work on an impressive array of scientific and administrative projects for the NIH, FDA, and other organizations in the public health and clinical science arenas, and I'm trying to build something general on top of D3 to allow us to navigate a wide range of disparate data and incorporate these visualizations into web apps.

I'm working on some ideas, which, a few weeks ago, led me to take up and then abandon Tesseract/Crossfilter as my way of managing and filtering data sets. The approach that I'm taking would probably seem pretty ugly to you (it seems ugly to me quite often): OOP class hierarchies of UI elements and data elements that allow me from the perspective of any piece of data to access methods relating to how it wants to be displayed (colored, sized, etc.), whether it's been filtered, who its parents and children are; also things like: when a chunk of data results from the intersection of two dimension values, and its display is partly based on methods related to a third dimension value, the thing figures out what to do where.

To some degree, I think what I'm trying to make (unfortunately, without sufficient experience and background), is an API to let myself and others make Spotfire/Tableau-like visualizations from RDBMS data. So the initial hierarchy of any of this data (for purposes of letting users assign columns to visualization dimensions and stuff) is: table name (or query name) --> column name --> column value --> result subset. Clearly in Crossfilter you're coming up with ways of addressing some of the same general issues.

Your data models in D3 and Crossfilter are nice and flat and clean and tie the data so closely to the visualization that all the logic about colors and inter-data-point calculations can be performed directly in the visualization code. That works well for making beautiful individual visualizations. In my case where I want to make more of a dashboard thing, with various visualizations all tied to the same or related underlying data, I think there are aspects of the data and its interrelationships that need help from classes and methods that cross visualization boundaries.

Anyway, I thought a conversation might be fruitful. What do you think?

Thanks, Sigfried

KobaKhit commented 10 years ago

Is there any update on this? It seems like a must-have feature, but I can't find a working implementation or workaround online.

jasondavies commented 10 years ago

At the moment the only real workaround is to use dimension.filterFunction.

KobaKhit commented 10 years ago

Hello jason, would this filter still be active if I apply group().reduceCount() on the filtered dimension? E.g.

var XDimension = ndx.dimension(function (d) { return d.Name; })
    .filterFunction(function (d) { return d === "Allyssa" || d === "Bob"; });
var YDimension = XDimension.group().reduceCount(); // reduceCount takes no arguments
...
dc.renderAll();

Here is my stackoverflow question for reference.

Thanks.

jezekjan commented 10 years ago

Hello, I'm struggling with the multi-filtering issue as well, as I'm trying to build a more general spatial filter on my data. I'm trying to use filterFunction, but it seems strange to me that it is called as many times as the total number of records, even if there are just a few unique dimension values. Is this a bug, or is there any workaround for that? What I'm trying to do is to implement a 'point in polygon' filter based on a dimension derived from Z-curve ordering. Thanks for any ideas.

dgerber commented 9 years ago

@jezekjan this makes filterFunction(f) evaluate f once per unique dimension value. (It still loops over all records, though.)

@@ -821,11 +821,13 @@ function crossfilter() {
       var i,
           k,
           x,
+          v = {}, // sentinel object: never === a dimension value, so f also runs for the first value
           added = [],
           removed = [];

       for (i = 0; i < n; ++i) {
-        if (!(filters[k = index[i]] & one) ^ !!(x = f(values[i], i))) {
+        if (values[i] !== v) x = f((v = values[i]), i);
+        if (!(filters[k = index[i]] & one) ^ !!x) {
           if (x) filters[k] &= zero, added.push(k);
           else filters[k] |= one, removed.push(k);
         }

updated after #129
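The idea behind the patch can be sketched outside Crossfilter, assuming (as the patch does) that the values are sorted so equal values sit next to each other. `evaluatePerRun` is a hypothetical stand-alone function, and the origin codes are sample data.

```javascript
// Hypothetical stand-alone version of the idea: `values` must be sorted so
// that equal values are adjacent.
function evaluatePerRun(values, f) {
  var results = new Array(values.length);
  var cached;
  for (var i = 0; i < values.length; ++i) {
    // Call f only at the start of each run of equal values.
    if (i === 0 || values[i] !== values[i - 1]) cached = !!f(values[i], i);
    results[i] = cached;
  }
  return results;
}

var calls = 0;
var sortedOrigins = ["ONT", "PHX", "PHX", "PHX", "SAN", "SAN"];
var flags = evaluatePerRun(sortedOrigins, function(v) {
  ++calls; // counts how often the predicate actually runs
  return v === "PHX";
});
// The loop still visits all six records, but the predicate runs once per
// unique value: three calls here.
```

This only pays off when the predicate is expensive relative to the loop, as with a point-in-polygon test.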

RandomEtc commented 8 years ago

As discussed in #151 an active fork is being developed in a new Crossfilter Organization. Please take further discussion there (if you haven't already) where it should be warmly welcomed by the new maintainers. Cheers!