square / crossfilter

Fast n-dimensional filtering and grouping of records.
https://square.github.com/crossfilter/
Other
6.22k stars 1.31k forks source link

String search #107

Closed ronkorving closed 8 years ago

ronkorving commented 10 years ago

Hi. I love crossfilter and have recently come across a situation where text search has become a requirement on my data set. I know that's not a trivial thing, but I was wondering if any work, experimentation or discussion has occurred that I should be aware of. Is this something that could or will be added some day?

protobi commented 10 years ago

Concur that text search, like a RegEx filter, would be great. As a newbie looking at the code from the outside, it appears that dimensions are represented as sorted arrays of column values, and filters track the min and max index of values that satisfy it. Thus it seems range filters like dictionary order (case sensitive) with upper and lower bounds would be easy (e.g. "Bee" <= x < "Bar") would match. But text match could yield arbitrary subsets which would require rethinking how filters are applied.

mtraynham commented 10 years ago

You can use dimension.filterFunction(function(d) {...}). See https://github.com/square/crossfilter/wiki/API-Reference#wiki-dimension_filterFunction

As an example:

// Match strings of length 5 or greater that only include letters and numbers.
var regex = /^([a-z0-9]{5,})$/; 
var data = [{foo: "abc1"}, {foo: "abc12"}, {foo: "abc123"}]

var index = crossfilter(data);
var fooAccessor = function(d) { return d.foo; };
var dimension = index.dimension(fooAccessor);

// d is the dimension value;  return false for exclude
dimension.filterFunction(function(d) {
    return regex.test(d);
});

Based on the above example, this matches records 2 & 3. This allows you to filter specific records based on something other than exact match or range match.

Albeit this does not take advantage of Crossfilter's bisect functionality which can binary search the sorted data (as used in range filtering or exact match). Thus, this can be much slower for larger data sets and/or execution complexity of the filter function.

As @gradualstudent aluded to, if you are prefix matching, then filterRange would be the way to go. You would create a range that encompasses all values past the letters presented by the user. For instance, if user typed 'Bee', you would set the start of the range to the entered text. The range end would be the entered text, plus the lowest naturally sorted unicode character. That would allow you to encompass all values that start with 'Bee'.

['Bee', 'Bee\uFFFF']

I have not tried any of this before, so correct me if I'm wrong.

RandomEtc commented 8 years ago

Thanks for your contributions and sorry for silence on this side. As discussed in #151 an active fork is being developed in a new Crossfilter Organization. Please take further discussion there (if you haven't already) where it should be warmly welcomed by the new maintainers. Cheers!