square / crossfilter

Fast n-dimensional filtering and grouping of records.
https://square.github.com/crossfilter/
Other
6.22k stars 1.31k forks source link

Fix bug in premature filter bitset widening algorithm #158

Closed SoftwareDreamer closed 9 years ago

SoftwareDreamer commented 9 years ago

Interesting enough, this went unnoticed for a long time.. When I was scanning the codebase to understand how crossfilter works, I saw this code in src\crossfilter.js line 106:

    if (M >= 32 ? !one : m & (1 << M) - 1) {
      filters = crossfilter_arrayWiden(filters, M <<= 1);
    }

This seemed odd to me, so I added an

console.warn("widen array to " + M + " bits / record");

after the filters = line, and sure enough the output while creating the first two dimensions was

widen array to 16 bits / record
widen array to 32 bits / record

This is clearly a bug, there will be an associated pull request with the bugfix:

    if (M >= 32 ? !one : (m > ((1 << M) - 1))) {
      filters = crossfilter_arrayWiden(filters, M <<= 1);
    }

The result is now, as expected, that the array only gets widened on the 9th and 17th filter, thus saving memory and CPU for filter counts < 17. This had a very noticable perf impact on our big dataset (up to 4 mio. uservotes for a german voting TV show).

jasondavies commented 9 years ago

Thanks! This went unnoticed for two years and five months. I think you deserve a prize!

SoftwareDreamer commented 9 years ago

It's my first contribution to an OS project on Github, I'm feeling kind of good right now. :-)

The price would be easier ways to contribute.. It's pretty hard to contribute as a windows dev, would you accept pull requests for a gulpfile (a makefile is super windows-dev unfriendly, almost kept me from sending the PR)? An maybe use the new contributor license feature of Github instead of the google docs one (not nearly as big of an issue as the makefile)?

jasondavies commented 9 years ago

Rather than using gulp, since the build process is quite simple I suppose it might be preferable to have a pure Node-based build without relying on heavyweight dependencies.

Which new contributor license feature are you referring to?

SoftwareDreamer commented 9 years ago

The feature I meant is somewhat different to what I found after some search (CONTRIBUTING.md, you already use it but I didn't notice the popup on the PR): https://github.com/blog/1184-contributing-guidelines I thought there was support for actively agreeing to a contributors license on a PR, but that might have been on another platform.

Hrm, don't know about the pure node-based build, Grunt/gulp became widely used in many JS projects like a year ago (that's at least my perception), personally I never saw gulp as a "fat" dependency, but that's maybe a matter of taste. There are already dependencies on e.g. uglify, and I never had problems with gulp (and I use it a lot for my little spikes, which essentially use gulp for a simple livereload watch and minification/scss etc..)

Anyways, if non-gulp is a must, I would look into the pure node.js script, even if it probably gets bigger without the grunt/gulp plugin ecosystem at hand, as long as it means windows-devs have a nice onramp experience with the project.

SoftwareDreamer commented 9 years ago

Ok, probably it was https://www.clahub.com/ This seems like a nicer way of doing it, even though it means a dependency on clahub.com instaed of on google docs. The integration in Github is really nice..