Performance improvements for entityNearest on Line plots

drglitch commented 9 years ago

For lineplots, Plots.Line.prototype.entityNearest is called in a multitude of usecases when setting up interactions - these include highlighting closest point, tooltips, hovers, etc (e.g. sync lineplots example).

The in current implementation, the search algorithm retrieves full entity list (which could be improved too) and then iterates over each and every datum, even if it clearly already found the closest one.

For larger datasets, e.g. 4 lines of 2,500 points each, this creates noticeable lag on interaction/redraws.

Following is a proposal to create an improved algo that uses plot.width() and queryPoint.x to create an initial guess and then iteratively walk left and right (depending on distance) from there. It makes a key assumption that dataset.entities() produces sorted values when xScale is set is this correct?

Further, the algo can be improved by calculating true eucledian/pythagorean distance to each datapoint, but this can break if xScale sort doesn't take y coords into account (which it does not).

Performance: for a chart thats 1000px wide and has 10K datapoints, this reduces a single lookup from iterating over 10K points to <100, cutting time from 400ms to ~30ms on a test machine.

Proposed code follows... would love to hear your thoughts on this as well as how this can relate to improving complex interactions.


Plottable.Plots.Line.prototype.entityNearest = function (queryPoint) {
    var _this = this;

    // proposed fix assumes that plot datasets are always sorted due to them being aligned to a scale
    // which is sorted already...

    // The search is then performed by determining best guess based on cursor position
    // and walking left/right from it until HORIZONTAL distance starts increasing.
    // Eucledian distance is not used in case all .x values are identical - then fall back to worst-case full scan.

    // Assumptions:
    // queryPoint.x is always relative to this._origin.x
    // this.width() is greater than zero

    // TODO
    // Add this into Plottable.Utils.Array?

    var entities = this.entities();
    if (!entities.length) { return; }

    var startIdx = Math.floor(entities.length * (queryPoint.x / this.width()));
    startIdx = (startIdx < 0) ? 0 : ((startIdx >= entities.length) ? entities.length-1 : startIdx); // edgecase for when the mouse exists the chart off canvas

    var guess = entities[startIdx];

    var minXDist = Infinity;
    var minYDist = Infinity;
    var closest;

    var searchDomain = [guess];

    while (searchDomain.length) {
        var entity = searchDomain.pop();
        if (typeof entity === "undefined") {
            continue;
        }

        var leftIdx = (entity.index > 0) ? entity.index-1 : 0;
        var rightIdx = (entity.index < entities.length-1) ? entity.index+1 : entity.index;

        if (!_this._entityVisibleOnPlot(entity.position, entity.datum, entity.index, entity.dataset)) {
            if (leftIdx > 0) { searchDomain.push(entities[leftIdx]); }
            if (rightIdx < entities.length) { searchDomain.push(entities[rightIdx]); }
            continue;
        }

        var xDist = Math.abs(queryPoint.x - entity.position.x);
        var yDist = Math.abs(queryPoint.y - entity.position.y);
        if (xDist < minXDist || xDist === minXDist && yDist < minYDist) {
            minXDist = xDist;
            minYDist = yDist;

            if (!(closest) || ((leftIdx > 0) && (entity.index < closest.index))) {
                searchDomain.push(entities[leftIdx]);
            }
            if (!(closest) || ((rightIdx < entities.length) && (entity.index > closest.index))) {
                searchDomain.push(entities[rightIdx]);
            }

            closest = entity;
        }
    }

    return closest;
};

jtlan commented 9 years ago

@drglitch,

This is pretty cool. I'll note that calling entities() still takes O(n) -- but you said this code has been tested and produces a noticeable improvement? We'd love to see a Pull Request with this feature.

It makes a key assumption that dataset.entities() produces sorted values when xScale is set is this correct?

The ordering on entities() depends on the original order of the data. However, in almost all use cases for Line the data is sorted by X, and entitiesNearest() returns the nearest Entity by X then Y.

drglitch commented 9 years ago

@jtlan

Thanks! Would love to implement this and will submit a pull request as soon as i get my TS environment back up and running :)

My primary concern is that the dataset may be unsorted - because since above algo relies on the dataset being sorted:

in reversely-sorted datasets, this would produce correct result but scan the entities backward, which is bad but not terrible.
in fully-unsorted datasets, you may encounter a condition where second point might be closer than 1st and 3rd, but not as close as 4rd, but you'd never get to the 4th, providing essentially an undeterministic result.

With that in mind, what are your thoughts on cleanly getting this in? Either though a config flag or similar? Of course, can always just do a big docstring warning, but that just sounds like asking for trouble.

An alternative, albeit a complicated one, is having a fall-back: if we assume dataset to be evenly distributed, and our "guess" falls outside of 5% threshold from the mean expected value, we assume the dataset to be unsorted and sort it. We can also randomly sample a dataset to see if its sorted.

Lastly, speaking of entities() - i think same question (whether dataset is sorted) is important because then they can be partitioned based on chart visible domain and screen size. As a simpler solution, what about caching intermediary values (_lightweightEntities) and rebuilding only on dataset.onupdate? Then can return them by reference and avoid rebuild.

drglitch commented 9 years ago

After poking around this stuff a bit more, got some new thoughts and proposals:

propose subclassing Dataset to SortedDataset - and if that's the case, using new code path.
alternatively, propose making all Datasets sorted internally by default, and having just 1 code path.
for dataset sorting, quicksort can be used, which in case of dataset being sorted would be O(n) but ran only ONCE upon dataset instantiation/setting data.
original code had some nasty edgecases; below is a rewritten Line.entityNearest that uses a modified binary search to find the nearest datum (assuming sorted dataset). tested thoroughly.
A similar approach, but using either eucledian or manhattan distance can be applied for Scatter plots btw.
Another minor improvement is not generating _lightweightPlotEntityToPlotEntity for each item, but rather only for the found closest item (which is what Plot.entityNearest does already anyway

As far as O(n) on .entities():

caching them won't be super easy because the cache would have to be reset on every PanZoom - as the interaction relies on having the domain reset on each computation of position property.
... so still thinking about this more.

Would love to hear more feedback!

Plottable.SortedDataset = (function() {
    function SortedDataset(data, metadata, compareFunction) {
        var sorted = data.slice().sort(compareFunction);
        this._super.constructor.call(this, sorted, metadata);
    }

    SortedDataset.prototype = new Plottable.Dataset();
    SortedDataset.prototype._super = Plottable.Dataset.prototype;

    SortedDataset.prototype.data = function data(data, compareFunction) {
        if (data != null) {
            data = data.slice().sort(compareFunction);
        }
        return this._super.data.call(this, data);
    }

    return SortedDataset;
})();

Plottable.Plots.Line.prototype.entityNearest = function (queryPoint) {
    // assumes that dataset is sorted and performs a modified binary search to determine closest
    // point with X axis taking priority over Y axis (otherwise, should use eucledian or manhattan distance)

    var _this = this;

    var entities = this._lightweightEntities();
    if (!entities.length) { return; }
    if (entities.length == 1) { return this._lightweightPlotEntityToPlotEntity(entities[0]); }

    var lo = 0,
        hi = entities.length -1,
        mid,
        minXDist = Infinity,
        minYDist = Infinity;

    var closest = entities[0];

    while (lo <= hi) {
        mid = Math.floor((lo + hi) / 2, 10);

        var xDist = Math.abs(queryPoint.x - entities[mid].position.x);
        var yDist = Math.abs(queryPoint.y - entities[mid].position.y);

        if ((xDist < minXDist) || ((xDist == minXDist) && yDist < minYDist)) {
            minXDist = xDist;
            minYDist = yDist;
            closest = entities[mid];
        }
        if (queryPoint.x >= entities[mid].position.x) {
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }

    return this._lightweightPlotEntityToPlotEntity(closest);
};

bluong commented 9 years ago

Hello @drglitch !

It seems that there is underlying assumption that the Dataset must be sorted in order for this code to work. Just curious on why not check if the Dataset is sorted beforehand instead of needing to create a subclass for it. As in, have 2 code paths for entityNearest(): one for sorted and one for unsorted?

jtlan commented 9 years ago

My primary concern is that the dataset may be unsorted

SortedDataset is a neat trick for this, but the Plot has no guarantee that it's operating on a SortedDataset? What if we cached the data in sorted order inside the Plot itself, and updated when x()/y() were updated or the Datasets were changed?

caching [entities] won't be super easy because the cache would have to be reset on every PanZoom - as the interaction relies on having the domain reset on each computation of position property.

Maybe we could have a method that generates position from a given datum, which would allow us to set position on-demand?

drglitch commented 9 years ago

@bluong i agree that having two code paths is suboptimal. The only reason would be backward compatibility - otherwise datasets would be required to be sorted (otherwise algo above produces garbage result). Would you be ok with making that change? As mentioned, if quicksort is used on dataset creation, then pre-sorted datasets would incur very little penalty (essentially a shallow copy) upon creation.

@jtlan one hack would be to add a private marker to .metadata() if its sorted. Though i would really like the idea of just caching a sorted ds inside Plot() as long you are cool with it :)

I'll take another look into the position determination - doing so on a per-datum basis could certainly work - then we can effectively iterate only over visible/relevant datums and greatly reduce the number of elements that need to be iterated.

Will report back when i get my typescript environment up!

bluong commented 9 years ago

Oh @drglitch I think I was actually thinking of being ok with having 2 different code paths. It is suboptimal since a catch-all solution is definitely more cleaner, but a more performant path for a very specific scenario also makes sense to me. As long as the code is documented well, this seems fine with me :)

palantir / plottable

Performance improvements for entityNearest on Line plots #2885