leeoniya / uPlot

📈 A small, fast chart for time series, lines, areas, ohlc & bars
MIT License
8.82k stars 386 forks

scatter plots #107

Open leeoniya opened 4 years ago

leeoniya commented 4 years ago

uPlot.Scatter

e.g. https://academy.datawrapper.de/article/65-how-to-create-a-scatter-plot

spatial index via e.g. https://github.com/mourner/kdbush or https://github.com/mourner/flatbush

data format e.g.:

[
  [x,y,v,l,x,y,v,l],  // series 1
  [x,y,v,l,x,y,v,l],  // series 2
  [x,y,v,l,x,y,v,l],  // series 3
]

should use a Path2D shape cache
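
As a rough illustration of the interleaved layout proposed above, here is a minimal sketch of walking one series with a stride of 4. The helper name `forEachPoint` and the sample values are made up for this example and are not part of uPlot.

```javascript
// Hypothetical helper: walk one interleaved series [x, y, v, l, x, y, v, l, ...].
const STRIDE = 4; // x, y, value (point size), label

function forEachPoint(series, fn) {
  for (let i = 0; i < series.length; i += STRIDE) {
    // Pass the four fields plus the logical point index.
    fn(series[i], series[i + 1], series[i + 2], series[i + 3], i / STRIDE);
  }
}

// Made-up sample data: two points in one series.
const series1 = [1, 10, 2.2, "a", 2, 20, 1.5, "b"];
const unpacked = [];
forEachPoint(series1, (x, y, v, l) => unpacked.push({ x, y, v, l }));
```

One consequence of this layout is that string labels share an array with numbers, so the series cannot be stored in a typed array.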

ryantxu commented 4 years ago

I don't understand why this would need a different data format/layout from the line variant? I would expect something like:

data:[
  [a1,a2,a3,...],  
  [b1,b2,b3,...], 
  [c1,c2,c3,...], 
]
config: {
  x: 0, // a
  y: 1, // b
  label: 2, // c
}

or something

leeoniya commented 4 years ago

if you look at the linked example, each point needs to encode at least x,y and the size of the point (i'm calling it "v"alue). i guess the per-point label can be left out, but i feel like it's pretty fundamental to scatter plots. in addition to this, my gut feeling is that scatterplots are less likely to be easily alignable than line charts (the majority of which are time series). if they cannot be aligned, then i cannot do a simple binary search as i do with uPlot.Line, and need a spatial index (quadtree, kd tree, etc.).
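
To make the binary-search point concrete, here is a minimal sketch of finding the closest index in a sorted x array. It is not uPlot's actual implementation; `closestIdx` is a hypothetical name, and it assumes a non-empty array sorted in ascending order, which is exactly the property unaligned scatter data lacks.

```javascript
// Hypothetical sketch: index of the value in sorted `xs` closest to `x`.
// Assumes `xs` is non-empty and sorted ascending.
function closestIdx(xs, x) {
  let lo = 0;
  let hi = xs.length - 1;

  // Narrow [lo, hi] until the two candidates are adjacent.
  while (hi - lo > 1) {
    const mid = (lo + hi) >> 1;
    if (xs[mid] < x) lo = mid;
    else hi = mid;
  }

  // Pick whichever neighbour is nearer to x.
  return x - xs[lo] <= xs[hi] - x ? lo : hi;
}
```

This runs in O(log n) per lookup but only answers "which x is closest"; finding the nearest point in both x and y needs a linear scan or a spatial index.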

if your a1, a2 and a3 are objects, then 100k of these will take up a lot of memory. to avoid this, uPlot sticks to using flat arrays.
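
A minimal sketch of the flat-array idea, assuming the input arrives as an array of `{x, y}` objects: the helper below (a hypothetical name, not a uPlot function) copies the fields into two `Float64Array` columns, which cost 8 bytes per value instead of one heap object per point.

```javascript
// Hypothetical helper: convert [{x, y}, ...] into two flat numeric columns.
function toColumns(points) {
  const n = points.length;
  const xs = new Float64Array(n);
  const ys = new Float64Array(n);

  for (let i = 0; i < n; i++) {
    xs[i] = points[i].x;
    ys[i] = points[i].y;
  }

  return [xs, ys];
}
```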

fundamentally, i think scatter plots are different enough to justify a different data format for performance reasons. a lot of charting libs fall for the temptation of complete uniformity across chart types, or more human-friendly formats and pay for it with performance; there's a reason why uPlot.Line is as fast as it is.

ryantxu commented 4 years ago

Obviously many ways to skin this ;) I like your columnar approach in the line chart.

Plotly uses something worth looking at -- essentially a vector for each of the attributes: x,y,size,text, etc: https://plot.ly/javascript/line-and-scatter/#data-labels-on-the-plot

trace1 = {
  x: [1, 2, 3, 4, 5],
  y: [1, 6, 3, 6, 1],
  mode: 'markers+text',
  type: 'scatter',
  name: 'Team A',
  text: ['A-1', 'A-2', 'A-3', 'A-4', 'A-5'],
  textposition: 'top center',
  textfont: {
    family:  'Raleway, sans-serif'
  },
  marker: { size: 12 }
};

leeoniya commented 4 years ago

yea, that feels right...perhaps with a tweaked uPlot take:

[
  [  // series 1
     [1,2,3],  // x
     [10,20,30],  // y
     [2.2,1.5,6.5], // v
     ["a","b","c"],  // l
  ],
  [  // series 2
     [1,2,3],  // x
     [10,20,30],  // y
     [2.2,1.5,6.5], // v
     ["a","b","c"],  // l
  ]
]

nice benefit is we can use typed arrays too, and the value and label arrays can be optional in a series.
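
Here is a minimal sketch of reading one point from the per-series columnar layout above when the value and label arrays are optional. The accessor name `pointAt` and the default size of 5 are assumptions for illustration only.

```javascript
// Hypothetical accessor for a series shaped [xs, ys, vs?, ls?].
// `vs` (sizes) and `ls` (labels) may be omitted entirely.
function pointAt(series, i, defaultSize = 5) {
  const [xs, ys, vs, ls] = series;
  return {
    x: xs[i],
    y: ys[i],
    v: vs != null ? vs[i] : defaultSize, // fall back to a uniform size
    l: ls != null ? ls[i] : null,        // no label when the array is absent
  };
}

// x and y can be typed arrays; labels stay in a plain array.
const minimalSeries = [Float64Array.of(1, 2, 3), Float64Array.of(10, 20, 30)];
const fullSeries = [
  Float64Array.of(1, 2, 3),
  Float64Array.of(10, 20, 30),
  Float64Array.of(2.2, 1.5, 6.5),
  ["a", "b", "c"],
];
```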

ryantxu commented 4 years ago

FYI, what you list above is essentially the native grafana data format -- that is optionally backed by apache arrow tables

backspaces commented 4 years ago

Not to be a grumpy old man (which I am!), I'd favor scatter plots only if they do not significantly increase the size of the library. I say this as someone currently using chartjs and desperately needing a smaller library like yours!

How in the world do you keep it so small! Sounds like a good medium article.

leeoniya commented 4 years ago

@backspaces

scatter should not add much code (definitely still within 30K).

also, scatter would be feature-gated like many of uPlot's features, which can be compiled out:

https://github.com/leeoniya/uPlot/blob/c912bec8de728c1ba6f82f8fe903c2a100fc7325/rollup.config.js#L36-L43

i'm not sure if i'll even need a spatial index, where the indexing costs may outweigh the querying costs. most scatter plots are within 1k points, so a dumb linear scan might be quite sufficient. if i do ingest a spatial index, i'm able to get flatbush down to 3.66 KB [1] and kdbush [2] is even smaller if i don't account for variable point diameters when testing cursor proximity.
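
The "dumb linear scan" mentioned above could look roughly like the sketch below. It assumes the point and cursor coordinates are already in the same (pixel) space and that all points share one size; `nearestPoint` is a hypothetical name.

```javascript
// Hypothetical sketch: index of the point nearest the cursor, or -1 if empty.
function nearestPoint(xs, ys, cx, cy) {
  let best = -1;
  let bestD2 = Infinity;

  for (let i = 0; i < xs.length; i++) {
    const dx = xs[i] - cx;
    const dy = ys[i] - cy;
    const d2 = dx * dx + dy * dy; // squared distance avoids a sqrt per point

    if (d2 < bestD2) {
      bestD2 = d2;
      best = i;
    }
  }

  return best;
}
```

At around 1k points this is a few thousand arithmetic operations per mouse move and needs no index build step.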

the main issue here is one of different data layout. a lot of internal util functions & loops assume aligned data across series. i would need to create an additional branch in every place that references data i0 and i1, which would make for much messier code - you can see how often those are used by searching [3]. i'm going to prototype this over the next week or two to find a path forward. i feel like scatter and log scales are the last significant missing pieces in uPlot.

How in the world do you keep it so small! Sounds like a good medium article.

i have the opposite question: how are the other libraries so huge!? some include data parsing & statistical aggregation, animations, declarative options for every possible combo of desires and styles. some have complex area fillers for stacked series and include radar, donut, pie, and other chart types. if they do timezone or DST handling, many include Luxon or Moment (which are huge), while uPlot relies on a neat hack i found [4] (but does not support IE11). Another reason is that uPlot is monolithic and not prototype or class based - many things live inside the uPlot constructor closure, so most variables can be mangled and minified without issue; uPlot only publicly exposes what the user-facing API needs. the drawback to this of course is that a lot of the code has to live in one giant 2K LOC file [5] and there is some difficulty around adding an additional data format like scatter. Chart.js, for example, is more java-esque where every component is a derived class, but must expose all its non-minifiable innards as private APIs. as with everything, trade-offs abound.

[1] https://github.com/mourner/flatbush/issues/27#issuecomment-600848786
[2] https://github.com/mourner/kdbush
[3] https://github.com/leeoniya/uPlot/blob/master/dist/uPlot.esm.js
[4] https://github.com/leeoniya/uPlot/blob/master/src/fmtDate.js#L126
[5] https://github.com/leeoniya/uPlot/blob/master/src/uPlot.js

leeoniya commented 4 years ago

sneak peek

30,000 scatter points (10k per series) in 100ms:

uplot-scatter

this is without building a spatial point index, which turns out to be a fairly expensive task (~40ms for kdbush).

if the points were square (instead of circles) and pixel-aligned, this would likely run 30-50% faster since there'd be no need for anti-aliasing.

leeoniya commented 4 years ago

rendering solid circles instead of hollow circles shaves 20-25ms (since hollow circles are stroked and filled by separate Path2D objects).

leeoniya commented 4 years ago

ok, so i optimized point rendering in 12b6beb23d58bfab52c8af052c6e80cd23a267f7, to use a single Path2D.

since i had no baseline for whether 75ms was slow or fast for a 30k point scatter plot, i tried the same with the next-fastest lib (Chart.js 3.0 alpha). spoiler: it was never actually 75ms (that's just JS exec time):

uPlot:

Chart.js 3.0 alpha:

so about 2x as fast for init (but still without building a spatial index). and much faster for series toggle, even with rebuilding all paths. once i cache the Path2D objects, toggle will be faster still (by a lot).

another question is whether the design should accommodate multi-scale scatter plots. these are not very common, but they do exist:

image

tboerstad commented 4 years ago

I discovered uPlot lately on Hacker News, and was really impressed by the speed.

I have a project where speed has been important, a web page for creating scatter plots from CSV files. I'm currently using Plotly.js, with webgl option, and it's been quite performant, but not anything near uPlot. Plotly also has a lot more bells and whistles than I need.

I just wanted to let everyone know that there exists an excited user who is waiting for scatter plots!

If people reading this are shaking their heads at CSV and performance in the same sentence, they'd be correct, as CSV parsing is currently the slowest part of the process.

leeoniya commented 4 years ago

@tboerstad csvplot looks cool :)

i paused on scatter when i ran into various questions about how and if y auto-scaling should work and some other internals that assume aligned data, etc. - quite a lot of the core needs to be if'd away for scatter/bubble plots.

i'd like to figure out #184 before moving forward here.

backspaces commented 3 years ago

Just a clarification: we may want to include "points" in this: Canvas 2D-based chart for plotting time series, lines, areas, ohlc & bars;

I needed a points graph, i.e. sorting x,y pairs. By simply sorting by x, then extracting the x, y arrays, it worked fine. And adjacent x values can be the same, no issue.
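
The sort-then-split workaround described above can be sketched as follows. `sortByX` is a hypothetical helper; it returns the two arrays in the `[xs, ys]` order used by the line-chart data format discussed earlier in the thread.

```javascript
// Hypothetical helper: sort [x, y] pairs by x, then split into aligned columns.
function sortByX(pairs) {
  // Copy first so the caller's array is left untouched.
  const sorted = pairs.slice().sort((a, b) => a[0] - b[0]);
  return [sorted.map((p) => p[0]), sorted.map((p) => p[1])];
}
```

Pairs with equal x keep their original relative order, since `Array.prototype.sort` is stable in ES2019 and later.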

So the distinction between "points" and "scatter" graphs might be useful to make.

leeoniya commented 3 years ago

And adjacent x values can be the same, no issue.

well, mostly. you'll run into issues with being unable to hover a point properly and zooming will get wonky if either edge ends up in a same-xs territory since there's a lot internally that relies on a binary search which will fail to converge to one value. for scatter, you also expect the cursor to work by cursor proximity / one point at a time, which it obviously doesn't do in the current x-oriented mode. also, the x-auto-range won't add a nice extra buffer as it does when auto-scaling y.

for static / non-interactive charts i think it'll work fine, though.

hkang1 commented 3 years ago

hey @leeoniya, big fan of uPlot! we incorporated it into Determined - open source deep learning platform for visualizing large datasets and it's been very performant! We are just looking into scatter plots as well, and could really use it if and when it becomes available.

We're able to replicate a scatter plot with the current form of uPlot as seen below:

Screen Shot 2021-02-08 at 5 31 38 PM

But we don't have the ability to render the points differently via size or color (fill) depending on the point value. Here is an example of what we are trying to achieve (apologies for the really poor resolution on this image!):

Screen Shot 2021-02-08 at 5 38 31 PM

One thought is to update the series[n].points properties (show, fill, size, space, stroke, width) to accept a callback for each. So to render different sizes based on value, series[n].points.size can take a callback that is in a form of something like:

(self: uPlot, seriesIdx: number, pointIdx: number): number | undefined => {
  // use the `pointIdx` to get the data value and return a different size based on the value.
  ...
};

Let me know what you think
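
To illustrate the proposal, here is a minimal sketch of a size callback with the suggested `(self, seriesIdx, pointIdx)` signature. This signature is the hypothetical one from the comment above, not an existing uPlot API, and `makeSizeFn` and the linear value-to-pixel mapping are likewise assumptions.

```javascript
// Hypothetical factory for a per-point size callback.
// Maps each value linearly into [minPx, maxPx].
function makeSizeFn(values, minPx, maxPx) {
  let lo = Infinity;
  let hi = -Infinity;
  for (const v of values) {
    if (v == null) continue; // skip gaps when finding the value range
    if (v < lo) lo = v;
    if (v > hi) hi = v;
  }
  const span = hi - lo;

  return (self, seriesIdx, pointIdx) => {
    const v = values[pointIdx];
    if (v == null) return undefined; // caller falls back to its default size
    // When all values are equal, use the midpoint instead of dividing by zero.
    const t = span === 0 ? 0.5 : (v - lo) / span;
    return minPx + t * (maxPx - minPx);
  };
}
```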

leeoniya commented 3 years ago

hey @hkang1 , there's actually quite a bit more work here than meets the eye. hover points would also need to take this into account during interaction, not just draw. point size is used to determine whether or not to show or hide points based on data density, so that needs to be tweaked. once you start getting into larger points (like a bubble scatter chart), you need to actually have hover detection for the circle boundaries. and a lot more stuff. i'd like to avoid making a partial, non-holistic api adjustment that only solves a small part of the issue, and maybe not in an optimal way.
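
As one example of the extra work, hover detection against circle boundaries might look like the sketch below. It assumes pixel-space coordinates and per-point radii, and it prefers the smallest bubble under the cursor so that small bubbles overlapping large ones stay reachable; that tie-break rule and the name `hitTest` are assumptions, not uPlot behavior.

```javascript
// Hypothetical sketch: index of the smallest bubble containing the cursor, or -1.
function hitTest(xs, ys, radii, cx, cy) {
  let hit = -1;
  let hitR = Infinity;

  for (let i = 0; i < xs.length; i++) {
    const dx = xs[i] - cx;
    const dy = ys[i] - cy;
    const r = radii[i];

    // Inside the circle if the squared distance is within the squared radius.
    if (dx * dx + dy * dy <= r * r && r < hitR) {
      hit = i;
      hitR = r;
    }
  }

  return hit;
}
```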

if it's sufficient for you to simply adjust sizes, you can just implement a custom points renderer that handles size variation: https://leeoniya.github.io/uPlot/demos/draw-hooks.html. but unless you only need this for static charts, i think you'll find that this alone leaves a lot to be desired.

i'm fairly confident that proper scatter and bubble support will land in the next 6 months, but cannot give a definitive timeline for it yet.

jjech commented 3 years ago

Any news on the scatterplot roadmap?

We've been able to make do with dygraphs for line graphs and Chart.js for scatterplots, but migrating to uPlot would greatly improve the user experience for our tool!

ghost commented 3 years ago

COSMOS is looking forward to adding X/Y support to our graphing tool. Thanks for such an awesome graphing library!

leeoniya commented 3 years ago

there is some support for this now (still being refined) via mode: 2 and series.facets api.

here's a demo, for early testing: https://leeoniya.github.io/uPlot/demos/scatter.html

a lot of the implementation is still done in userland (such as quadtree construction and path renderer).
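
For orientation, the data snippets posted elsewhere in this thread for `mode: 2` use a leading `null` followed by one array of per-facet columns per series. Below is a minimal sketch that builds that shape; `toFacetData` and the input field names (`x`, `y`, `size`, `label`) are hypothetical, and the exact set of columns depends on the userland renderer.

```javascript
// Hypothetical helper: build [null, [xs, ys, sizes, labels], ...] from series objects.
function toFacetData(seriesList) {
  const out = [null]; // slot 0 is null in the snippets shown in this thread

  for (const s of seriesList) {
    const n = s.x.length;
    if (s.y.length !== n || s.size.length !== n || s.label.length !== n) {
      throw new Error("all facet arrays in a series must have the same length");
    }
    out.push([s.x, s.y, s.size, s.label]);
  }

  return out;
}
```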

hkang1 commented 2 years ago

there is some support for this now (still being refined) via mode: 2 and series.facets api.

here's a demo, for early testing: https://leeoniya.github.io/uPlot/demos/scatter.html

a lot of the implementation is still done in userland (such as quadtree construction and path renderer).

This demo has been great to work off of!

So we ran into a weird issue where the scatter plot doesn't render properly when:

1) there's only one data point, or 2) ALL of the data points have the same value. So in the scatter plot demo, if we redefine data2 to be the following:

let data2 = filledArr(series, v => [
  filledArr(points, i => randInt(100,100)),
  filledArr(points, i => randInt(100,100)),
  filledArr(points, i => randInt(1,10000)),  // bubble size, population
  filledArr(points, i => (Math.random() + 1).toString(36).substring(7)), // label / country name
]);

If there is enough of a variance of about 0.00001 between the data points, then the rendering works ok. Wondering if there is logic around calculating ranges where it's causing a division by 0 somewhere (e.g. (value - min) / (max - min) and where max === min)...

This is what happens in the scatter demo: Screen Shot 2022-01-06 at 10 34 06 PM

In our use case it causes the browser to crash hard without console logs:

const data = [
  null,
  [
    [ 32, 32 ],
    [ 0.5, 0.6 ],
    null,  // bubble size (slight modification to handle a single size when null)
    null,  // bubble color (slight modification to handle a single color when null)
    [ 'test a', 'test a' ],
  ],
];

Screen Shot 2022-01-06 at 10 37 24 PM

When the data is slightly tweaked to:

const data = [
  null,
  [
    [ 32, 32.0000001 ],
    [ 0.5, 0.6 ],
    null,  // bubble size (slight modification to handle a single size when null)
    null,  // bubble color (slight modification to handle a single color when null)
    [ 'test a', 'test a' ],
  ],
];

then it renders somewhat ok, except for the x-axis, which lacks precision: Screen Shot 2022-01-06 at 10 45 56 PM

Any clues or hints on what can be done to handle 1 data point case? Please let me know if there's anything else I can provide, thanks.

leeoniya commented 2 years ago

@hkang1 this is essentially a duplicate of https://github.com/leeoniya/uPlot/issues/620.

any custom-supplied scale ranging functions must return a non-zero range. i've updated the demo to handle this in https://github.com/leeoniya/uPlot/commit/cb1e371e8d686669a7a5b2c6366a5016594ceb95
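
A minimal sketch of a ranging helper that never returns a zero-width range is shown below. The padding rule (10% of the value, or a fixed 0.5 around zero) and the name `nonZeroRange` are assumptions for illustration and are not taken from the linked commit.

```javascript
// Hypothetical helper: widen [min, max] when both ends are equal.
function nonZeroRange(min, max, zeroPad = 0.5) {
  if (min === max) {
    // Pad by 10% of the magnitude, or by a fixed amount when the value is 0.
    const pad = min === 0 ? zeroPad : Math.abs(min) * 0.1;
    return [min - pad, max + pad];
  }
  return [min, max];
}

// Normalizing a value against a range; with max === min this divides 0 by 0.
function normalize(v, min, max) {
  return (v - min) / (max - min);
}
```

With the raw range `[32, 32]`, `normalize(32, 32, 32)` evaluates to `NaN`, which matches the division-by-zero suspicion raised above; with the padded range the same point lands at the middle of the scale.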