olavolav / uniplot

Lightweight plotting to the terminal. 4x resolution via Unicode.
MIT License
370 stars 16 forks source link

Suggestion for speeding up large data #14

Closed kootenpv closed 1 year ago

kootenpv commented 1 year ago

Nice project!

So basically I have noticed that plotting has linear speed,

range(10_000_000) took 10x as long to plot as range(1_000_000)


If you are aware of the max resolution, then you could imagine you can compress the data, without resulting in a different plot

One way (probably it's possible to come up with something much better), would be to perhaps fit 1000 linear regressions on sorted data of 10_000_000 data points and use that to e.g. compress 10_000_000 rows into 1_000_000 (or less) - whatever is needed for the solution, basically the sign could be "0=stay equal, -decrease, +increase"

I was able to fit 1000 regressions in 300ms, whereas the naive plotting went from 1s to 10s, so a huge speed-up would be possible if you are able to compress the data

kootenpv commented 1 year ago

Another fun thought would be to create a lot of input data + plots (numerically as output use a 0 or 1 for pixels), and then use machine/deep learning to learn how to output the plot 😂 (actually the tricky part here is is variable sized input)

EDIT: that issue solved by below

kootenpv commented 1 year ago

How to compress 1000 data pairs (X, y) into 100 data pairs (X2, y2)

ChatGPT:

There are several ways to compress a dataset of 1000 data pairs (X, y) into 100 data pairs (X2, y2). Some methods include:

kootenpv commented 1 year ago

So yea, likely for large data you could just sample and plot that instead!

olavolav commented 1 year ago

Hi @kootenpv thanks for the suggestions & ideas! I'm afraid that as best as I can tell none of the methods you listed are guaranteed to result in the identical pixel solution.

I did some quick tests with and without lines, and got to the same result as you, the O(n) linear relation:

   Sample size versus plotting time, without lines, log-log
┌────────────────────────────────────────────────────────────┐
│                                                           ▞│ 
│                                                          ▞ │ 
│                                                         ▗▘ │ 10^-0.4 s
│                                                        ▗▘  │ 
│                                                       ▗▘   │ 
│                                                       ▌    │ 
│                                                      ▞     │ 
│                                                     ▞      │ 10^-0.9 s
│                                                    ▞       │ 
│                                                   ▟        │ 
│                                                 ▗▞▘        │ 
│                                                ▄▘          │ 
│                                              ▗▀            │ 10^-1.4 s
│                                            ▗▞▘             │ 
│▄▄▄▄▖                                      ▄▘               │ 
│    ▝▀▀▀▀▀▀▀▀▀▀▀▀▀▀▄▄▖            ▗▄▄▄▄▀▀▀▀                 │ 
│                     ▝▀▚▄▄▄▄▄▄▀▀▀▀▘                         │ 
└────────────────────────────────────────────────────────────┘
 1               100             10^4             10^6
    Sample size versus plotting time, with lines, log-log
┌────────────────────────────────────────────────────────────┐
│                                                          ▗▀│ 
│                                                         ▞▘ │ 
│                                                       ▄▀   │ 10 s
│                                                     ▗▞     │ 
│                                                    ▞▘      │ 
│                                                  ▗▀        │ 
│                                                ▗▞▘         │ 
│                                               ▄▘           │ 
│                                             ▄▀             │ 1 s
│                                           ▗▞               │ 
│                                         ▗▞▘                │ 
│                                        ▞▘                  │ 
│                                      ▄▀                    │ 
│                                    ▄▀                      │ 0.1 s
│                                 ▗▄▀                        │ 
│▖               ▄▄           ▗▄▞▀▘                          │ 
│▝▀▀▀▀▚▄▄▄▄▄▞▀▀▀▀  ▀▀▀▀▄▄▄▄▄▞▀▘                              │ 
└────────────────────────────────────────────────────────────┘
 1               100             10^4             10^6

There is probably lots of speedup possible even if we accept O(n) since as you correctly point out the dimensionality of the output, our pixel matrix, is quite limited. I think it would be worthwhile to do a flame graph for large numbers of pixels, there should be easy speedups possible. Also plotting with lines could be much faster, in particular I'm thinking of this line here: https://github.com/olavolav/uniplot/blob/ec4ba27a495791f4839066687786f84b82bf78fb/uniplot/pixel_matrix.py#L105

Would you perhaps like to have a go at any of these improvements yourself? Can't say when I will be able to find the time, though I am curious as you are what sort of speedups are possible.

kootenpv commented 1 year ago

Would you perhaps like to have a go at any of these improvements yourself? Can't say when I will be able to find the time, though I am curious as you are what sort of speedups are possible.

I'm very swamped with work but do think it's an interesting problem

Practically speaking, if I had too slow data I would probably subsample myself 😅

Also don't have time to get familiar with codebase, maybe at a later stage!

olavolav commented 1 year ago

@kootenpv Sure let me know when you want to take a look! In the meantime I've added scripts/scaling_benchmark.py to keep track of this.

Plus this conversation made me include logarithmic plotting, so thanks 😄

Closing this issue for the moment