Suggestion for speeding up large data

kootenpv commented 1 year ago

Nice project!

I have noticed that plotting time scales linearly with the number of points:

range(10_000_000) took 10x as long to plot as range(1_000_000).

If you know the maximum resolution of the output, you could compress the data beforehand without changing the resulting plot.
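One way to picture resolution-aware compression is to map every point to a cell of the output grid and keep a single point per occupied cell. The following is a minimal sketch, not uniplot code: the function name `dedupe_to_pixel_grid` and the `width` and `height` parameters are hypothetical, and the result only reproduces a scatter plot (without lines) if this grid matches the one the plotting library uses internally, which is not verified here.

```python
import numpy as np


def dedupe_to_pixel_grid(xs, ys, width, height):
    """Keep one representative point per occupied cell of a width x height grid."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    # Guard against a zero range (all values identical along one axis).
    x_span = (x_max - x_min) or 1.0
    y_span = (y_max - y_min) or 1.0
    # Map each point to an integer cell; clip so the maximum lands in the last cell.
    col = np.minimum(((xs - x_min) / x_span * width).astype(np.int64), width - 1)
    row = np.minimum(((ys - y_min) / y_span * height).astype(np.int64), height - 1)
    # One integer id per cell lets np.unique find the first point in each cell.
    cell = row * width + col
    _, first = np.unique(cell, return_index=True)
    first.sort()  # restore the original ordering of the kept points
    return xs[first], ys[first]
```

The number of points returned is bounded by `width * height`, no matter how large the input is, so the cost of the subsequent plotting step no longer grows with the sample size. The binning pass itself is still at least linear in the input.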

One way (it is probably possible to come up with something much better) would be to fit 1000 linear regressions on the sorted 10_000_000 data points and use them to compress the 10_000_000 rows into 1_000_000 (or fewer), whatever the solution needs. Basically, the sign of each fitted slope could be read as "0 = stay equal, - = decrease, + = increase".

I was able to fit 1000 regressions in 300 ms, whereas the naive plotting went from 1 s to 10 s, so a huge speed-up would be possible if you are able to compress the data.
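The regression idea could look roughly like the sketch below: sort by x, split into equal-sized chunks, fit a straight line to each chunk, and keep only the two endpoints of each fitted line. The function name `compress_with_segment_fits` is hypothetical, and this is one possible reading of the proposal rather than a tested implementation of it. It is lossy: noise inside a chunk is smoothed away, so the plot is not guaranteed to stay the same.

```python
import numpy as np


def compress_with_segment_fits(xs, ys, n_segments):
    """Replace each chunk of x-sorted data by the endpoints of its fitted line."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    order = np.argsort(xs, kind="stable")
    xs, ys = xs[order], ys[order]
    out_x, out_y = [], []
    chunks = zip(np.array_split(xs, n_segments), np.array_split(ys, n_segments))
    for seg_x, seg_y in chunks:
        if len(seg_x) == 0:
            continue
        if len(seg_x) < 2 or seg_x[0] == seg_x[-1]:
            # Too few points or no x extent to fit a line: keep the mean instead.
            out_x.append(seg_x.mean())
            out_y.append(seg_y.mean())
            continue
        # Least-squares line through the chunk (highest power first).
        slope, intercept = np.polyfit(seg_x, seg_y, deg=1)
        ends = np.array([seg_x[0], seg_x[-1]])
        out_x.extend(ends)
        out_y.extend(slope * ends + intercept)
    return np.array(out_x), np.array(out_y)
```

With `n_segments=1000`, any input is reduced to at most 2000 points, and the sign of each `slope` carries the "stay equal, decrease, increase" information mentioned above.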

kootenpv commented 1 year ago

Another fun thought would be to create a lot of input data + plots (using 0 or 1 per pixel as the numerical output), and then use machine/deep learning to learn how to output the plot 😂 (actually, the tricky part here is the variable-sized input)

EDIT: that issue is solved by the comment below

kootenpv commented 1 year ago

How to compress 1000 data pairs (X, y) into 100 data pairs (X2, y2)

ChatGPT:

There are several ways to compress a dataset of 1000 data pairs (X, y) into 100 data pairs (X2, y2). Some methods include:

Random subsampling: Select 100 data pairs randomly from the original dataset.
Clustering: Group the data into clusters, and select one representative data point from each cluster.
Dimensionality reduction: Use techniques such as PCA or t-SNE to reduce the dimensionality of the data, and then select a subset of the points.
Active learning: Select the most informative data points for the model to train on.
Over-sampling or under-sampling: Balance the class distribution by oversampling the minority class or undersampling the majority class.

It is important to note that the method you choose will depend on the specific characteristics of your dataset and the goal of your model.

kootenpv commented 1 year ago

So yeah, for large data you could likely just take a sample and plot that instead!
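Subsampling before plotting can be done in a few lines with NumPy. This is a minimal sketch; the function name `random_subsample` and the `max_points` and `seed` parameters are made up for illustration.

```python
import numpy as np


def random_subsample(xs, ys, max_points, seed=0):
    """Return at most max_points (x, y) pairs, drawn without replacement."""
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    if len(xs) <= max_points:
        return xs, ys
    rng = np.random.default_rng(seed)
    # Sort the chosen indices so the sample keeps the original ordering,
    # which matters if the points are later connected by lines.
    keep = np.sort(rng.choice(len(xs), size=max_points, replace=False))
    return xs[keep], ys[keep]
```

The returned arrays can then be handed to the plotting call in place of the full data, for example `plot(xs=sub_x, ys=sub_y)` with uniplot. A random sample can miss isolated outliers, so the plot may differ from the one drawn from the full data.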

olavolav commented 1 year ago

Hi @kootenpv, thanks for the suggestions & ideas! I'm afraid that, as best as I can tell, none of the methods you listed are guaranteed to result in the identical pixel solution.

I did some quick tests with and without lines and got the same result as you, a linear O(n) relation:

   Sample size versus plotting time, without lines, log-log
┌────────────────────────────────────────────────────────────┐
│                                                           ▞│ 
│                                                          ▞ │ 
│                                                         ▗▘ │ 10^-0.4 s
│                                                        ▗▘  │ 
│                                                       ▗▘   │ 
│                                                       ▌    │ 
│                                                      ▞     │ 
│                                                     ▞      │ 10^-0.9 s
│                                                    ▞       │ 
│                                                   ▟        │ 
│                                                 ▗▞▘        │ 
│                                                ▄▘          │ 
│                                              ▗▀            │ 10^-1.4 s
│                                            ▗▞▘             │ 
│▄▄▄▄▖                                      ▄▘               │ 
│    ▝▀▀▀▀▀▀▀▀▀▀▀▀▀▀▄▄▖            ▗▄▄▄▄▀▀▀▀                 │ 
│                     ▝▀▚▄▄▄▄▄▄▀▀▀▀▘                         │ 
└────────────────────────────────────────────────────────────┘
 1               100             10^4             10^6

    Sample size versus plotting time, with lines, log-log
┌────────────────────────────────────────────────────────────┐
│                                                          ▗▀│ 
│                                                         ▞▘ │ 
│                                                       ▄▀   │ 10 s
│                                                     ▗▞     │ 
│                                                    ▞▘      │ 
│                                                  ▗▀        │ 
│                                                ▗▞▘         │ 
│                                               ▄▘           │ 
│                                             ▄▀             │ 1 s
│                                           ▗▞               │ 
│                                         ▗▞▘                │ 
│                                        ▞▘                  │ 
│                                      ▄▀                    │ 
│                                    ▄▀                      │ 0.1 s
│                                 ▗▄▀                        │ 
│▖               ▄▄           ▗▄▞▀▘                          │ 
│▝▀▀▀▀▚▄▄▄▄▄▞▀▀▀▀  ▀▀▀▀▄▄▄▄▄▞▀▘                              │ 
└────────────────────────────────────────────────────────────┘
 1               100             10^4             10^6
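A linear relation shows up as a slope of about 1 on a log-log plot like the ones above. The sketch below shows one way to measure that slope from timings. The helper names `time_call` and `scaling_exponent` are hypothetical, and this is not the benchmark code used to produce the plots above.

```python
import time

import numpy as np


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over a few repeated calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def scaling_exponent(sizes, seconds):
    """Fit log10(seconds) = k * log10(sizes) + c and return the exponent k."""
    slope, _intercept = np.polyfit(np.log10(sizes), np.log10(seconds), deg=1)
    return slope
```

For small sample sizes the curves above are flat, presumably because fixed overhead dominates, so the fit is only meaningful over the range where the time actually grows with the sample size.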

There is probably a lot of speedup possible even if we accept O(n), since, as you correctly point out, the dimensionality of the output (our pixel matrix) is quite limited. I think it would be worthwhile to do a flame graph for large numbers of pixels; there should be easy speedups to find. Plotting with lines could also be much faster. In particular, I'm thinking of this line here: https://github.com/olavolav/uniplot/blob/ec4ba27a495791f4839066687786f84b82bf78fb/uniplot/pixel_matrix.py#L105
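A first look at where the time goes does not need a dedicated flame graph tool. The sketch below uses the standard library's `cProfile` instead, which gives a sorted table rather than a flame graph. The helper name `profile_top` is hypothetical; `func` stands for whatever plotting call is being investigated.

```python
import cProfile
import io
import pstats


def profile_top(func, *args, limit=10, **kwargs):
    """Run func under cProfile and return a report of the costliest calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so expensive call chains appear first.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

Printing the returned string for a call with a large input lists the functions that dominate the run time, which is usually enough to decide where optimization effort should go.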

Would you perhaps like to have a go at any of these improvements yourself? Can't say when I will be able to find the time, though I am curious as you are what sort of speedups are possible.

kootenpv commented 1 year ago

> Would you perhaps like to have a go at any of these improvements yourself? Can't say when I will be able to find the time, though I am curious as you are what sort of speedups are possible.

I'm very swamped with work, but I do think it's an interesting problem.

Practically speaking, if my data were too slow to plot, I would probably just subsample it myself 😅

I also don't have time to get familiar with the codebase right now, maybe at a later stage!

olavolav commented 1 year ago

@kootenpv Sure, let me know when you want to take a look! In the meantime, I've added scripts/scaling_benchmark.py to keep track of this.

Plus this conversation made me include logarithmic plotting, so thanks 😄

Closing this issue for the moment

olavolav / uniplot

Suggestion for speeding up large data #14