Series with `+inf` or `-inf` values completely fail to display when zoomed out

varon commented 1 year ago

Thank you for creating this great library.

We are using this to plot time-series data of discrete values.

In order to avoid having interpolation lines between these discrete values, we insert +inf values into the series prior to display so that it only displays in horizontal line segments of the series.

This works as expected for plotly, but fails with the Resampler, as it frequently stumbles upon a +inf value when sampling from a series. This can cause the series to display unreliably: depending on the exact sample chosen, the entire line often disappears, failing to display anything at all for that section.

The suggested fix is that, when sampling, if a +inf value is found, the sampler nudges left or right by one position to find a non-inf neighbouring sample to use.
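
A minimal sketch of the nudging idea, applied to a list of already-sampled indices. This is not how the Resampler is implemented; the helper name `nudge_to_finite` and the left-then-right preference are assumptions made purely for illustration.

```python
import numpy as np

def nudge_to_finite(y, sampled_idx):
    """Move each sampled index that lands on a non-finite value to an
    immediate neighbour (left first, then right) that holds a finite value."""
    y = np.asarray(y, dtype=float)
    out = []
    for i in sampled_idx:
        i = int(i)
        if np.isfinite(y[i]):
            out.append(i)
        elif i > 0 and np.isfinite(y[i - 1]):
            out.append(i - 1)
        elif i + 1 < len(y) and np.isfinite(y[i + 1]):
            out.append(i + 1)
        else:
            out.append(i)  # no finite neighbour: keep the original sample
    return out

y = [1.0, 1.0, float("inf"), 2.0, 2.0, float("inf"), 3.0]
nudged = nudge_to_finite(y, [0, 2, 5, 6])
print(nudged)
```

Note that nudging on its own drops the +inf marker from the sampled output, so neighbouring segments would be joined again unless a gap is reinserted separately.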

jvdd commented 1 year ago

Hey!

Thank you for creating this issue! I'll gladly look into this and try to help you :) Can you provide a minimal example / code snippet that illustrates the problem you are experiencing?

Do I understand correctly that you want to have gaps (i.e., disconnected lines) in your plot?

Cheers, Jeroen

varon commented 1 year ago

Thanks - I'll get that sorted out ASAP for you.

varon commented 1 year ago

Reproduction code

import time
from random import random
import pandas as pd
import plotly.graph_objects as go
import plotly.subplots
from pandas import DataFrame
from plotly_resampler import FigureResampler

################
# Setup the data
################

seriesLength = 100_000
values = [0]
# create a copy with flat values
flat_values = [0]
times = [time.time()]
for i in range(1, seriesLength):
    prev = values[i-1]
    prevTime = times[i-1]
    if random() > 0.999:
        delta = random() * 2 - 1
        flat_values.append(float('+inf'))
    else:
        delta = 0
        flat_values.append(1+prev)
    values.append(prev + delta)
    times.append(prevTime + 60)

all_data = {'times': times, 'values': values, 'flat_values': flat_values }

################
# SETUP THE PLOT
################

dataframe = DataFrame(all_data)
dataframe['times'] = pd.to_datetime(dataframe['times'], unit = 's')
dataframe = dataframe.set_index('times')

fig = FigureResampler(plotly.subplots.make_subplots())
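# Give each trace its own name so the two series can be told apart in the legend
fig.add_trace(go.Scattergl(name = 'values', showlegend = True), hf_x = dataframe.index, hf_y = dataframe['values'])
fig.add_trace(go.Scattergl(name = 'flat_values', showlegend = True), hf_x = dataframe.index, hf_y = dataframe['flat_values'])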
fig.show_dash(mode = "inline")

This glitches horribly when the second trace is enabled, making series randomly invisible and breaking the plot dimensions/fitting.

The goal of the second series is to produce discrete, horizontal-only lines corresponding to the points in the first, without any interpolation lines going between changes in the Y values.

The use case for this is highlighting key thresholds in line charts during financial time-series analysis.

This should reproduce the issue. It was run inside PyCharm Professional 2022.3.1.

`Pipfile` used:

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
jupyter = "*"
plotly = "*"
plotly-resampler = "*"
pandas = "*"

[dev-packages]

[requires]
python_version = "3.9"

varon commented 1 year ago

@jvdd - Thanks again for looking at this. Just a reminder on this.

jvdd commented 1 year ago

Looking into it @varon!

jvdd commented 1 year ago

Hi @varon,

First of all thanks for the reproducible code :+1:

After looking into this issue, I arrived at the following observations:

- The downsampling algorithms we use tend to look for extrema (cf. MinMax downsampling, LTTB downsampling, ...). Adding +inf to your data to induce a gap will thus result in that point being selected, so most of the time +inf ends up representing the bin (99.2% of the selected points are +inf with our default downsampler).
- An alternative way to represent gaps, proposed in the plotly.py documentation, is to use None. Doing this results in a plot similar to the values one (i.e., the gaps are now connected via vertical lines).
- I explored the use of line_shape to try to get rid of the vertical lines, but this did not seem to deliver the desired results (@jonasvdd, can you confirm that it is indeed not possible to add gaps via line_shape?).
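
The first observation can be illustrated with a naive per-bin min/max selection. This is only a sketch in plain NumPy, not the downsampler that plotly-resampler actually uses; the function name `minmax_downsample` and the equal-width bin layout are assumptions for illustration.

```python
import numpy as np

def minmax_downsample(y, n_bins):
    """Naive extrema selection: keep the argmin and argmax of every bin."""
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0, len(y), n_bins + 1).astype(int)
    picked = []
    for start, stop in zip(edges[:-1], edges[1:]):
        chunk = y[start:stop]
        # A set avoids a duplicate when min and max fall on the same index
        picked.extend({int(start + chunk.argmin()), int(start + chunk.argmax())})
    return np.array(sorted(picked))

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)
y[::1_000] = np.inf  # one +inf "gap marker" per 1000 samples

idx = minmax_downsample(y, n_bins=10)
n_inf = int(np.isinf(y[idx]).sum())
share_inf = float(np.isinf(y[idx]).mean())
print(n_inf, share_inf)
```

Although only 0.1% of the input is +inf here, every bin contains one marker and each marker wins the argmax of its bin, so half of the selected points are +inf.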

My takeaways from this issue: correctly handling gaps is an issue for this project (and for time series downsampling in general). The underlying issue is that most downsampling algorithms (written in lower-level programming languages, e.g. C / Rust) cannot deal with NaNs, so plotly-resampler removes NaNs from the time series and uses a heuristic to add gaps post-hoc using a diff on the x.
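
A rough sketch of what a diff-on-x gap heuristic can look like. The thread does not describe the exact rule plotly-resampler applies, so the threshold used here (a multiple of the median spacing, controlled by a hypothetical `factor` parameter) and the function name are assumptions.

```python
import numpy as np

def insert_gaps_by_x_diff(x, y, factor=4.0):
    """Insert a NaN y-value wherever the x-spacing exceeds factor * median spacing."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = np.diff(x)
    # gap_after[k] = i means a gap lies between samples i and i + 1
    gap_after = np.flatnonzero(dx > factor * np.median(dx))
    # np.insert places the new element *before* the given position, hence + 1
    x_out = np.insert(x, gap_after + 1, x[gap_after])
    y_out = np.insert(y, gap_after + 1, np.nan)
    return x_out, y_out

x = [0, 1, 2, 3, 20, 21, 22]
y = [5, 5, 5, 5, 7, 7, 7]
x_gap, y_gap = insert_gaps_by_x_diff(x, y)
print(x_gap, y_gap)
```

Such a heuristic only sees gaps that show up as irregular x-spacing. In the reproduction above the timestamps are evenly spaced, so a gap marked only by a special y-value would not be detected this way.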

Possible solutions:

- Add gaps post-hoc using the NaN positions in the original data. The downside is that calculating the NaN positions is quite an expensive operation, so it should only be performed once, when adding the trace.
- Support NaNs in the downsampler. Feel free to contribute to this issue: https://github.com/predict-idlab/tsdownsample/issues/24
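
The first option could be sketched roughly as follows. The names `downsample_then_restore_gaps` and `every_nth` are hypothetical, and `every_nth` is just a stand-in for a real downsampler that returns indices; none of this reflects the actual plotly-resampler code.

```python
import numpy as np

def every_nth(values, n_out):
    """Stand-in downsampler: evenly spaced indices into `values`."""
    return np.linspace(0, len(values) - 1, n_out).astype(int)

def downsample_then_restore_gaps(y, downsample, n_out):
    y = np.asarray(y, dtype=float)
    nan_pos = np.flatnonzero(np.isnan(y))     # computed once per trace
    valid_pos = np.flatnonzero(~np.isnan(y))
    # Downsample the NaN-free data, then map back to original positions
    picked = valid_pos[downsample(y[valid_pos], n_out)]
    # Number of NaNs located before each picked index; an increase between
    # two consecutive picks means at least one NaN sits between them
    nans_before = np.searchsorted(nan_pos, picked)
    gap_after = np.flatnonzero(np.diff(nans_before) > 0)
    x_out = np.insert(picked, gap_after + 1, picked[gap_after])
    y_out = np.insert(y[picked], gap_after + 1, np.nan)
    return x_out, y_out

y = [1, 1, 1, np.nan, 2, 2, 2, np.nan, 3, 3, 3, 3]
x_out, y_out = downsample_then_restore_gaps(y, every_nth, n_out=4)
print(x_out, y_out)
```

The NaN positions only depend on the original data, so in a real integration they could be cached when the trace is added and reused on every zoom or pan.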

varon commented 1 year ago

@jvdd - Thank you very much for such a super-detailed explanation and reporting experience - I'm glad the code was useful to reproduce on your side.

I'm very familiar with low-level programming, and I'm looking to brush up on my Rust. I'd be happy to give that a go.

If possible, I'd love if you could provide as much detail on how to go about the task as you can. I'm an experienced developer, but I have no idea how to test or verify correctness here as I'm not familiar with the Rust ecosystem. It's unlikely, but if I do get horrendously stuck I'd love if you could lend a hand. Maybe a good opportunity to connect and collaborate - I certainly appreciate the rigor in your approach.

jvdd commented 1 year ago

@varon - I am very happy to hear that you are interested in contributing!! :rocket:

Some background / additional info: In the near future we will integrate tsdownsample in plotly-resampler -> this will make the downsampling 10x faster - scaling plotly-resampler to billions of datapoints! To realize such extremely fast downsampling, I optimized argmin & argmax (together in 1 function) in the argminmax project. In this project I use SIMD intrinsics with runtime feature set detection, so that the fastest available implementation is selected at runtime!

How can we handle NaNs: I think supporting NaNs will boil down to handling NaNs in the argminmax project. I haven't really given it a proper look, but from quickly experimenting a while ago we observed weird / unstable behavior when NaNs were present in the data (I think @jonasvdd can confirm this?)
If argminmax handles NaNs, this will directly transfer to the MinMaxDownsampler, M4Downsampler, and (partially) MinMaxLTTB supporting NaNs as well.
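
To make "handling NaNs" concrete, here are two possible policies for a combined argmin/argmax, written in NumPy. This is only an illustration of the design space; the function names are hypothetical and this is not the argminmax API.

```python
import numpy as np

def argminmax_ignore_nan(y):
    """Policy 1: skip NaNs and return the extrema of the remaining values."""
    y = np.asarray(y, dtype=float)
    if np.isnan(y).all():
        raise ValueError("all values are NaN")
    return int(np.nanargmin(y)), int(np.nanargmax(y))

def argminmax_return_nan(y):
    """Policy 2: if any NaN is present, report the index of the first NaN
    for both min and max, so the caller can see that the bin has a gap."""
    y = np.asarray(y, dtype=float)
    nan_idx = np.flatnonzero(np.isnan(y))
    if nan_idx.size:
        first = int(nan_idx[0])
        return first, first
    return int(y.argmin()), int(y.argmax())

y = [0.5, np.nan, -2.0, 3.0]
ignored = argminmax_ignore_nan(y)
propagated = argminmax_return_nan(y)
print(ignored, propagated)
```

The first policy keeps the line shape but loses the gap; the second keeps the gap visible to downstream code at the cost of hiding the bin's real extrema.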

How can you contribute? I think looking into adding NaN handling capabilities to argminmax is the key to support NaNs in plotly-resampler. Looking into this & sharing your findings will be huge.

Other meaningful contributions:

If you have SIMD experience - you could review the SIMD algorithm in https://github.com/jvdd/argminmax/blob/main/src/simd/generic.rs
If you have profiling (tuning) experience - you could profile & further (possibly) optimize the argminmax library.

I'll add a CONTRIBUTING.md file to argminmax later today - which should help you with testing & benchmarking. The library is quite thoroughly tested - so as long as these tests pass I am pretty sure everything still works ;)
If you ever get stuck you can always create an Issue - and we can of course connect and have more of a direct communication line :)

P.S.: I learned Rust a couple of months ago. This is how I did it:

skimmed the first 5-6 chapters of the book while toying around with the examples
started programming with a goal (which was trying to optimize argminmax)

As long as you have a goal & it remains fun, motivation will be a direct side-effect :)

jonasvdd commented 1 year ago

@jvdd @varon:

Regarding the unstable behavior of argminmax / tsdownsample when NaNs are present: I only had a quick glance at it and tried the snippet ⬇️

from tsdownsample import EveryNthDownsampler, LTTBDownsampler, MinMaxLTTBDownsampler, MinMaxDownsampler
import numpy as np
import pandas as pd

# construct data
n = 1_000_000
x = pd.Series(np.random.randn(n))
x.index -= x.index[0]
x[::150_000] = np.nan  # ~ 7 nans
np.where(np.isnan(x))[0]

# downsample using various dtypes, downsamplers, and n_outs
for dtype in [
    # np.float16, 
    np.float32, 
    np.float64]:
    x_ = x.values.astype(dtype)
    print(dtype, np.isnan(x_).sum())
    print(LTTBDownsampler().downsample(x_, n_out=8))
    # print(MinMaxLTTBDownsampler().downsample(x_, n_out=10))
    print(MinMaxDownsampler().downsample(x_, n_out=8))
    # print(MinMaxDownsampler().downsample(x_, n_out=10))
    # print(MinMaxDownsampler().downsample(x_, n_out=20))
    print(MinMaxDownsampler().downsample(x_, n_out=26))
    print('-'*88)

Which gave me the following output and error

<class 'numpy.float32'> 7
[     0      1 166667 333333 500000 666666 900000 999999]
[     0      0 312263 444720 526056 724563 750000 750000]
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/argminmax-0.3.1/src/task.rs:68:75
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/tmp/ipykernel_11988/1501389405.py in <cell line: 15>()
     22     # print(MinMaxDownsampler().downsample(x_, n_out=10))
     23     # print(MinMaxDownsampler().downsample(x_, n_out=20))
---> 24     print(MinMaxDownsampler().downsample(x_, n_out=26))
     25     print('-'*88)

~/.cache/pypoetry/virtualenvs/plotly-resampler-X8YSXkmq-py3.10/lib/python3.10/site-packages/tsdownsample/downsampling_interface.py in downsample(self, n_out, parallel, *args, **kwargs)
    320     ):
    321         """Downsample the data in x and y."""
--> 322         return super().downsample(*args, n_out=n_out, parallel=parallel, **kwargs)

~/.cache/pypoetry/virtualenvs/plotly-resampler-X8YSXkmq-py3.10/lib/python3.10/site-packages/tsdownsample/downsampling_interface.py in downsample(self, n_out, *args, **kwargs)
    109         if x is not None:
    110             self._supports_dtype(x, y=False)
--> 111         return self._downsample(x, y, n_out, **kwargs)
    112 
    113 

~/.cache/pypoetry/virtualenvs/plotly-resampler-X8YSXkmq-py3.10/lib/python3.10/site-packages/tsdownsample/downsampling_interface.py in _downsample(self, x, y, n_out, parallel, **kwargs)
    301         if x is None:
    302             downsample_f = self._switch_mod_with_y(y.dtype, mod)
--> 303             return downsample_f(y, n_out, **kwargs)
    304         elif np.issubdtype(x.dtype, np.datetime64):
    305             # datetime64 is viewed as int64

PanicException: called `Option::unwrap()` on a `None` value

varon commented 1 year ago

Hey @jvdd + @jonasvdd - Any updates on the integration of argminmax into tsdownsample + plotly-resampler? Anything else I can help with here?

jonasvdd commented 1 year ago

Hey @varon,

Thanks again for aiding with this codebase, we greatly appreciate you helping us out! 🤗

At the moment @jvdd and I have limited bandwidth as we are busy writing two papers. Afterwards, this is at the top of our todo-list!

As usual, there is still enough work to be done. In order to achieve the tsdownsample integration, we first wanted to update plotly-resampler's underlying data aggregation interface, as I did in #154. This includes some major changes, and an additional review wouldn't hurt!

Regarding other tsdownsample and argminmax work, I think @jvdd can provide more detailed info. https://github.com/predict-idlab/tsdownsample/issues/30

varon commented 1 year ago

Thank you for the update - I'll definitely throw in a review here.

As this is a pretty important task for me, is there any other work that I can try to tackle? I'm obviously not as familiar with the projects, but while you guys are stuck on bandwidth I'm happy to help out where I can!

jvdd commented 1 year ago

Hey @varon

In #154 we also decoupled the gap handling code - users can now pass a gap handler per trace! I think using a custom gap handler (i.e., a tailored implementation of AbstractGapHandler) could be a rather elegant solution for this issue (as it seems like you already know where you want to insert the gaps, you might as well insert them after the downsampling instead of before the downsampling)

You can try this out using our latest pre-release v0.9.0rc0

Hope this helps & thx again for your help with integrating tsdownsample :handshake:
