Open varon opened 1 year ago
Hey!
Thank you for creating this issue! I'll gladly look into this and try to help you :) Can you provide a minimal example / code snippet that illustrates the problem you are experiencing?
Do I understand correctly that you want to have gaps (i.e., disconnected lines) in your plot?
Cheers, Jeroen
Thanks - I'll get that sorted out ASAP for you.
import time
from random import random
import pandas as pd
import plotly.graph_objects as go
import plotly.subplots
from pandas import DataFrame
from plotly_resampler import FigureResampler
################
# Setup the data
################
seriesLength = 100_000
values = [0]
# create a copy with flat values
flat_values = [0]
times = [time.time()]
for i in range(1, seriesLength):
    prev = values[i-1]
    prevTime = times[i-1]
    if random() > 0.999:
        delta = random() * 2 - 1
        flat_values.append(float('+inf'))
    else:
        delta = 0
        flat_values.append(1+prev)
    values.append(prev + delta)
    times.append(prevTime + 60)
all_data = {'times': times, 'values': values, 'flat_values': flat_values }
################
# SETUP THE PLOT
################
dataframe = DataFrame(all_data)
dataframe['times'] = pd.to_datetime(dataframe['times'], unit = 's')
dataframe = dataframe.set_index('times')
fig = FigureResampler(plotly.subplots.make_subplots())
trace = go.Scattergl(name = 'values', showlegend = True)
fig.add_trace(trace, hf_x = dataframe.index, hf_y = dataframe['values'])
fig.add_trace(trace, hf_x = dataframe.index, hf_y = dataframe['flat_values'])
fig.show_dash(mode = "inline")
This glitches horribly when the second trace is enabled, making series randomly invisible and breaking the plot dimensions/fitting.
The goal of the second series is to produce discrete, horizontal-only lines corresponding to the points in the first, without any interpolation lines going between changes in the Y values.
The use case for this is highlighting key thresholds in line charts during financial time-series analysis.
The code above should reproduce the problem. It was run inside PyCharm Professional 2022.3.1.
Pipfile used:
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages]
jupyter = "*"
plotly = "*"
plotly-resampler = "*"
pandas = "*"
[dev-packages]
[requires]
python_version = "3.9"
@jvdd - Thanks again for looking at this. Just a reminder on this.
Looking into it @varon!
Hi @varon,
First of all thanks for the reproducible code :+1:
After looking into this issue, I arrived at the following observations:
Adding +inf to your data to induce a gap will thus result in that point getting selected - which means that, most of the time, +inf will represent the bin. (99.2% of the selected points are +inf with our default downsampler)
Using None instead => doing this will result in a plot similar to the 'values' one (i.e., the gaps are now connected via vertical lines)
I tried line_shape to get rid of the vertical lines, but this did not seem to deliver the desired results (@jonasvdd can you confirm that it is indeed not possible to add gaps via line_shape?)
My takeaways from this issue: correctly handling gaps is an issue for this project (& time series downsampling in general). The underlying issue is that most downsampling algorithms (written in lower-level programming languages, e.g. C / Rust) cannot deal with NaNs -> plotly-resampler removes NaNs in the time series & uses a heuristic to add gaps post-hoc using a diff on the x.
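To make the "add gaps post-hoc using a diff on the x" idea concrete, here is a minimal sketch. It assumes a gap is any x-step larger than a multiple of the median step; the function name `insert_gaps_by_x_diff` and the `factor` threshold are made up for illustration and are not plotly-resampler's actual implementation.

```python
import numpy as np

def insert_gaps_by_x_diff(x, y, factor=3.0):
    """Insert a NaN into y (and a midpoint into x) wherever the step in x
    exceeds `factor` times the median step. Hypothetical helper."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    steps = np.diff(x)
    # Indices i where the jump from x[i] to x[i+1] is unusually large.
    gap_after = np.where(steps > factor * np.median(steps))[0]
    # np.insert places new values *before* the given indices,
    # so shift by one to land between x[i] and x[i+1].
    x_out = np.insert(x, gap_after + 1, (x[gap_after] + x[gap_after + 1]) / 2)
    y_out = np.insert(y, gap_after + 1, np.nan)
    return x_out, y_out

x = np.array([0, 1, 2, 3, 10, 11, 12])
y = np.array([5, 5, 5, 5, 7, 7, 7])
x_gap, y_gap = insert_gaps_by_x_diff(x, y)
```

A heuristic like this only sees irregular spacing in x, so it cannot find gaps in a regularly sampled series such as the one in the reproduction code, where every step is 60 seconds.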
Possible solutions:
Determining the NaN positions is a quite expensive operation - this should only be performed once, when adding the trace.
Handling NaNs in the downsampler. Feel free to contribute to this issue: https://github.com/predict-idlab/tsdownsample/issues/24
@jvdd - Thank you very much for such a super-detailed explanation and reporting experience - I'm glad the code was useful to reproduce on your side.
I'm very familiar with low-level programming, and I'm looking to brush up on my Rust. I'd be happy to give that a go.
If possible, I'd love if you could provide as much detail on how to go about the task as you can. I'm an experienced developer, but I have no idea how to test or verify correctness here as I'm not familiar with the Rust ecosystem. It's unlikely, but if I do get horrendously stuck I'd love if you could lend a hand. Maybe a good opportunity to connect and collaborate - I certainly appreciate the rigor in your approach.
@varon - I am very happy to hear that you are interested in contributing!! :rocket:
Some background / additional info:
In the near future we will integrate tsdownsample in plotly-resampler -> this will make the downsampling 10x faster - scaling plotly-resampler to billions of datapoints!
To realize such extremely fast downsampling, I optimized argmin & argmax (together in 1 function) in the argminmax project. In this project I use SIMD intrinsics with runtime feature set detection to execute at runtime the most optimal implementation!
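As an illustration of what "argmin & argmax together in 1 function" means, here is a plain-Python sketch of the single-pass idea. It says nothing about how argminmax implements this with SIMD; the name `argminmax_scalar` is made up for this example.

```python
def argminmax_scalar(values):
    """Return (index_of_min, index_of_max) in a single pass over `values`."""
    if len(values) == 0:
        raise ValueError("values must not be empty")
    min_idx = max_idx = 0
    for i in range(1, len(values)):
        if values[i] < values[min_idx]:
            min_idx = i
        elif values[i] > values[max_idx]:
            # A value below the current minimum can never exceed the current
            # maximum, so `elif` is safe here.
            max_idx = i
    return min_idx, max_idx

result = argminmax_scalar([3, 1, 4, 1, 5])
```

Every comparison against NaN evaluates to False, so in a naive loop like this a NaN is silently skipped unless it sits at index 0, in which case it "wins" both slots.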
How can we handle NaNs:
I think supporting NaNs will boil down to handling NaNs in the argminmax project. I haven't really given it a proper look, but from quickly experimenting a while ago we observed weird / unstable behavior when NaNs were present in the data (I think @jonasvdd can confirm this?)
If argminmax handles NaNs, this will directly transfer to the MinMaxDownsampler, M4Downsampler, and (partially) MinMaxLTTB supporting NaNs as well.
How can you contribute?
I think looking into adding NaN handling capabilities to argminmax is the key to support NaNs in plotly-resampler. Looking into this & sharing your findings will be huge.
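For reference, NumPy itself exposes two NaN policies that could serve as a starting point when deciding what argminmax should do. This sketch only uses NumPy; it does not call argminmax and makes no claim about its current or future behavior.

```python
import numpy as np

y = np.array([1.0, np.nan, -2.0, 3.0])

# np.argmin / np.argmax propagate NaN: both return the index of the first NaN.
idx_min_propagate = int(np.argmin(y))
idx_max_propagate = int(np.argmax(y))

# np.nanargmin / np.nanargmax ignore NaN and only consider the other values.
idx_min_ignore = int(np.nanargmin(y))
idx_max_ignore = int(np.nanargmax(y))
```

For plotting, the "propagate" policy would keep the NaN (and thus the gap) in the selected points, while the "ignore" policy would keep the true extremes but drop the gap.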
Other meaningful contributions:
I'll add a CONTRIBUTING.md file to argminmax later today - which should help you with testing & benchmarking. The library is quite thoroughly tested - so as long as the tests pass I am pretty sure everything still works ;)
If you ever get stuck you can always create an issue - and we can of course connect and have a more direct line of communication :)
P.S.: I learned Rust a couple of months ago. This is how I did it:
argminmax)
As long as you have a goal & it remains fun, motivation will be a direct side-effect :)
@jvdd @varon:
Regarding the unstable behavior of argminmax / tsdownsample when NaNs are present:
I only had a quick glance at it and tried the snippet ⬇️
from tsdownsample import EveryNthDownsampler, LTTBDownsampler, MinMaxLTTBDownsampler, MinMaxDownsampler
import numpy as np
import pandas as pd
# construct data
n = 1_000_000
x = pd.Series(np.random.randn(n))
x.index -= x.index[0]
x[::150_000] = np.nan # ~ 7 nans
np.where(np.isnan(x))[0]
# downsample using various dtypes, downsamplers, and n_outs
for dtype in [
    # np.float16,
    np.float32,
    np.float64]:
    x_ = x.values.astype(dtype)
    print(dtype, np.isnan(x_).sum())
    print(LTTBDownsampler().downsample(x_, n_out=8))
    # print(MinMaxLTTBDownsampler().downsample(x_, n_out=10))
    print(MinMaxDownsampler().downsample(x_, n_out=8))
    # print(MinMaxDownsampler().downsample(x_, n_out=10))
    # print(MinMaxDownsampler().downsample(x_, n_out=20))
    print(MinMaxDownsampler().downsample(x_, n_out=26))
    print('-'*88)
This gave me the following output and error:
<class 'numpy.float32'> 7
[ 0 1 166667 333333 500000 666666 900000 999999]
[ 0 0 312263 444720 526056 724563 750000 750000]
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/argminmax-0.3.1/src/task.rs:68:75
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
/tmp/ipykernel_11988/1501389405.py in <cell line: 15>()
22 # print(MinMaxDownsampler().downsample(x_, n_out=10))
23 # print(MinMaxDownsampler().downsample(x_, n_out=20))
---> 24 print(MinMaxDownsampler().downsample(x_, n_out=26))
25 print('-'*88)
~/.cache/pypoetry/virtualenvs/plotly-resampler-X8YSXkmq-py3.10/lib/python3.10/site-packages/tsdownsample/downsampling_interface.py in downsample(self, n_out, parallel, *args, **kwargs)
320 ):
321 """Downsample the data in x and y."""
--> 322 return super().downsample(*args, n_out=n_out, parallel=parallel, **kwargs)
~/.cache/pypoetry/virtualenvs/plotly-resampler-X8YSXkmq-py3.10/lib/python3.10/site-packages/tsdownsample/downsampling_interface.py in downsample(self, n_out, *args, **kwargs)
109 if x is not None:
110 self._supports_dtype(x, y=False)
--> 111 return self._downsample(x, y, n_out, **kwargs)
112
113
~/.cache/pypoetry/virtualenvs/plotly-resampler-X8YSXkmq-py3.10/lib/python3.10/site-packages/tsdownsample/downsampling_interface.py in _downsample(self, x, y, n_out, parallel, **kwargs)
301 if x is None:
302 downsample_f = self._switch_mod_with_y(y.dtype, mod)
--> 303 return downsample_f(y, n_out, **kwargs)
304 elif np.issubdtype(x.dtype, np.datetime64):
305 # datetime64 is viewed as int64
PanicException: called `Option::unwrap()` on a `None` value
Hey @jvdd + @jonasvdd - Any updates on the integration of argminmax into tsdownsample + plotly-resampler? Anything else I can help with here?
Hey @varon,
Thanks again for aiding with this codebase, we greatly appreciate you helping us out! 🤗
At the moment @jvdd and I have limited bandwidth as we are busy writing two papers. Afterwards, this is at the top of our to-do list!
As usual, there is still enough work to be done.
In order to achieve the tsdownsample integration, we first wanted to update plotly-resampler's underlying data aggregation interface, as I did in #154. This includes some major changes, and an additional review wouldn't hurt!
Regarding other tsdownsample and argminmax work, I think @jvdd can provide more detailed info. https://github.com/predict-idlab/tsdownsample/issues/30
Thank you for the update - I'll definitely throw in a review here.
As this is a pretty important task for me, is there any other work that I can try to tackle? I'm obviously not as familiar with the projects, but while you guys are stuck on bandwidth I'm happy to help out where I can!
Hey @varon
In #154 we also decoupled the gap handling code - users can now pass a gap handler per trace!
I think using a custom gap handler (i.e., a tailored implementation of AbstractGapHandler) could be a rather elegant solution for this issue (as it seems like you already know where you want to insert the gaps, you might as well insert them after the downsampling instead of before the downsampling)
You can try this out using our latest pre-release v0.9.0rc0
Hope this helps & thx again for your help with integrating tsdownsample :handshake:
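To illustrate the "insert the gaps after the downsampling" idea independently of the library, here is a rough sketch. It assumes numeric x values (e.g., epoch seconds) and a list of known gap positions; `insert_known_gaps` is a hypothetical standalone helper and does not follow the AbstractGapHandler interface, whose exact methods are not shown in this thread.

```python
import numpy as np

def insert_known_gaps(x_agg, y_agg, gap_x):
    """Insert a NaN sample between consecutive aggregated points that
    straddle one of the known gap positions in `gap_x`. Hypothetical helper."""
    x_agg = np.asarray(x_agg, dtype=float)
    y_agg = np.asarray(y_agg, dtype=float)
    gap_x = np.asarray(gap_x, dtype=float)
    # For every gap, find the first aggregated point at or to its right;
    # several gaps between the same two points collapse into one break.
    pos = np.unique(np.searchsorted(x_agg, gap_x, side="left"))
    # Gaps before the first or after the last point need no break.
    pos = pos[(pos > 0) & (pos < len(x_agg))]
    # Place the break halfway between the two neighbouring points.
    x_break = (x_agg[pos - 1] + x_agg[pos]) / 2
    x_out = np.insert(x_agg, pos, x_break)
    y_out = np.insert(y_agg, pos, np.nan)
    return x_out, y_out

x_out, y_out = insert_known_gaps(
    x_agg=[0, 10, 20, 30], y_agg=[1, 1, 2, 2], gap_x=[15, 17]
)
```

Because the breaks are added to the already-downsampled points, the downsampler itself never sees a NaN or +inf value.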
Thank you for creating this great library.
We are using this to plot time-series data of discrete values.
In order to avoid having interpolation lines between these discrete values, we insert +inf values into the series prior to display so that it only displays the horizontal line segments of the series.
This works as expected for plotly, but fails with the Resampler, as it frequently stumbles upon a +inf value when sampling from a series. This can cause the series to display unreliably: depending on the exact sample chosen, the entire line often disappears, failing to display anything at all for that section.
The suggested fix is that, when sampling, if a +inf value is found, try to nudge either left/right by one value place to find a non-inf neighbouring sample to use.
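A minimal sketch of the suggested nudge, assuming the sampler has already produced a list of selected indices. The helper name `nudge_to_finite` is hypothetical, and unlike the one-step nudge described above it keeps searching outward until it finds a finite neighbour.

```python
import numpy as np

def nudge_to_finite(y, selected):
    """For each selected index pointing at a non-finite value, move to the
    nearest finite neighbour (left is checked before right at each distance).
    Indices are left unchanged if no finite value exists at all."""
    y = np.asarray(y, dtype=float)
    finite = np.isfinite(y)
    out = []
    for idx in selected:
        replacement = int(idx)
        if not finite[idx]:
            for offset in range(1, len(y)):
                left, right = idx - offset, idx + offset
                if left >= 0 and finite[left]:
                    replacement = int(left)
                    break
                if right < len(y) and finite[right]:
                    replacement = int(right)
                    break
        out.append(replacement)
    return out

y = np.array([1.0, 1.0, np.inf, 2.0, 2.0, np.inf, np.inf, 3.0])
picked = nudge_to_finite(y, [2, 5, 6])
```

Note that a nudged sample no longer encodes the gap itself, so the break between segments would still have to be reintroduced after sampling.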