holoviz / hvplot

A high-level plotting API for pandas, dask, xarray, and networkx built on HoloViews
https://hvplot.holoviz.org
BSD 3-Clause "New" or "Revised" License
1.14k stars, 108 forks

Difference between pandas and hvplot for ecobee dataset #289

Open michaelaye opened 5 years ago

michaelaye commented 5 years ago

Versions

Package Version
hvplot 0.4.0
holoviews 1.12.5
bokeh 1.3.4
notebook 6.0.0
python 3.7.3
Browser Chrome 75, Safari 12.1.1

Description of expected behavior and the observed behavior

The basic shape of the graph produced should be the same.

Complete, minimal, self-contained example code that reproduces the issue

Get CSV from here (250 KB): https://www.dropbox.com/s/m4cwi5kdve9x67n/report-319299697687-2019-08-07-to-2019-08-14.csv?dl=1

import pandas as pd
import hvplot.pandas

df = pd.read_csv("report-319299697687-2019-08-07-to-2019-08-14.csv",
                 comment="#")
col = "Thermostat Temperature (C)"
df[col].hvplot()
df[col].plot()

Screenshots or screencasts of the bug in action

[Screenshot, 2019-08-19 14:58:48]

ahuang11 commented 5 years ago

My guess is that pandas was able to detect both the Date and Time columns and automagically combined them into a pd.DatetimeIndex, while hvplot only took the first column, Date, and plotted that.

jbednar commented 5 years ago

That's some serious assumption making on the part of Pandas, but it makes sense!

michaelaye commented 5 years ago

Could it not simply be that within a day, the data points are plotted sequentially?

michaelaye commented 5 years ago

Ah we can check that by inspecting the returned axis I guess.

ahuang11 commented 5 years ago

You can still plot it like pandas if you specify it explicitly (although I don't know why pandas is offsetting the columns by one here):

import pandas as pd
import hvplot.pandas

df = pd.read_csv("report-319299697687-2019-08-07-to-2019-08-14 (1).csv", comment="#")
col = "Thermostat Temperature (C)"
df.head()
df[col].hvplot('index', col)
# Because of the column offset, the index holds the dates and "Date" holds the times.
df = df.reset_index()
df.index = pd.to_datetime(df['index'] + df['Date'], format='%Y-%m-%d%H:%M:%S')
df[col].hvplot('index', col)

michaelaye commented 5 years ago

I only get one plot (the last one) when I execute the above code?

michaelaye commented 5 years ago

pandas.read_csv needs to have the parameter index_col=False, because otherwise it takes the first column as an index outside the parsed column names. Then there's no offset in columns.
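
For reference, a minimal sketch of that behavior. The inline CSV is made up for illustration and assumes each data row ends with a trailing comma, which is one way a file ends up with one more field per row than the header has. Whether the ecobee export looks exactly like this has not been checked here.

```python
import io

import pandas as pd

# Hypothetical sample, not the real ecobee report: the header has 3 names,
# but every data row has 4 fields because of the trailing comma.
csv_text = (
    "# hypothetical sample\n"
    "Date,Time,Thermostat Temperature (C)\n"
    "2019-08-07,00:00:00,22.5,\n"
    "2019-08-07,00:05:00,22.4,\n"
    "2019-08-08,00:00:00,21.9,\n"
)
col = "Thermostat Temperature (C)"

# Default parsing: the first field of each row becomes the index, so every
# column name ends up one position off ("Date" actually holds the times).
df_default = pd.read_csv(io.StringIO(csv_text), comment="#")
print(df_default.index.tolist())
print(df_default["Date"].tolist())

# index_col=False: pandas does not use the first column as the index, and the
# column names line up with the data again.
df_fixed = pd.read_csv(io.StringIO(csv_text), comment="#", index_col=False)
print(df_fixed["Date"].tolist())
print(df_fixed[col].tolist())
```

With the default call the index carries the date strings, which would explain why plot() could label the axis with dates even though df[col] was selected.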

michaelaye commented 5 years ago

I'm very confused now. Correcting the read_csv parsing aligns the behavior of plot() and hvplot(), but why was it different before, then?

ahuang11 commented 5 years ago

hvplot's x-axis is just an arbitrary sequential index now; it doesn't recreate pandas' automagic merging of Date and Time.
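
If real timestamps are wanted on the x-axis, one option is to build the DatetimeIndex explicitly instead of relying on either library to infer it. A minimal sketch, assuming the frame was read with index_col=False so that Date and Time hold what their names say; the sample values below are made up:

```python
import pandas as pd

# Hypothetical rows standing in for the correctly parsed report.
df = pd.DataFrame({
    "Date": ["2019-08-07", "2019-08-07", "2019-08-08"],
    "Time": ["00:00:00", "00:05:00", "00:00:00"],
    "Thermostat Temperature (C)": [22.5, 22.4, 21.9],
})

# Join the two string columns and parse them into one timestamp per row.
df.index = pd.to_datetime(df["Date"] + " " + df["Time"],
                          format="%Y-%m-%d %H:%M:%S")

series = df["Thermostat Temperature (C)"]
print(series)

# With a DatetimeIndex in place, both calls plot against actual time:
# import hvplot.pandas
# series.hvplot()
# series.plot()
```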


philippjfr commented 3 years ago

Honestly, I don't know how matplotlib does this, since when you index with df[col] the times are dropped. I'm guessing it just divides the day by the number of entries for that date in the index and spaces them equally.

philippjfr commented 3 years ago

Alternatively, it simply plots against the sequential position and uses the index values to label the axis.