vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

TimeUnit 'day' parsed differently for Pandas DataFrame and .csv data #2413

Open robmitchellzone opened 3 years ago

robmitchellzone commented 3 years ago

When creating a chart from a Pandas dataframe with a datetime64[ns] column and encoding with the day timeUnit, Altair behaves as expected. However, when trying to create the same chart with data loaded from a .csv, Altair shifts the day timeUnit to a day earlier.

Example:

import pandas as pd
import altair as alt
import random
alt.data_transformers.enable('csv')

test = pd.DataFrame({
    'Date': pd.date_range('2020-01-01', '2020-01-31'),
    'Value': [random.randint(0, 10) for _ in range(31)]
})

test.to_csv('test.csv')

alt.Chart(test).mark_bar().encode(
    x='day(Date):O',
    y='sum(Value):Q',
).properties(
    title='Behaves Correctly'
)

[Screenshot: chart built from the DataFrame, days plotted correctly]

alt.Chart('test.csv').mark_bar().encode(
    x='day(Date):O',
    y='sum(Value):Q'
).properties(
    title='Days Shifted'
)

[Screenshot: chart built from test.csv, days shifted one day earlier]

jakevdp commented 3 years ago

Thanks - this is an issue with how the dates are serialized to CSV, combined with the fact that JavaScript date parsing chooses a different timezone depending on how the date is formatted. In general, I recommend avoiding CSV serializations of data for many reasons, including this one.

robmitchellzone commented 3 years ago

Is the solution to use JSON instead? Or not use datetime formats with datasets that are large enough to require storing in a file?

jakevdp commented 3 years ago

CSV is fine as long as dates are represented by the full ISO-8601 string. If you use Altair's data transformers rather than pandas to write the file, the dates will be serialized correctly.

The fundamental issue stems from JavaScript's built-in date parsing, which is what the Vega-Lite renderer uses to parse dates stored as strings:

> new Date('2020-11-20')
Thu Nov 19 2020 16:00:00 GMT-0800 (Pacific Standard Time)
> new Date('2020-11-20T00:00:00')
Fri Nov 20 2020 00:00:00 GMT-0800 (Pacific Standard Time)
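
For readers more at home in Python, the two cases above can be mimicked with the standard library. This is only a rough sketch: `parse_like_js` is a hypothetical helper, not part of Altair or Vega-Lite, and the fixed UTC-8 offset is an assumption standing in for the Pacific Standard Time zone shown in the console output.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset stands in for the browser's local zone
# (Pacific Standard Time in the console output above).
PST = timezone(timedelta(hours=-8), "PST")


def parse_like_js(text, local_tz=PST):
    """Hypothetical helper mimicking the two parsing cases shown above."""
    parsed = datetime.fromisoformat(text)
    if "T" not in text:
        # Date-only string: read as midnight UTC, then shown in local time.
        return parsed.replace(tzinfo=timezone.utc).astimezone(local_tz)
    # Date-time string without an offset: read as local time.
    return parsed.replace(tzinfo=local_tz)


date_only = parse_like_js("2020-11-20")
date_time = parse_like_js("2020-11-20T00:00:00")

print(date_only.isoformat())  # 2020-11-19T16:00:00-08:00
print(date_time.isoformat())  # 2020-11-20T00:00:00-08:00
```

The date-only string lands on the previous calendar day once it is shown in a zone behind UTC, which matches the one-day shift in the chart.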

When you pass a dataframe to alt.Chart, Altair ensures that dates are serialized as full ISO-8601 strings, and so they will be parsed correctly. Pandas to_csv() does not do this: it writes date-only strings such as 2020-01-01, which JavaScript reads as midnight UTC, so in timezones behind UTC they fall on the previous day.
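
If the file still has to be written with pandas, one possible workaround is to ask to_csv() for the full date-time form through its date_format parameter. The sketch below assumes the Date column is a datetime64 column as in the original example, swaps the random values for range(31) to keep the output predictable, and writes to an in-memory buffer only so the result can be inspected; passing a path such as 'test.csv' works the same way.

```python
import io

import pandas as pd

test = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", "2020-01-31"),
    "Value": range(31),  # placeholder values instead of random ones
})

# date_format is applied to datetime columns; including the "T" and the
# time of day yields the date-time form rather than a bare date.
buffer = io.StringIO()
test.to_csv(buffer, index=False, date_format="%Y-%m-%dT%H:%M:%S")

csv_lines = buffer.getvalue().splitlines()
first_row = csv_lines[1]
print(first_row)  # 2020-01-01T00:00:00,0
```

Whether this is preferable to Altair's own data transformers depends on the workflow; the transformers handle the date formatting without any extra arguments.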

robmitchellzone commented 3 years ago

I see. Thanks for the explanation.