vega / vega-lite

A concise grammar of interactive graphics, built on Vega.
https://vega.github.io/vega-lite/
BSD 3-Clause "New" or "Revised" License
4.65k stars 604 forks source link

`yearweek` and `week` timeUnit transform not following ISO8601 #9369

Open mattijn opened 3 months ago

mattijn commented 3 months ago

As was mentioned in https://github.com/vega/altair/issues/3433.

Given the following duckdb sql query:

import pandas as pd
import duckdb
data = {'date': ['2012-01-01', '2022-01-01', '2024-06-10']}

df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df.date)
print(duckdb.query("SELECT YEARWEEK(date) FROM df").to_df())

Will return

   yearweek(date)
0          201152
1          202152
2          202424

But the following Vega-Lite does not:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.17.0.json",
  "data": {
    "values": [
      {"date": "2012-01-01T00:00:00"},
      {"date": "2022-01-02T00:00:00"},
      {"date": "2024-06-10T00:00:00"}
    ]
  },
  "mark": {"type": "bar"},
  "encoding": {
    "y": {"field": "date", "timeUnit": "yearweek", "type": "nominal"},
    "tooltip": [
      {"field": "date", "type": "temporal"},
      {
        "field": "date",
        "format": "%Y W%U (%A)",
        "timeUnit": "yearweek",
        "title": "Vega week: sunday-based",
        "type": "temporal"
      }
    ]
  }
}

Returns

image

Open the Chart in the Vega Editor

Can we make both the week and yearweek timeUnit transform follows ISO8601 so it is monday-based and not sunday-based (as sql engines such duckdb are doing as well: https://duckdb.org/docs/sql/functions/datepart.html#weekdate)?

joelostblom commented 3 months ago

I'm in support for making timeunit transforms compatible with the ISO time and date standards. For backwards compatibility purposes, this probably needs to either happen in a major release or be configurable.

domoritz commented 3 months ago

Does this need to happen in Vega-Lite or Vega?

mattijn commented 3 months ago

This requires a change to the TimeUnit Transform in Vega. I can reduce the issue to the following Vega spec:

Open the Chart in the Vega Editor

{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "data": [
    {
      "name": "SOURCE",
      "values": [
        {"date": "2012-01-01T00:00:00"},
        {"date": "2022-01-02T00:00:00"},
        {"date": "2024-06-10T00:00:00"}
      ]
    },
    {
      "name": "APPLY_TIMEUNIT",
      "source": "SOURCE",
      "transform": [
        {"type": "formula", "expr": "toDate(datum.date)", "as": "date_in_ms"},
        {
          "field": "date_in_ms",
          "type": "timeunit",
          "units": ["year", "week"],
          "as": ["yearweek_date", "yearweek_date_end"]
        }
      ]
    }
  ]
}

image

PBI-David commented 3 months ago

@mattijn just an FYI but running your spec gives me this which is different to what you have, Is this not all down to UTC and how the dates are parsed?

image

mattijn commented 3 months ago

Thanks for commenting @PBI-David! It made me realize that my used example is not pointing out the situation as described. I had another look and came one step further. Making me to believe it is a vega-lite issue!

Adapted spec, which should be time zone proof (Vega-Editor):

{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "data": [
    {
      "name": "SOURCE",
      "values": [
        {"date": "2012-01-01T00:00:00", "year": 2012, "month": 0, "day": 1},
        {"date": "2022-01-02T00:00:00", "year": 2022, "month": 0, "day": 2},
        {"date": "2024-06-10T00:00:00", "year": 2024, "month": 5, "day": 10}
      ]
    },
    {
      "name": "APPLY_TIMEFORMAT",
      "source": "SOURCE",
      "transform": [
       {"type": "formula", "expr": "utc(datum.year, datum.month, datum.day)", "as": "date_ms"},
       {"type": "formula", "expr": "timeFormat(datum.date_ms, timeUnitSpecifier(['year', 'week']))", "as": "date_yearweek"},
       {"type": "formula", "expr": "timeFormat(datum.date_ms, timeUnitSpecifier(['week'], {'week': 'W%V'}))", "as": "date_isoweek"},
       {"type": "formula", "expr": "timeFormat(datum.date_ms, timeUnitSpecifier(['year'], {'year': '%G'}))", "as": "date_isoyear"},
       {"type": "formula", "expr": "timeFormat(datum.date_ms, timeUnitSpecifier(['year', 'week'], {'year-week': '%G W%V'}))", "as": "date_isoyearweek"}
      ]
    }
  ]
}

image

Basically I think the TimeMultiFormat specifier should be extended for the timeUnitSpecifier with the following:

"timeFormat(value, timeUnitSpecifier(units, {'week': 'W%V', 'year-week': '%G W%V'}))"

Where I've the feeling, based on these tests:

'timeFormat(datum.x, timeUnitSpecifier(["year","month","date","hours","minutes","seconds"], {"year-month":"%b %Y ","year-month-date":"%b %d, %Y "}))'

That it currently is just the Vega default with the following Vega-Lite adaptation: {"year-month":"%b %Y ","year-month-date":"%b %d, %Y "}.

Addendum: choices are, or include an adaptation in Vega-Lite or change the default in Vega.

PBI-David commented 3 months ago

@mattijn for completeness, I now see the same as you.

image