dangotbanned commented 1 month ago

Description

Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

No datasets are included in the package
- Instead, included is only a single 18.7KB file metadata.parquet
- The file describes all versions of all datasets
  - provided they are accessible via both npm and github
Strong support for typing
- Annotations are generated from the metadata itself
- https://github.com/vega/altair/blob/9e9deeb95668d2c4e7d30311e85a8f9f6acdc88c/altair/datasets/_typing.py
So far, 5 backends have been implemented, instead of only pandas
- These provide precise IDE completions, with a lot of help from https://github.com/narwhals-dev/narwhals
Users can opt-in to caching remote dataset requests
- With the "polars" backend, the slowest I've had on a cache-hit is 0.1s to load
  - https://cdn.jsdelivr.net/npm/vega-datasets@v2.9.0/data/flights-200k.json

Examples

These all come from the docstrings of:

Loader
Loader.from_backend
Loader.__call__

import altair as alt
from altair.datasets import Loader

data = Loader.from_backend("polars")
>>> data
Loader[polars]

cars = data("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame

data = Loader.from_backend("pandas")
cars = data("cars")

>>> type(cars)
pandas.core.frame.DataFrame

data = Loader.from_backend("pandas[pyarrow]")
cars = data("cars", tag="v1.29.0")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                string[pyarrow]
Miles_per_Gallon    double[pyarrow]
Cylinders            int64[pyarrow]
Displacement        double[pyarrow]
Horsepower           int64[pyarrow]
Weight_in_lbs        int64[pyarrow]
Acceleration        double[pyarrow]
Year                string[pyarrow]
Origin              string[pyarrow]
dtype: object

data = Loader.from_backend("pandas")
source = data("stocks", tag="v2.10.0")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')

data = Loader.from_backend("pyarrow")
source = data("stocks", tag="v2.10.0")

>>> source.column_names
['symbol', 'date', 'price']

Tasks

[x] Add test suite
- [x] (https://github.com/vega/altair/blob/b70aef883721ce1ce905e1ec8e82938eb4859257/tests/test_datasets.py)
- [x] Test/add handling for no remote connection (https://github.com/vega/altair/pull/3631/commits/90428a625bc3928684018d57861f608574812fd8)
[x] Finish docs
- [x] altair.datasets
  - [x] Loader.__call__
    - [x] Parameters
    - [x] Examples
  - [x] _readers._Reader
  - [x] _typing.Metadata (Align with revised descriptions from (https://github.com/vega/altair/pull/3631/commits/c572180ebc7d876714a38688c53f7e4af87abd93))
- [x] tools.datasets
  - [x] Application
  - [x] models.py
  - [x] github.py
  - [x] npm.py
  - [x] semver.py
[x] Mitigating _PyArrowReader JSON limitation (https://github.com/vega/altair/pull/3631/commits/3fbc759233fdf0203a2f8685245152732f57276a), (https://github.com/vega/altair/pull/3631/commits/4f5b4de6d894a1297bd2edfaecb72c5eefa48bc7)

Resolved

Investigate bundling metadata

- Investigating bundling metadata (https://github.com/vega/altair/pull/3631/commits/22a50396822dc48d4ed63bae3c8837dc28dab6ad), (https://github.com/vega/altair/pull/3631/commits/17923404866003e27a510be793ab65c290d8802a) - Depending on how well the compression scales, it might be reasonable to include this for some number of versions - Deliberately including redundant info early on - can always chip away at it later

npm does not have every version available GitHub

- Sources - [npm/vega-datasets](https://cdn.jsdelivr.net/npm/vega-datasets/) - **Fixed with:** https://data.jsdelivr.com/v1/packages/npm/vega-datasets - https://github.com/vega/vega-datasets/tags - Known missing - https://github.com/vega/vega-datasets/releases/tag/v1.9.0 - https://github.com/vega/vega-datasets/releases/tag/v1.6.0 - https://github.com/vega/vega-datasets/releases/tag/v1.4.0 - [feat(DRAFT): Add a source for available npm versions](https://github.com/vega/altair/pull/3631/commits/0bbf2e9ec2ff2f1d79b4d4f68128625daab2d947) - Need to add some handling to invalidate these entries returned from [list-repository-tags](https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags) once confirmed they cannot be requested from `npm` - Can technically request from `github`, but during testing this was much slower - Also, these versions **would not** have been available from https://github.com/altair-viz/vega_datasets, since that only used `npm`

Plan strategy for user-configurable dataset cache

- Everything so far has been building the tools for a compact bundled index - [1](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/vendor_datasets.py#L394), [2](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/_vega_datasets_data/metadata_full.parquet), [3](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/_vega_datasets_data/metadata_full-schema.json), [4](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/_vega_datasets_data/tags.parquet), [5](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/_vega_datasets_data/tags-schema.json) - Refreshing the index would not be included in `altair`, each release would simply ship with changes **baked in** - Trying to avoid bloating `altair` package size with datasets - **User-facing** - Goal of requesting each unique dataset version **once** - The user cache would not need to be updated between `altair` versions - Some kind of **opt-in** config to say *store the datasets in this directory please* - Basic solution would be defining an env variable like `ALTAIR_DATASETS_DIR` - When not provided, always perform remote requests - User motivation would be that it would be faster to enable caching

Deferred

Reducing cache footprint

e.g. storing the .(csv|tsv|json) files as .parquet
Need to do more testing on this though to ensure
- the shape of each dataset is preserved
- where relevant - intentional errors remain intact

Investigate providing a decorator to add a backend

Will be trivial for the user-side, since they don't need to be concerned about imports
Just need to provide these attributes:
- _name: LiteralString
- _read_fn: dict[Extension, Callable[..., IntoDataFrameT]]
- _scan_fn: dict[_ExtensionScan, Callable[..., IntoFrameT]]

Investigate ways to utilize (https://github.com/vega/vega-datasets/blob/main/SOURCES.md)

Can use a similar approach to #3600
Explored a little locally, finishing it off in this PR would add quite a lot of markdown parsing code

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`

How many datasets, size (per & total)?
What version range does a given sha cover?
Blocked: Running into issues with
- pandas/pyarrow group_by warnings
- https://github.com/pola-rs/polars/issues/18394
- Missing nw.Expr.(first|last)
- nw.Expr.(head|tail)(1) not equivalent in a group_by().agg(...) context
  - pandas -> scalar
  - polars -> list
- pl.Enum translating to non-ordered pd.Categorical

polars-native solution

```py from __future__ import annotations from pathlib import Path import polars as pl from altair.datasets import Loader, _readers data = Loader.from_backend("polars") # NOTE: Enable caching, populate with some responses data.cache_dir = Path.home() / ".altair_cache" data("cars") data("cars", tag="v1.5.0") data("movies") data("movies", tag="v1.24.0") data("jobs") if cache_dir := data.cache_dir: cached_stems: tuple[str, ...] = tuple(fp.stem for fp in cache_dir.iterdir()) else: msg = "Datasets cache unset" raise TypeError(msg) # NOTE: Lots of redundancies, many urls point to the same data (sha) >>> pl.read_parquet(_readers._METADATA).shape # (2879, 9) # NOTE: Version range per sha tag_sort: pl.Expr = pl.col("tag").sort() tag_range: pl.Expr = pl.concat_str(tag_sort.first(), tag_sort.last(), separator=" - ") # NOTE: Producing a name only when the file is already in the cache file_name: pl.Expr = pl.when(pl.col("sha").is_in(cached_stems)).then( pl.concat_str("sha", "suffix") ) cache_summary: pl.DataFrame = ( pl.scan_parquet(_readers._METADATA) .group_by("dataset_name", "suffix", "size", "sha") .agg(tag_range=tag_range) .select(pl.exclude("sha"), file_name=file_name) .sort("dataset_name", "size") .collect() ) >>> cache_summary.shape # (116, 5) >>> cache_summary.head(10) ``` ``` shape: (10, 5) ┌───────────────┬────────┬─────────┬───────────────────┬─────────────────────────────────┐ │ dataset_name ┆ suffix ┆ size ┆ tag_range ┆ file_name │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 ┆ str ┆ str │ ╞═══════════════╪════════╪═════════╪═══════════════════╪═════════════════════════════════╡ │ 7zip ┆ .png ┆ 3969 ┆ v1.5.0 - v2.10.0 ┆ null │ │ airports ┆ .csv ┆ 210365 ┆ v1.5.0 - v2.10.0 ┆ 608ba6d51fa70584c3fa1d31eb9453… │ │ annual-precip ┆ .json ┆ 266265 ┆ v1.29.0 - v2.10.0 ┆ null │ │ anscombe ┆ .json ┆ 1703 ┆ v1.5.0 - v2.10.0 ┆ null │ │ barley ┆ .json ┆ 8487 ┆ v1.5.0 - v2.10.0 ┆ 8dc50de2509b6e197ce95c24c98f90… │ │ birdstrikes ┆ .csv ┆ 1223329 ┆ v2.0.0 - v2.10.0 ┆ null │ │ birdstrikes ┆ .json ┆ 4183924 ┆ v1.5.0 - v1.31.1 ┆ null │ │ budget ┆ .json ┆ 374289 ┆ v1.5.0 - v2.8.1 ┆ null │ │ budget ┆ .json ┆ 391353 ┆ v2.9.0 - v2.10.0 ┆ null │ │ budgets ┆ .json ┆ 18079 ┆ v1.5.0 - v2.10.0 ┆ 8a909e24f698a3b0f6c637c30ec95e… │ └───────────────┴────────┴─────────┴───────────────────┴─────────────────────────────────┘ ```

dangotbanned commented 1 week ago

@MarcoGorelli I thought now might be a good time to loop you in on this PR.

Don't be alarmed by the size 😅, the narwhals stuff is just these two modules:

I'd appreciate any input really, but a few things I thought to mention:

`(Data|Lazy)Frame.filter(**constraints)`

This seems to be working for me (pandas, pyarrow, polars), using purely the public narhwals API:

https://github.com/vega/altair/blob/88d449154edc733e2b7ea682e28dd9eeb4a09b40/altair/datasets/_readers.py#L385-L397

Do you think something similar to BaseFrame._flatten_and_extract but for BaseFrame.filter and nw.expr.When might fit into narwhals? I'm happy with this living in altair since it is so simple, but that makes me question whether I've missed an obvious flaw.

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`

I've documented this in some detail at the end of the PR description, it intersects a few issues. It seems like I've been spoiled by polars, and having a hard time expressing the polars-native solution via narwhals.

Overall, I've found narwhals really enjoyable to use. You and all the other contributors have clearly put in a lot hard work into it 🎉

joelostblom commented 1 week ago

I really like this, thank you @dangotbanned! I'm looking forward to Altair's datasets being up to date, which will close a few long standing PRs and be of even more importance with all the upstream fixes that recently went into the Vega dataset repo. I'm suspecting this will also make it easier to maintain going forward than the previous approach, so great!

A few comments from looking over the diff briefly:

The old vega_datasets package made some dtypes conversion on load, e.g. converted the 'Year' column in cars to a date. It seems like it is read in as a string here based on the example in your comment. Should we add the same dtype conversions here as well or do you think it is better to skip them?
Could with_backend() be made optional and have a "smart default" based on a priority order and inspecting the available installed packages?
One of the benefits of having the data sets as part of the altair package would be simplified imports and less setup code for using them. To go further in that direction, I wonder if it would be possible to simplify the user-facing API (or provide a shortcut), so that instead of the current:
```
data = alt.datasets.Loader.with_backend("polars")
cars = data("cars")
```
The shorter version could look something like:
```
cars = alt.datasets.load('cars')
```
This is similar to what seaborn (sns.load_dataset('cars')) and plotly (px.dataset.cars()) uses as well. Those packages don't have as advanced functionality with the different backends, urls, etc, but it would be great if it was possible to somehow include a convenient shorter syntax (at least for the most common cases). Potentially, the shorter syntax could still be compatible with the methods of Loader or accept the most common ones as param names (such as with_backend and url)?

dangotbanned commented 1 week ago

I really like this, thank you @dangotbanned! I'm looking forward to Altair's datasets being up to date, which will close a few long standing PRs and be of even more importance with all the upstream fixes that recently went into the Vega dataset repo. I'm suspecting this will also make it easier to maintain going forward than the previous approach, so great!

Thanks for taking a look @joelostblom, really appreciate the feedback 😄

I'm going to split each comment into a separate response, to try and give enough detail on each

dangotbanned commented 1 week ago

Data types on read

@joelostblom

The old vega_datasets package made some dtypes conversion on load, e.g. converted the 'Year' column in cars to a date. It seems like it is read in as a string here based on the example in your comment. Should we add the same dtype conversions here as well or do you think it is better to skip them?

I did see this, but I thought it could quickly get complex between versions. movies is probably the most common one we use that has a change in column names between versions.

I'd prefer if we could solve this in a more generalized way, that is easier to maintain.

For example, with polars an issue is that read_json has quite limited options - when compared to read_csv. You can work around this though, and for cars you can see the difference here:

pl.read_json

```py import polars as pl >>> pl.read_json(fp_cars).schema.to_python() {'Name': str, 'Miles_per_Gallon': int, 'Cylinders': int, 'Displacement': float, 'Horsepower': int, 'Weight_in_lbs': int, 'Acceleration': float, 'Year': str, # <------------ 'Origin': str} ```

pl.read_json -> pl.read_csv

Here we can simply roundtrip to get the dates parsed: ```py from io import BytesIO buf = BytesIO() pl.read_json(fp_cars).write_csv(buf) >>> pl.read_csv(buf, try_parse_dates=True).schema.to_python() {'Name': str, 'Miles_per_Gallon': int, 'Cylinders': int, 'Displacement': float, 'Horsepower': int, 'Weight_in_lbs': int, 'Acceleration': float, 'Year': datetime.date, # <------------ 'Origin': str} ```

I've done some testing, and this resolves many cases:

_pl_read_json_roundtrip

```py def _pl_read_json_roundtrip(source: Path | IOBase, /, **kwds: Any) -> pl.DataFrame: """ Try to utilize better date parsing available in `pl.read_csv`_. `pl.read_json`_ has few options when compared to `pl.read_csv`_. Chaining the two together - *where possible* - is still usually faster than `pandas.read_json`_. .. _pl.read_json: https://docs.pola.rs/api/python/stable/reference/api/polars.read_json.html .. _pl.read_csv: https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html .. _pandas.read_json: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html """ from io import BytesIO import polars as pl df = pl.read_json(source, **kwds) if any(tp.is_nested() for tp in df.schema.dtypes()): # NOTE: Inferred as `(Geo|Topo)JSON`, which wouldn't be supported by `read_csv` return df buf = BytesIO() df.write_csv(buf) if kwds: SHARED_KWDS = {"schema", "schema_overrides", "infer_schema_length"} kwds = {k: v for k, v in kwds.items() if k in SHARED_KWDS} return pl.read_csv(buf, try_parse_dates=True, **kwds) ```

But there are some non-standard date formats that require manual intervention, e.g:

data("stocks").with_columns(pl.col("date").str.to_date("%b %-d %Y"))

pandas.read_json seems to have more options - I'm not that familiar with them though.

I should note though that you can pass in arguments to data(..., **kwds):

https://github.com/vega/altair/blob/f2823b42c1cd7d4d84a94845a8ccb0d6767756ce/altair/datasets/__init__.py#L157-L158

dangotbanned commented 1 week ago

Inferred Backend

@joelostblom

Could with_backend() be made optional and have a "smart default" based on a priority order and inspecting the available installed packages?

The main issue with this is that it would break IDE completions. Currently, you get the correct completions based on the initialized backend.

Screen recording

https://github.com/user-attachments/assets/2e2fa3c5-5f45-4ae6-9997-c9ce38dadc7b

pandas

![image](https://github.com/user-attachments/assets/c09b3b89-17b4-40da-951d-4d4411f99a9a)

polars

![image](https://github.com/user-attachments/assets/ee2d6bfe-4d02-418c-8eba-2a80380cadab)

This can be statically known since backend_name is used to match an @overload.

AFAIK, inferring based on installed packages would be too dynamic. Meaning we either return some opaque (Any) object - or say that every data(...) returns one of any of the supported types.

I'm open to ideas on this, but the options I can see seem like a worse user experience.

You only need to initialize it once anyway, so it isn't too different from how you'd use the various PluginRegistry classes

dangotbanned commented 1 week ago

Less setup code

@joelostblom

One of the benefits of having the data sets as part of the altair package would be simplified imports and less setup code for using them. To go further in that direction, I wonder if it would be possible to simplify the user-facing API (or provide a shortcut), so that instead of the current:
data = alt.datasets.Loader.with_backend("polars")
cars = data("cars")
The shorter version could look something like:
cars = alt.datasets.load('cars')

So, a lot of what I touched on in https://github.com/vega/altair/pull/3631#issuecomment-2480677838 would apply here.

I do agree though that we can shorten things a little bit. An easy one would be loader = Loader.with_backend:

loader

*We could alternatively define it as a function* ![image](https://github.com/user-attachments/assets/a958f5d1-eef9-43c8-900b-a9809611faaf)

Another option, which I suppose is currently an easter egg:

from altair.datasets import data

https://github.com/vega/altair/blob/f2823b42c1cd7d4d84a94845a8ccb0d6767756ce/altair/datasets/__init__.py#L323-L327 ![image](https://github.com/user-attachments/assets/46046e83-2d73-481b-b763-cbdb24dcb46c) The above will raise an import error if `pandas` isn't installed though.

   cars = alt.datasets.load('cars')

load specifically is an interesting one.

I'd had thoughts about adding a Loader.(json|load) method - to provide similar functionality to (US_10M|World_110M)

load would then align with json.load

This is similar to what seaborn (sns.load_dataset('cars')) and plotly (px.dataset.cars()) uses as well. Those packages don't have as advanced functionality with the different backends, urls, etc, but it would be great if it was possible to somehow include a convenient shorter syntax (at least for the most common cases). Potentially, the shorter syntax could still be compatible with the methods of Loader or accept the most common ones as param names (such as with_backend and url)?

While I was working on this I looked around for inspiration, finding sklearn.datasets. Quite quickly though I abandoned that thread due to how complex I found it.

Thanks for linking seaborn and plotly, they are much simpler and more what I was looking for earlier.

Interesting timing however, plotly now has backend support as of 3 days ago:

It uses narwhals as well (thanks to @FBruzzesi), and works in a pretty similar way to what I've done (minus the typing). The key part is there still needs to be an explicit return_type - which is used for an import.

@joelostblom for the user-side, this means a polars user of plotly writes:

import plotly.express as px

px.data.iris(return_type="polars")
px.data.gapminder(return_type="polars")
px.data.stocks(return_type="polars")

Whereas the same user in altair writes:

from altair.datasets import Loader

data = Loader.with_backend("polars")
data("iris")
data("gapminder")
data("stocks")

So although the entry requires a few extra characters, everything afterwards is shorter. This can add up when you consider how many datasets we have available:

alt.datasets._typing.Dataset

https://github.com/vega/altair/blob/f2823b42c1cd7d4d84a94845a8ccb0d6767756ce/altair/datasets/_typing.py#L22-L97

joelostblom commented 1 week ago

Thanks for the detailed responses! I will break up my replies too to keep them more readable.

1. Data types on read

Pandas

I agree that it would be great to solve this in a generalized way if possible. For pandas, it seems like there is no way to do this automatically for CSV files unfortunately. I also tried convert_dtypes and infer_objects, but it seems like none of the will identify and convert dates.

For json it seems like there is a flag in read_json to parse all date like formats automatically, so we could roundtrip to json from CSV if it is not too time consuming. This flag is enabled by default and seems to work well in my testing, e.g.

url = 'https://cdn.jsdelivr.net/npm/vega-datasets@v2.9.0/data/unemployment-across-industries.json'
pd.read_json(url).dtypes

series                 object
year                    int64
month                   int64
count                   int64
rate                  float64
date      datetime64[ns, UTC]  <--

Turning it off causes date to be parsed as an object as expected:

pd.read_json(url, convert_dates=False).dtypes

series     object
year        int64
month       int64
count       int64
rate      float64
date       object  <--

However, some columns that are not considered "date-like" by read_json were previously converted to dates, e.g. the integer Year column in the cars dataframe:

url = 'https://cdn.jsdelivr.net/npm/vega-datasets@v2.9.0/data/cars.json'
pd.read_json(url).dtypes

Name                 object
Miles_per_Gallon    float64
Cylinders             int64
Displacement        float64
Horsepower          float64
Weight_in_lbs         int64
Acceleration        float64
Year                 object  <--
Origin               object

The vega_datasets packages converted these to dates:

from vega_datasets import data

data.cars().dtypes

Name                        object
Miles_per_Gallon           float64
Cylinders                    int64
Displacement               float64
Horsepower                 float64
Weight_in_lbs                int64
Acceleration               float64
Year                datetime64[ns]  <--
Origin                      object

I am not sure this conversion is the right thing to todo since they really are integers and 1970 is not the same as 1970-01-01. I believe this was done to make the formatting on the axis look better (ie not be formatted as 1,970), but maybe there are better ways to do this now, e.g. using one of the alternatives suggested here https://github.com/vega/altair/discussions/3140, especially if Vega eventually changes this behavior as per https://github.com/vega/vega/issues/1681

Polars

I am less familiar with polars (although I have been using it more lately and enjoy it!), but your suggestion with round-tripping to csv seems reasonable, unless that is very time consuming (e.g. for the large flights dataset it might take long? or does the lazy loading/executing make this less of an issue?).

Non-dataframe data

What about the US10M and Worl_110M datasets? They are both loaded as dictionaries rather than dataframes in vega_datasets, so we would probably need to use special logic for those regardless I suppose.

joelostblom commented 1 week ago

2 & 3 Less setup code (e.g via the backend being optional)

I will reply to 2 and 3 together as my main concern is to have a shorter path of access to the sample data. The automatic type inference you have enabled via the current Loader.with_backend() approach is really cool and I want to make it clear that I'm not suggesting to remove it. I think we should keep it for the part of the user base that could take advantage of this effectively.

My suggestion with a shortcut that requires less setup, is mainly to cater towards users who are coming from more of a data analytics/science background, but are not as well-versed in programming/software. The focus here might be more on just getting some data to try things out versus understanding what it means to use a loader class and specify a backend. I also think the altair datasets are often used for demo purpose where it is convenient to have less typing and demonstrate how easy it is to access sample data.

The main issue with this is that it would break IDE completions.

Is it possible to design things so that this is left up to the user? With that I mean that if they want IDE completions, they need to specify a backend, but if they don't, they can still load in the data? Or is this not possible with the @overload design?

We do actually also have the native altair.Data class. Technically this could maybe be used as the default backend requiring no additional packages installed. However, no data type inference for columns would happen in this case, so this has no user-facing practical advantage over just passing the URL to the spec, since the type suffixes would be required on the column names in the altair spec in either case, and the data would look like a dictionary when displayed in the IDE.

Another option, which I suppose is currently an easter egg: from altair.datasets import data

What a convenient shortcut! Since this already does not involve any IDE completions, could we infer which dataframe package to use here before passing it to Loader.with_backend() on line 326? And advertise approach as a shortcut that is not as flexible/powerful as using the loader directly but sufficient in cases where users just want to get their hands on some data to try things out (so remove the deprecation warning). Possibly also changing the name to load instead of data since it is already within the datasets namespace?

Small note: I noticed that when I try alt.datasets.Loader.with_backend("pandas")('cars'), I get an error saying that I need parquet support installed although there the cars file is json and there is no parquet reading needed.

Click to see full traceback

``` --------------------------------------------------------------------------- ImportError Traceback (most recent call last) Cell In[7], line 2 1 import altair as alt ----> 2 alt.datasets.Loader.with_backend("pandas")('cars') File ~/proj/altair/altair/datasets/__init__.py:241, in Loader.__call__(self, name, suffix, tag, **kwds) 135 def __call__( 136 self, 137 name: Dataset | LiteralString, (...) 141 **kwds: Any, 142 ) -> IntoDataFrameT: 143 """ 144 Get a remote dataset and load as tabular data. 145 (...) 239 price: [[39.81,36.35,43.22,28.37,25.45,...,199.91,210.73,192.06,204.62,223.02]] 240 """ --> 241 return self._reader.dataset(name, suffix, tag=tag, **kwds) File ~/proj/altair/altair/datasets/_readers.py:138, in _Reader.dataset(self, name, suffix, tag, **kwds) 130 def dataset( 131 self, 132 name: Dataset | LiteralString, (...) 136 **kwds: Any, 137 ) -> IntoDataFrameT: --> 138 df = self.query(**validate_constraints(name, suffix, tag)) 139 it = islice(df.iter_rows(named=True), 1) 140 result = cast("Metadata", next(it)) File ~/proj/altair/altair/datasets/_readers.py:186, in _Reader.query(self, *predicates, **constraints) 172 def query( 173 self, *predicates: OneOrSeq[IntoExpr], **constraints: Unpack[Metadata] 174 ) -> nw.DataFrame[IntoDataFrameT]: 175 """ 176 Query multi-version trees metadata. 177 (...) 183 https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.filter.html 184 """ 185 frame = ( --> 186 nw.from_native(self.scan_fn(_METADATA)(_METADATA)) 187 .filter(_parse_predicates_constraints(predicates, constraints)) 188 .lazy() 189 .collect() 190 ) 191 if not frame.is_empty(): 192 return frame File ~/miniconda3/envs/altdata/lib/python3.12/site-packages/pandas/io/parquet.py:651, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) 498 @doc(storage_options=_shared_docs["storage_options"]) 499 def read_parquet( 500 path: FilePath | ReadBuffer[bytes], (...) 508 **kwargs, 509 ) -> DataFrame: 510 """ 511 Load a parquet object from the file path, returning a DataFrame. 512 (...) 648 1 4 9 649 """ --> 651 impl = get_engine(engine) 653 if use_nullable_dtypes is not lib.no_default: 654 msg = ( 655 "The argument 'use_nullable_dtypes' is deprecated and will be removed " 656 "in a future version." 657 ) File ~/miniconda3/envs/altdata/lib/python3.12/site-packages/pandas/io/parquet.py:67, in get_engine(engine) 64 except ImportError as err: 65 error_msgs += "\n - " + str(err) ---> 67 raise ImportError( 68 "Unable to find a usable engine; " 69 "tried using: 'pyarrow', 'fastparquet'.\n" 70 "A suitable version of " 71 "pyarrow or fastparquet is required for parquet " 72 "support.\n" 73 "Trying to import the above resulted in these errors:" 74 f"{error_msgs}" 75 ) 77 if engine == "pyarrow": 78 return PyArrowImpl() ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'. A suitable version of pyarrow or fastparquet is required for parquet support. Trying to import the above resulted in these errors: - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow. - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet. ```

Interesting timing however, plotly now has backend support as of 3 days ago:

Indeed interesting timing! I agree with you that there is not much difference when importing multiple package and using a non-default backend. However, a px user does have access to simply typing px.data.stocks() if they don't care about which backend to use. Currently this default case is much longer in altair with alt.datasets.Loader.with_backend('pandas')('stocks') (alternatively broken up over multiple lines with a separate import as in your examples). I think it could be helpful to have a alt.datasets.load('stocks') as per the above or even alt.datasets.stocks().

joelostblom commented 1 week ago

A new separate point: I notice in the docstring for the url class that it requires a backend to be set (or at least it seems like that from the example). Would it be possible to make the URL accessible without having any dataframe package installed since Altair can be used directly with the URL string even if pandas/polars etc are not installed?

jonmmease commented 1 week ago

Thanks for working on this @dangotbanned. I haven't looked deeply into the implementation, but wanted to share a few thoughts related to the API.

The most common workflow I've seen for the vega_datasets module is to import it on one line, and then load a dataset on another line.

from vega_datasets import data
source = data.movies()

I think we should strive to make this two line pattern as clear and concise as possible, and I feel like the paradigm of creating a loader object and using that to load the dataset makes this a little more complex. I'm definitely aligned with the desire to allow the type system to propagate the dataset names and DataFrame types, but I'd be interested in talking about the API a bit.

I would personally lean toward the return_type style you mentioned that plotly express is using, and have the default value of the argument be pandas. Another approach, that might be a little unorthodox, would be to select the backend type as part of the import.

So for pandas:

from altair.datasets.pd import data
df = data("movies")

or for polars

from altair.datasets.pl import data
df = data("movies")

I think what you have is very close to this, but we'd build the loader for the particular backend on import. Compared to the return_type flow, this has the advantage of not requiring you to pass the backend argument repeatedly for the case where you are loading multiple datasets.

dangotbanned commented 1 week ago

Thanks for all the feedback everyone!

@jonmmease (https://github.com/vega/altair/pull/3631#issuecomment-2480853741) @joelostblom (https://github.com/vega/altair/pull/3631#issuecomment-2480816377, https://github.com/vega/altair/pull/3631#issuecomment-2480832609, https://github.com/vega/altair/pull/3631#issuecomment-2480832711) @mattijn (https://github.com/vega/altair/discussions/3150#discussioncomment-11280516)

I'm a bit overwhelmed on where to start with responses, but please know that I have read through everything you guys have had to say (at least once 😅) and really appreciate it all.

I will try my best to answer/feed-back early next week. In the meantime, there may be things that I can clarify/give context for by linking to various docstrings in the PR.

Some generally useful bits, either for docs or to try running locally:

mattijn commented 1 week ago

A RFC well done🥳. I like the last approach as suggested by @jonmmease. It's concise and clear. Btw, I can imagine that implementing support for previous tag-versions of the Vega-datasets is useful during development, but would advocate to implement a single version (the most recent) as static constant. Similar to what we have for the Vega, Vega-Lite and Vega-Embed versions.

dangotbanned commented 1 week ago

A RFC well done🥳.

Thanks @mattijn

Btw, I can imagine that implementing support for previous tag-versions of the Vega-datasets is useful during development, but would advocate to implement a single version (the most recent) as static constant. Similar to what we have for the Vega, Vega-Lite and Vega-Embed versions.

This one I think I have a short answer for! One thing to consider is what does the transition from vega_datasets -> altair.datasets look like for us?

We're 4 years behind the current version, and a sudden change will break a lot of the docs. Off the top of my head, movies has a lot of renames and some datasets like iris were removed.

If we have the tag available, we could use that as a pin for any specific examples that stop working. I think that could smoothen things out until we get everything else caught up - and make it possible to split up the work over time.

Now I'm thinking about it, I don't think I wrote this down anywhere 🤦‍♂️ - but in my head I thought it would make getting this PR merged easier

dangotbanned commented 1 week ago

@joelostblom

Thanks for the detailed responses! I will break up my replies too to keep them more readable.

1. Data types on read

Pandas

I agree that it would be great to solve this in a generalized way if possible. For pandas, it seems like there is no way to do this automatically for CSV files unfortunately. I also tried convert_dtypes and infer_objects, but it seems like none of the will identify and convert dates.

Thanks for investigating the pandas-side of things.

I think there are a few things we could do upstream in https://github.com/vega/vega-datasets to make this easier for us to get consistent results across dataframe packages. These ideas aren't all mutually-exclusive, but some may remove the need for others:

Standardise/replace SOURCES.md

For this task specifically, having a schema/format string listed for all the datasets would be helpful. Especially when something isn't ISO 8601, e.g:

Ideally this would be defined in a tabular format or JSON - since we could extract that data more reliably than from markdown.

Use a consistent date, time, datetime format

There is a history of various datetime formats changing to resolve parsing errors:

Switch to file formats that can encode data types

Definitely less likely to be accepted (https://github.com/vega/vega-datasets/pull/628#issuecomment-2476299461) than the other two, but would be the most reliable and fastest (for python) option.

I can't seem to get the issue comments to load, but these demonstrate using parquet

Probably blocked until (https://github.com/vega/vega/issues/3961) is resolved, sadly

joelostblom commented 1 week ago

Switch to file formats that can encode data types

This sounds like the ideal option to me, great idea! Rather than "switch", maybe we could just add a parquet file for each existing data set in addition to the current file types? Then we could load the parquet with specified dtypes by default in altair, but it would still be possible to load other formats for demonstration purposes, and no existing VL examples would break or need updating. Over time, we could probably reduce the number of datasets that provide non-parquet formats to reduce the overall size of the package/repo.

mattijn commented 1 week ago

One thing to consider is what does the transition from vega_datasets -> altair.datasets look like for us?

Good point. I would say, for this PR, we will use the tagged dataset version that is close to the currently used version in our documentation. This way, it will almost not break any examples in the docs and can be added as a feature in a minor release (5.5).

As a follow-up, we can make a PR to update the tagged dataset version to the latest. That change will need updates to all our documentation. I think it makes sense to do this in a major release (6.0), but I am also open to do this in a minor release.

dangotbanned commented 1 week ago

https://github.com/vega/altair/pull/3631#issuecomment-2481605587

Switch to file formats that can encode data types

https://github.com/vega/altair/pull/3631#issuecomment-2481647641

This sounds like the ideal option to me, great idea! Rather than "switch", maybe we could just add a parquet file for each existing data set in addition to the current file types? Then we could load the parquet with specified dtypes by default in altair, but it would still be possible to load other formats for demonstration purposes, and no existing VL examples would break or need updating. Over time, we could probably reduce the number of datasets that provide non-parquet formats to reduce the overall size of the package/repo.

@joelostblom

Apologies, I used the wrong term there, your suggestion is actually what I proposed in:

Also there was this (https://github.com/vega/vega-datasets/issues/627#issuecomment-2481321593) which has some extra related context.

So I think we're both in agreement that this would be the best path forward. But we would need @domoritz on board as well (https://github.com/vega/vega-datasets/issues/627#issuecomment-2476410307)

domoritz commented 1 week ago

Let me know if there is anything you need from the vega-daatsets repo side. Would it make sense for some logic to live there so that it gets updated if we add a new datasets for example?

joelostblom commented 1 week ago

@domoritz Would it be reasonable to have a parquet (or other binary format) file available for each of the vega datasets? The primary benefit for us would be that the column data types are stored with the data and we would not need to inclue custom column dtype conversion logic to e.g. turn strings into dates (which we used to have in the old vega datasets python package)

domoritz commented 1 week ago

Hmm, would you want to have that in the npm package or just the repo?

mattijn commented 1 week ago

I find the term Loader as user-facing API a bit confusing, since I feel I'm not loading urls. I like the current data-source-first approach

The current approach:

from vega_datasets import data

source_url = data.cars.url
source_pandas = data.cars()

Could we consider something like this instead?

from altair.datasets import data
# from vega_datasets import data  # DeprecationError: update import

source_url = data.cars.url
source_pandas = data.cars.to_pandas()
source_polars = data.cars.to_polars()
# source_pandas = data.cars()  # DeprecationWarning: be explicit, use updated syntax

This approach would allow .to_pandas() to handle any backend-specific nuances, such as engine="pyarrow" or spatial="geopandas".

Regarding backend=pandas["pyarrow"], I found it a bit unclear, especially since this option isn’t mentioned in the pandas optional dependencies installation guide. The pandas user guide uses engine="pyarrow", so opting for this for consistency.

Edit: The Loader class can still be used as base class under the hood.

dangotbanned commented 1 week ago

Would it be reasonable to have a parquet (or other binary format) file available for each of the vega datasets? The primary benefit for us would be that the column data types are stored with the data and we would not need to inclue custom column dtype conversion logic to e.g. turn strings into dates (which we used to have in the old vega datasets python package)

@joelostblom

Would a less-heavy compromise be doing this for a subset of the datasets?

While I was looking for test cases for (https://github.com/vega/altair/pull/3631#discussion_r1845658286) - it seemed like we might not need this for all of them

test_polars_read_json_roundtrip

https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/tests/test_datasets.py#L463-L481

@domoritz would you feel more comfortable with this if we were talking about either:

All datasets with a ...

format string column (usually temporal data type)
format string column, which this PR can demonstrate has inconsistent parsing (e.g. pandas != polars, etc)
non-ISO 8601 format string column

Hmm, would you want to have that in the npm package or just the repo?

@domoritz

If it were just resources that we'd be bundling with altair, the GitHub repo option would work. We can support that with very few changes to https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/tools/datasets/github.py

However, the main benefit that npm brings for this use case is the lack of rate limiting or need for an auth token. For GitHub, this would be limited to:

Unauth: 60 requests/hour
Auth: 5000 requests/hour

I think hitting the unauthenticated limit would probably be quite easy to do in a notebook environment. CI workflows may also face issues, even with the higher limit if the project is large enough.

This could be mitigated by opting-in to caching, but it could be frustrating to encounter the rate limit as a beginner.

Loader.cache_dir

https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/altair/datasets/__init__.py#L290-L311

joelostblom commented 1 week ago

@joelostblom Would a less-heavy compromise be doing this for a subset of the datasets?

While I was looking for test cases for (https://github.com/vega/altair/pull/3631#discussion_r1845658286) - it seemed like we might not need this for all of them test_polars_read_json_roundtrip

@domoritz would you feel more comfortable with this if we were talking about either:

All datasets with a ...

format string column (usually temporal data type)

format string column, which this PR can demonstrate has inconsistent parsing (e.g. pandas != polars, etc)

non-ISO 8601 format string column

Yup, this sounds like a good compromise good to me.

domoritz commented 1 week ago

I'm supportive of using proper date types and having parquet files in the Vega datasets repo. I'd prefer not to have the parquet files in npm.

dangotbanned commented 1 week ago

I'm supportive of using proper date types and having parquet files in the Vega datasets repo. I'd prefer not to have the parquet files in npm.

@domoritz is there a solution where we can use GitHub via jsdelivr? https://www.jsdelivr.com/documentation#id-github

I'd been confused by what use-case this endpoint would serve that wasn't available via get-a-blob

I haven't found any mention of rate limits there, so maybe this is a motivating scenario?

domoritz commented 1 week ago

In the Vega editor, we download releases btw https://github.com/vega/editor/blob/92d95435062bb8d0b53fe86772432f5eaef51719/scripts/vendor.sh#L19. Do you know whether there are rate limits on releases?

dangotbanned commented 1 week ago

In the Vega editor, we download releases btw vega/editor@92d9543/scripts/vendor.sh#L19. Do you know whether there are rate limits on releases?

@domoritz

If I've read that correctly, that would be what I'm referring to as tags - so i think so?

https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/tools/datasets/github.py#L250-L297

The rate limit endpoint is free - so you could try it before and after to test?

But the jsdelivr one seems to be something different - while still effectively accessing the same resource?

Edit

GitHub seems to be accessible using:

Whereas npm looks like:

domoritz commented 1 week ago

I did a quick test. Running

gh api \
                                                               -H "Accept: application/vnd.github+json" \
                                                               -H "X-GitHub-Api-Version: 2022-11-28" \
                                                               /rate_limit

before and after yarn in the Vega editor. Nothing changed about remaining. But I am also just downloading via my ip address so I don't see why this would be measured in the rate limits.

jsdelivr caches as far as I understand.

I think there are plenty of options for downloading so I woud be surprised if we are locking ourselves into something

dangotbanned commented 6 days ago

I think there are plenty of options for downloading so I woud be surprised if we are locking ourselves into something

@domoritz yeah I think we should go ahead with this route!

While this is being considered, I feel the need to bring up pyarrow & .json datasets. I'm doing some gymnastics in that backend to account for only line-delimited JSON being natively supported:

_PyArrowReader

https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/altair/datasets/_readers.py#L351-L411

From a quick search, that accounts for 34 datasets. If we were able to access these as .parquet - then pyarrow could read everything except the spatial datasets natively.

I know this changes the scope of the request, but I thought I'd ask anyway 😅 All of the backends would benefit from the performance and data types as a bonus.

mattijn commented 6 days ago

Maybe better is something that can be explained as a "source that is expressed as dataframe or url".

from altair.datasets import data

source_pandas = data.cars.to_df(engine="pandas")
# data.cars.to_df()  # default `pandas`
# data.cars.url

source_polars = data.cars.to_df(engine="polars")

source_pd_pa = data.cars.to_df(engine="pandas[pyarrow]")  # ? not sure if this is correct

For spatial data this can map to:

data.world.to_gdf(engine="geopandas")

I like the word engine more than backend, and it aligns with pandas, where this user-facing approach can still include the Loader class under the hood for all the magic.

dangotbanned commented 6 days ago

(https://github.com/vega/altair/pull/3631#issuecomment-2484826592), (https://github.com/vega/altair/pull/3631#issuecomment-2487526226)

I find the term Loader as user-facing API a bit confusing, since I feel I'm not loading urls. I like the current data-source-first approach

@mattijn

I'm not convinced that what you've proposed solves the stated problem.

To me, two much simpler options would be:

Rename Loader to Source
Add a url function to altair.datasets

The second option would complement the changes from (https://github.com/vega/altair/pull/3631#discussion_r1847176465) and (https://github.com/vega/altair/pull/3631/commits/7ddb2a8c1e8ec6477cfc646c385e0b168f2fd330) which @jonmmease and @joelostblom are happy with.

It would also open a path to a dataframe-package-free method of url retrieval (https://github.com/vega/altair/pull/3631#discussion_r1846662053). For that part I would like to wait until we know what changes these might have to urls:

Reference implementation

mattijn commented 4 days ago

I'm reviewing as an average user of Altair and for this use-case it is probably an associate professor who will need to update all her/his lecture materials at the evening before the semester starts.

What would it be great if we could say:

# old way (this is deprecated)
from vega_datasets import data

# new way (this will be awesome)
from altair.datasets import data

And everything else is still functioning. So this still works:

source_url = data.cars.url
source_pandas = data.cars()

But, the awesome thing that we provide with this PR is:

source_polars = data.cars(backend="polars")  # or `engine=`

Or polars with pyarrow dtypes:

source_pl_pa = data.cars(backend="polars[pyarrow]")  # or `engine=`

If it is like this than I'm fine with engine or backend as argument name. And than within this function we call the agnostic Loader using the dataset and backend choice. All in all, in my humble opinion, awesomeness.

dangotbanned commented 2 days ago

@jonmmease I just tried updating this branch, seems to be some vegafusion issues?

9d97096 (#3631)

Update

Resolved in #3702

vega / altair

feat(RFC): Adds `altair.datasets` #3631

Related

Description

Examples

Tasks

Resolved

Deferred

Reducing cache footprint

Investigate providing a decorator to add a backend

Investigate ways to utilize (https://github.com/vega/vega-datasets/blob/main/SOURCES.md)

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`

`(Data|Lazy)Frame.filter(**constraints)`

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`

Data types on read

Inferred Backend

Less setup code

1. Data types on read

Pandas

Polars

Non-dataframe data

2 & 3 Less setup code (e.g via the backend being optional)

1. Data types on read

Pandas

Standardise/replace SOURCES.md

Use a consistent date, time, datetime format

Switch to file formats that can encode data types

Edit

Reference implementation

Update

vega / altair

feat(RFC): Adds `altair.datasets` #3631

Related

Description

Examples

Tasks

Resolved

Deferred

Reducing cache footprint

Investigate providing a decorator to add a backend

Investigate ways to utilize (https://github.com/vega/vega-datasets/blob/main/SOURCES.md)

Provide more meaningful info on the state of ALTAIR_DATASETS_DIR

(Data|Lazy)Frame.filter(**constraints)

Provide more meaningful info on the state of ALTAIR_DATASETS_DIR

Data types on read

Inferred Backend

Less setup code

1. Data types on read

Pandas

Polars

Non-dataframe data

2 & 3 Less setup code (e.g via the backend being optional)

1. Data types on read

Pandas

Standardise/replace SOURCES.md

Use a consistent date, time, datetime format

Switch to file formats that can encode data types

Edit

Reference implementation

Update

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`

`(Data|Lazy)Frame.filter(**constraints)`

Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`