dangotbanned opened 1 month ago
@MarcoGorelli I thought now might be a good time to loop you in on this PR.
Don't be alarmed by the size 😅, the `narwhals` stuff is just these two modules:
I'd appreciate any input really, but a few things I thought to mention:
(Data|Lazy)Frame.filter(**constraints)
This seems to be working for me (`pandas`, `pyarrow`, `polars`), using purely the public `narwhals` API:
Do you think something similar to `BaseFrame._flatten_and_extract` but for `BaseFrame.filter` and `nw.expr.When` might fit into `narwhals`?
I'm happy with this living in `altair` since it is so simple, but that makes me question whether I've missed an obvious flaw.
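For reference, a minimal sketch of the general idea (not the PR's implementation; the helper name is illustrative) - each keyword becomes an equality expression that the wrapped frame is filtered on:

```py
import narwhals as nw


def filter_constraints(frame, **constraints):
    """Hypothetical helper: keep rows where each named column equals its value."""
    data = nw.from_native(frame)
    predicates = [nw.col(name) == value for name, value in constraints.items()]
    return data.filter(*predicates).to_native() if predicates else frame
```

e.g. `filter_constraints(cars_df, Origin="USA", Cylinders=4)` would behave the same for `pandas`, `pyarrow` and `polars` inputs.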
ALTAIR_DATASETS_DIR
I've documented this in some detail at the end of the PR description, it intersects a few issues.
It seems like I've been spoiled by `polars`, and I'm having a hard time expressing the `polars`-native solution via `narwhals`.
Overall, I've found `narwhals` really enjoyable to use.
You and all the other contributors have clearly put a lot of hard work into it 🎉
I really like this, thank you @dangotbanned! I'm looking forward to Altair's datasets being up to date, which will close a few long standing PRs and be of even more importance with all the upstream fixes that recently went into the Vega dataset repo. I'm suspecting this will also make it easier to maintain going forward than the previous approach, so great!
A few comments from looking over the diff briefly:
- The old `vega_datasets` package made some dtypes conversion on load, e.g. converted the `'Year'` column in cars to a date. It seems like it is read in as a string here based on the example in your comment. Should we add the same dtype conversions here as well or do you think it is better to skip them?
- Could `with_backend()` be made optional and have a "smart default" based on a priority order and inspecting the available installed packages?
- One of the benefits of having the data sets as part of the altair package would be simplified imports and less setup code for using them. To go further in that direction, I wonder if it would be possible to simplify the user-facing API (or provide a shortcut), so that instead of the current:
```py
data = alt.datasets.Loader.with_backend("polars")
cars = data("cars")
```
The shorter version could look something like:
```py
cars = alt.datasets.load('cars')
```
This is similar to what seaborn (`sns.load_dataset('cars')`) and plotly (`px.dataset.cars()`) use as well. Those packages don't have as advanced functionality with the different backends, urls, etc, but it would be great if it was possible to somehow include a convenient shorter syntax (at least for the most common cases). Potentially, the shorter syntax could still be compatible with the methods of `Loader` or accept the most common ones as param names (such as `with_backend` and `url`)?
> I really like this, thank you @dangotbanned! I'm looking forward to Altair's datasets being up to date, which will close a few long standing PRs and be of even more importance with all the upstream fixes that recently went into the Vega dataset repo. I'm suspecting this will also make it easier to maintain going forward than the previous approach, so great!
Thanks for taking a look @joelostblom, really appreciate the feedback 😄
I'm going to split each comment into a separate response, to try and give enough detail on each
@joelostblom
> The old `vega_datasets` package made some dtypes conversion on load, e.g. converted the `'Year'` column in cars to a date. It seems like it is read in as a string here based on the example in your comment. Should we add the same dtype conversions here as well or do you think it is better to skip them?
I did see this, but I thought it could quickly get complex between versions.
`movies` is probably the most common one we use that has a change in column names between versions.
I'd prefer if we could solve this in a more generalized way, that is easier to maintain.
For example, with `polars` an issue is that `read_json` has quite limited options when compared to `read_csv`.

You can work around this though, and for `cars` you can see the difference here:
`pl.read_json`

```py
import polars as pl

>>> pl.read_json(fp_cars).schema.to_python()
{'Name': str,
 'Miles_per_Gallon': int,
 'Cylinders': int,
 'Displacement': float,
 'Horsepower': int,
 'Weight_in_lbs': int,
 'Acceleration': float,
 'Year': str,  # <------------
 'Origin': str}
```
`pl.read_json` -> `pl.read_csv`

Here we can simply roundtrip to get the dates parsed:

```py
from io import BytesIO

buf = BytesIO()
pl.read_json(fp_cars).write_csv(buf)
>>> pl.read_csv(buf, try_parse_dates=True).schema.to_python()
{'Name': str,
 'Miles_per_Gallon': int,
 'Cylinders': int,
 'Displacement': float,
 'Horsepower': int,
 'Weight_in_lbs': int,
 'Acceleration': float,
 'Year': datetime.date,  # <------------
 'Origin': str}
```
I've done some testing, and this resolves many cases:
`_pl_read_json_roundtrip`

```py
def _pl_read_json_roundtrip(source: Path | IOBase, /, **kwds: Any) -> pl.DataFrame:
    """
    Try to utilize better date parsing available in `pl.read_csv`_.

    `pl.read_json`_ has few options when compared to `pl.read_csv`_.

    Chaining the two together - *where possible* - is still usually faster than `pandas.read_json`_.

    .. _pl.read_json:
        https://docs.pola.rs/api/python/stable/reference/api/polars.read_json.html
    .. _pl.read_csv:
        https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html
    .. _pandas.read_json:
        https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
    """
    from io import BytesIO

    import polars as pl

    df = pl.read_json(source, **kwds)
    if any(tp.is_nested() for tp in df.schema.dtypes()):
        # NOTE: Inferred as `(Geo|Topo)JSON`, which wouldn't be supported by `read_csv`
        return df
    buf = BytesIO()
    df.write_csv(buf)
    if kwds:
        SHARED_KWDS = {"schema", "schema_overrides", "infer_schema_length"}
        kwds = {k: v for k, v in kwds.items() if k in SHARED_KWDS}
    return pl.read_csv(buf, try_parse_dates=True, **kwds)
```
But there are some non-standard date formats that require manual intervention, e.g:

```py
data("stocks").with_columns(pl.col("date").str.to_date("%b %-d %Y"))
```
`pandas.read_json` seems to have more options - I'm not that familiar with them though.

I should note though that you can pass in arguments to `data(..., **kwds)`:
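For example (a hedged illustration, assuming the keywords are forwarded to the backend's reader as in `_pl_read_json_roundtrip` above):

```py
import polars as pl

from altair.datasets import Loader

data = Loader.with_backend("polars")
# keep "Year" as a plain string instead of letting the roundtrip parse it
cars = data("cars", schema_overrides={"Year": pl.Utf8})
```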
@joelostblom
> Could `with_backend()` be made optional and have a "smart default" based on a priority order and inspecting the available installed packages?
The main issue with this is that it would break IDE completions. Currently, you get the correct completions based on the initialized backend.
https://github.com/user-attachments/assets/2e2fa3c5-5f45-4ae6-9997-c9ce38dadc7b
![image](https://github.com/user-attachments/assets/c09b3b89-17b4-40da-951d-4d4411f99a9a)
![image](https://github.com/user-attachments/assets/ee2d6bfe-4d02-418c-8eba-2a80380cadab)
This can be statically known since `backend_name` is used to match an `@overload`.
AFAIK, inferring based on installed packages would be too dynamic, meaning we would either return some opaque (`Any`) object - or say that every `data(...)` call returns one of any of the supported types.
I'm open to ideas on this, but the options I can see seem like a worse user experience.
You only need to initialize it once anyway, so it isn't too different from how you'd use the various `PluginRegistry` classes.
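To make the trade-off concrete, here is a toy sketch (not the PR's actual signatures) of why the literal backend name matters for static typing:

```py
from typing import Any, Literal, overload

import pandas as pd
import polars as pl


@overload
def data(name: str, backend: Literal["polars"]) -> pl.DataFrame: ...
@overload
def data(name: str, backend: Literal["pandas"]) -> pd.DataFrame: ...
def data(name: str, backend: str) -> Any:
    """Toy dispatcher: the literal `backend` string lets a checker pick the return type."""
    return pl.DataFrame() if backend == "polars" else pd.DataFrame()
```

If the backend were instead inferred from installed packages at runtime, no overload could be selected statically, so a checker would fall back to `Any` (or a union of every supported frame type).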
@joelostblom
> One of the benefits of having the data sets as part of the altair package would be simplified imports and less setup code for using them. To go further in that direction, I wonder if it would be possible to simplify the user-facing API (or provide a shortcut), so that instead of the current:
>
> ```py
> data = alt.datasets.Loader.with_backend("polars")
> cars = data("cars")
> ```
>
> The shorter version could look something like:
>
> ```py
> cars = alt.datasets.load('cars')
> ```
So, a lot of what I touched on in https://github.com/vega/altair/pull/3631#issuecomment-2480677838 would apply here.
I do agree though that we can shorten things a little bit.
An easy one would be `loader = Loader.with_backend`:
loader
*We could alternatively define it as a function* ![image](https://github.com/user-attachments/assets/a958f5d1-eef9-43c8-900b-a9809611faaf)
Another option, which I suppose is currently an easter egg:
```py
from altair.datasets import data
```
https://github.com/vega/altair/blob/f2823b42c1cd7d4d84a94845a8ccb0d6767756ce/altair/datasets/__init__.py#L323-L327

![image](https://github.com/user-attachments/assets/46046e83-2d73-481b-b763-cbdb24dcb46c)

The above will raise an import error if `pandas` isn't installed though.
> cars = alt.datasets.load('cars')

`load` specifically is an interesting one.
I'd had thoughts about adding a `Loader.(json|load)` method - to provide similar functionality to `(US_10M|World_110M)`. `load` would then align with `json.load`.
> This is similar to what seaborn (`sns.load_dataset('cars')`) and plotly (`px.dataset.cars()`) use as well. Those packages don't have as advanced functionality with the different backends, urls, etc, but it would be great if it was possible to somehow include a convenient shorter syntax (at least for the most common cases). Potentially, the shorter syntax could still be compatible with the methods of `Loader` or accept the most common ones as param names (such as `with_backend` and `url`)?
While I was working on this I looked around for inspiration, finding sklearn.datasets. Quite quickly though I abandoned that thread due to how complex I found it.
Thanks for linking `seaborn` and `plotly`, they are much simpler and more what I was looking for earlier.
Interesting timing however, `plotly` now has backend support as of 3 days ago:

It uses `narwhals` as well (thanks to @FBruzzesi), and works in a pretty similar way to what I've done (minus the typing).
The key part is that there still needs to be an explicit `return_type` - which is used for an import.
@joelostblom for the user-side, this means a `polars` user of `plotly` writes:

```py
import plotly.express as px

px.data.iris(return_type="polars")
px.data.gapminder(return_type="polars")
px.data.stocks(return_type="polars")
```
Whereas the same user in `altair` writes:

```py
from altair.datasets import Loader

data = Loader.with_backend("polars")
data("iris")
data("gapminder")
data("stocks")
```
So although the entry requires a few extra characters, everything afterwards is shorter. This can add up when you consider how many datasets we have available:
alt.datasets._typing.Dataset
https://github.com/vega/altair/blob/f2823b42c1cd7d4d84a94845a8ccb0d6767756ce/altair/datasets/_typing.py#L22-L97
Thanks for the detailed responses! I will break up my replies too to keep them more readable.
I agree that it would be great to solve this in a generalized way if possible. For pandas, it seems like there is no way to do this automatically for CSV files unfortunately. I also tried convert_dtypes and infer_objects, but it seems like none of them will identify and convert dates.
For json it seems like there is a flag in `read_json` to parse all date-like formats automatically, so we could roundtrip to json from CSV if it is not too time consuming. This flag is enabled by default and seems to work well in my testing, e.g.
```py
import pandas as pd

url = 'https://cdn.jsdelivr.net/npm/vega-datasets@v2.9.0/data/unemployment-across-industries.json'
pd.read_json(url).dtypes
```
```
series                object
year                   int64
month                  int64
count                  int64
rate                 float64
date     datetime64[ns, UTC]  <--
```
Turning it off causes `date` to be parsed as an object as expected:

```py
pd.read_json(url, convert_dates=False).dtypes
```
```
series      object
year         int64
month        int64
count        int64
rate       float64
date        object  <--
```
However, some columns that are not considered "date-like" by `read_json` were previously converted to dates, e.g. the integer `Year` column in the `cars` dataframe:
```py
url = 'https://cdn.jsdelivr.net/npm/vega-datasets@v2.9.0/data/cars.json'
pd.read_json(url).dtypes
```
```
Name                 object
Miles_per_Gallon    float64
Cylinders             int64
Displacement        float64
Horsepower          float64
Weight_in_lbs         int64
Acceleration        float64
Year                 object  <--
Origin               object
```
The vega_datasets package converted these to dates:
```py
from vega_datasets import data

data.cars().dtypes
```
```
Name                        object
Miles_per_Gallon           float64
Cylinders                    int64
Displacement               float64
Horsepower                 float64
Weight_in_lbs                int64
Acceleration               float64
Year                datetime64[ns]  <--
Origin                      object
```
I am not sure this conversion is the right thing to do since they really are integers and `1970` is not the same as `1970-01-01`. I believe this was done to make the formatting on the axis look better (i.e. not be formatted as `1,970`), but maybe there are better ways to do this now, e.g. using one of the alternatives suggested here https://github.com/vega/altair/discussions/3140, especially if Vega eventually changes this behavior as per https://github.com/vega/vega/issues/1681
I am less familiar with polars (although I have been using it more lately and enjoy it!), but your suggestion with round-tripping to csv seems reasonable, unless that is very time consuming (e.g. for the large flights dataset it might take long? or does the lazy loading/executing make this less of an issue?).
What about the US_10M and World_110M datasets? They are both loaded as dictionaries rather than dataframes in vega_datasets, so we would probably need to use special logic for those regardless I suppose.
I will reply to 2 and 3 together as my main concern is to have a shorter path of access to the sample data. The automatic type inference you have enabled via the current `Loader.with_backend()` approach is really cool and I want to make it clear that I'm not suggesting to remove it. I think we should keep it for the part of the user base that can take advantage of it effectively.

My suggestion of a shortcut that requires less setup is mainly to cater towards users who are coming from more of a data analytics/science background, but are not as well-versed in programming/software. The focus here might be more on just getting some data to try things out versus understanding what it means to use a loader class and specify a backend. I also think the altair datasets are often used for demo purposes where it is convenient to have less typing and demonstrate how easy it is to access sample data.
> The main issue with this is that it would break IDE completions.
Is it possible to design things so that this is left up to the user? With that I mean that if they want IDE completions, they need to specify a backend, but if they don't, they can still load in the data? Or is this not possible with the `@overload` design?
We do actually also have the native `altair.Data` class. Technically this could maybe be used as the default backend requiring no additional packages installed. However, no data type inference for columns would happen in this case, so this has no user-facing practical advantage over just passing the URL to the spec, since the type suffixes would be required on the column names in the altair spec in either case, and the data would look like a dictionary when displayed in the IDE.
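For illustration (hypothetical usage, assuming such a fallback existed), the `altair.Data` route would look like this - note that the explicit type suffixes are still needed:

```py
import altair as alt

# no dtype inference happens here; the records pass through untyped
cars = alt.Data(values=[{"Year": "1970-01-01", "Miles_per_Gallon": 18.0}])
alt.Chart(cars).mark_point().encode(x="Year:T", y="Miles_per_Gallon:Q")
```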
> Another option, which I suppose is currently an easter egg:
> `from altair.datasets import data`
What a convenient shortcut! Since this already does not involve any IDE completions, could we infer which dataframe package to use here before passing it to `Loader.with_backend()` on line 326? And advertise this approach as a shortcut that is not as flexible/powerful as using the loader directly but sufficient in cases where users just want to get their hands on some data to try things out (so remove the deprecation warning). Possibly also changing the name to `load` instead of `data` since it is already within the `datasets` namespace?
Small note: I noticed that when I try `alt.datasets.Loader.with_backend("pandas")('cars')`, I get an error saying that I need parquet support installed although the cars file is json and there is no parquet reading needed.
> Interesting timing however, plotly now has backend support as of 3 days ago:
Indeed interesting timing! I agree with you that there is not much difference when importing multiple packages and using a non-default backend. However, a px user does have access to simply typing `px.data.stocks()` if they don't care about which backend to use. Currently this default case is much longer in altair with `alt.datasets.Loader.with_backend('pandas')('stocks')` (alternatively broken up over multiple lines with a separate import as in your examples). I think it could be helpful to have an `alt.datasets.load('stocks')` as per the above or even `alt.datasets.stocks()`.
A new separate point: I notice in the docstring for the url class that it requires a backend to be set (or at least it seems like that from the example). Would it be possible to make the URL accessible without having any dataframe package installed since Altair can be used directly with the URL string even if pandas/polars etc are not installed?
Thanks for working on this @dangotbanned. I haven't looked deeply into the implementation, but wanted to share a few thoughts related to the API.
The most common workflow I've seen for the `vega_datasets` module is to import it on one line, and then load a dataset on another line.
```py
from vega_datasets import data

source = data.movies()
```
I think we should strive to make this two line pattern as clear and concise as possible, and I feel like the paradigm of creating a loader object and using that to load the dataset makes this a little more complex. I'm definitely aligned with the desire to allow the type system to propagate the dataset names and DataFrame types, but I'd be interested in talking about the API a bit.
I would personally lean toward the `return_type` style you mentioned that plotly express is using, and have the default value of the argument be `pandas`. Another approach, that might be a little unorthodox, would be to select the backend type as part of the import.
So for pandas:

```py
from altair.datasets.pd import data

df = data("movies")
```

or for polars

```py
from altair.datasets.pl import data

df = data("movies")
```
I think what you have is very close to this, but we'd build the loader for the particular backend on import. Compared to the `return_type` flow, this has the advantage of not requiring you to pass the backend argument repeatedly for the case where you are loading multiple datasets.
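A rough sketch of what such a submodule might contain, assuming the `Loader.with_backend` entry point from this PR (the module and its `data` function are hypothetical):

```py
# hypothetical altair/datasets/pl.py: the backend is chosen when the module is imported
from altair.datasets import Loader

_loader = Loader.with_backend("polars")


def data(name: str, /, **kwds):
    """Load `name` with the polars backend selected at import time."""
    return _loader(name, **kwds)
```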
Thanks for all the feedback everyone!
@jonmmease (https://github.com/vega/altair/pull/3631#issuecomment-2480853741) @joelostblom (https://github.com/vega/altair/pull/3631#issuecomment-2480816377, https://github.com/vega/altair/pull/3631#issuecomment-2480832609, https://github.com/vega/altair/pull/3631#issuecomment-2480832711) @mattijn (https://github.com/vega/altair/discussions/3150#discussioncomment-11280516)
I'm a bit overwhelmed on where to start with responses, but please know that I have read through everything you guys have had to say (at least once 😅) and really appreciate it all.
I will try my best to answer/feed-back early next week. In the meantime, there may be things that I can clarify/give context for by linking to various docstrings in the PR.
Some generally useful bits, either for docs or to try running locally:
An RFC well done 🥳. I like the last approach as suggested by @jonmmease. It's concise and clear. Btw, I can imagine that implementing support for previous tag-versions of the Vega-datasets is useful during development, but would advocate to implement a single version (the most recent) as a static constant. Similar to what we have for the Vega, Vega-Lite and Vega-Embed versions.
> An RFC well done 🥳.
Thanks @mattijn
> Btw, I can imagine that implementing support for previous tag-versions of the Vega-datasets is useful during development, but would advocate to implement a single version (the most recent) as a static constant. Similar to what we have for the Vega, Vega-Lite and Vega-Embed versions.
This one I think I have a short answer for!
One thing to consider is: what does the transition from `vega_datasets` -> `altair.datasets` look like for us?
We're 4 years behind the current version, and a sudden change will break a lot of the docs.
Off the top of my head, `movies` has a lot of renames and some datasets like `iris` were removed.
If we have the `tag` available, we could use that as a pin for any specific examples that stop working.
I think that could smoothen things out until we get everything else caught up - and make it possible to split up the work over time.
Now that I'm thinking about it, I don't think I wrote this down anywhere 🤦‍♂️ - but in my head I thought it would make getting this PR merged easier.
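As a usage example of that pinning idea (using the `tag` argument and `Loader` entry point shown later in this thread):

```py
from altair.datasets import Loader

data = Loader.from_backend("pandas")
# pin a docs example to the column names it was written against
movies = data("movies", tag="v1.24.0")
```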
@joelostblom
> Thanks for the detailed responses! I will break up my replies too to keep them more readable.
1. Data types on read
Pandas
> I agree that it would be great to solve this in a generalized way if possible. For pandas, it seems like there is no way to do this automatically for CSV files unfortunately. I also tried convert_dtypes and infer_objects, but it seems like none of them will identify and convert dates.
Thanks for investigating the `pandas`-side of things.
I think there are a few things we could do upstream in https://github.com/vega/vega-datasets to make this easier for us to get consistent results across dataframe packages. These ideas aren't all mutually-exclusive, but some may remove the need for others:
For this task specifically, having a schema/format string listed for all the datasets would be helpful. Especially when something isn't ISO 8601, e.g:
Ideally this would be defined in a tabular format or JSON - since we could extract that data more reliably than from markdown.
There is a history of various datetime formats changing to resolve parsing errors:
Definitely less likely to be accepted (https://github.com/vega/vega-datasets/pull/628#issuecomment-2476299461) than the other two, but would be the most reliable and fastest (for `python`) option.
I can't seem to get the issue comments to load, but these demonstrate using parquet
Probably blocked until (https://github.com/vega/vega/issues/3961) is resolved, sadly
> Switch to file formats that can encode data types
This sounds like the ideal option to me, great idea! Rather than "switch", maybe we could just add a parquet file for each existing data set in addition to the current file types? Then we could load the parquet with specified dtypes by default in altair, but it would still be possible to load other formats for demonstration purposes, and no existing VL examples would break or need updating. Over time, we could probably reduce the number of datasets that provide non-parquet formats to reduce the overall size of the package/repo.
> One thing to consider is: what does the transition from `vega_datasets` -> `altair.datasets` look like for us?
Good point. I would say, for this PR, we will use the tagged dataset version that is close to the currently used version in our documentation. This way, it will almost not break any examples in the docs and can be added as a feature in a minor release (5.5).
As a follow-up, we can make a PR to update the tagged dataset version to the latest. That change will need updates to all our documentation. I think it makes sense to do this in a major release (6.0), but I am also open to do this in a minor release.
https://github.com/vega/altair/pull/3631#issuecomment-2481605587

> Switch to file formats that can encode data types

https://github.com/vega/altair/pull/3631#issuecomment-2481647641

> This sounds like the ideal option to me, great idea! Rather than "switch", maybe we could just add a parquet file for each existing data set in addition to the current file types? Then we could load the parquet with specified dtypes by default in altair, but it would still be possible to load other formats for demonstration purposes, and no existing VL examples would break or need updating. Over time, we could probably reduce the number of datasets that provide non-parquet formats to reduce the overall size of the package/repo.
@joelostblom
Apologies, I used the wrong term there, your suggestion is actually what I proposed in:
Also there was this (https://github.com/vega/vega-datasets/issues/627#issuecomment-2481321593) which has some extra related context.
So I think we're both in agreement that this would be the best path forward. But we would need @domoritz on board as well (https://github.com/vega/vega-datasets/issues/627#issuecomment-2476410307)
Let me know if there is anything you need from the vega-datasets repo side. Would it make sense for some logic to live there so that it gets updated if we add a new dataset, for example?
@domoritz Would it be reasonable to have a parquet (or other binary format) file available for each of the vega datasets? The primary benefit for us would be that the column data types are stored with the data and we would not need to include custom column dtype conversion logic to e.g. turn strings into dates (which we used to have in the old vega datasets python package)
Hmm, would you want to have that in the npm package or just the repo?
I find the term `Loader` as user-facing API a bit confusing, since I feel I'm not loading urls. I like the current data-source-first approach.
The current approach:
```py
from vega_datasets import data

source_url = data.cars.url
source_pandas = data.cars()
```
Could we consider something like this instead?
```py
from altair.datasets import data
# from vega_datasets import data  # DeprecationError: update import

source_url = data.cars.url
source_pandas = data.cars.to_pandas()
source_polars = data.cars.to_polars()
# source_pandas = data.cars()  # DeprecationWarning: be explicit, use updated syntax
```
This approach would allow `.to_pandas()` to handle any backend-specific nuances, such as `engine="pyarrow"` or `spatial="geopandas"`.
Regarding `backend=pandas["pyarrow"]`, I found it a bit unclear, especially since this option isn't mentioned in the pandas optional dependencies installation guide. The pandas user guide uses `engine="pyarrow"`, so I'd opt for that for consistency.
Edit: The Loader class can still be used as a base class under the hood.
> Would it be reasonable to have a parquet (or other binary format) file available for each of the vega datasets? The primary benefit for us would be that the column data types are stored with the data and we would not need to include custom column dtype conversion logic to e.g. turn strings into dates (which we used to have in the old vega datasets python package)
@joelostblom
Would a less-heavy compromise be doing this for a subset of the datasets?
While I was looking for test cases for (https://github.com/vega/altair/pull/3631#discussion_r1845658286) - it seemed like we might not need this for all of them
test_polars_read_json_roundtrip
https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/tests/test_datasets.py#L463-L481
@domoritz would you feel more comfortable with this if we were talking about either:

All datasets with a ...
- format string column (usually temporal data type)
- format string column, which this PR can demonstrate has inconsistent parsing (e.g. `pandas` != `polars`, etc)
- non-ISO 8601 format string column

> Hmm, would you want to have that in the npm package or just the repo?
@domoritz

If it were just resources that we'd be bundling with `altair`, the `GitHub` repo option would work.
We can support that with very few changes to https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/tools/datasets/github.py
However, the main benefit that `npm` brings for this use case is the lack of rate limiting or need for an auth token.

For `GitHub`, this would be limited to:
I think hitting the unauthenticated limit would probably be quite easy to do in a notebook environment. CI workflows may also face issues, even with the higher limit if the project is large enough.
This could be mitigated by opting-in to caching, but it could be frustrating to encounter the rate limit as a beginner.
Loader.cache_dir
https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/altair/datasets/__init__.py#L290-L311
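For example, the opt-in is a one-liner (mirroring the usage in the `polars`-native example at the end of this thread):

```py
from pathlib import Path

from altair.datasets import Loader

data = Loader.from_backend("polars")
# opt in to caching; repeat requests for the same dataset/version then hit disk, not the network
data.cache_dir = Path.home() / ".altair_cache"
cars = data("cars")
```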
> @joelostblom Would a less-heavy compromise be doing this for a subset of the datasets?
>
> While I was looking for test cases for (https://github.com/vega/altair/pull/3631#discussion_r1845658286) - it seemed like we might not need this for all of them `test_polars_read_json_roundtrip`
>
> @domoritz would you feel more comfortable with this if we were talking about either:
>
> All datasets with a ...
> - format string column (usually temporal data type)
> - format string column, which this PR can demonstrate has inconsistent parsing (e.g. pandas != polars, etc)
> - non-ISO 8601 format string column
Yup, this sounds like a good compromise to me.
I'm supportive of using proper date types and having parquet files in the Vega datasets repo. I'd prefer not to have the parquet files in npm.
> I'm supportive of using proper date types and having parquet files in the Vega datasets repo. I'd prefer not to have the parquet files in npm.
@domoritz is there a solution where we can use `GitHub` via `jsdelivr`?
https://www.jsdelivr.com/documentation#id-github
I'd been confused by what use-case this endpoint would serve that wasn't available via get-a-blob.

I haven't found any mention of rate limits there, so maybe this is a motivating scenario?
In the Vega editor, we download releases btw https://github.com/vega/editor/blob/92d95435062bb8d0b53fe86772432f5eaef51719/scripts/vendor.sh#L19. Do you know whether there are rate limits on releases?
> In the Vega editor, we download releases btw vega/editor@92d9543/scripts/vendor.sh#L19. Do you know whether there are rate limits on releases?
@domoritz
If I've read that correctly, that would be what I'm referring to as `tags` - so I think so?
The rate limit endpoint is free - so you could try it before and after to test?
But the jsdelivr one seems to be something different - while still effectively accessing the same resource?
`GitHub` seems to be accessible using:

Whereas `npm` looks like:
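(For reference, from the jsdelivr docs rather than this comment, the two URL styles are roughly `https://cdn.jsdelivr.net/gh/vega/vega-datasets@{tag}/data/...` for GitHub and `https://cdn.jsdelivr.net/npm/vega-datasets@{version}/data/...` for npm.)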
I did a quick test. Running
```sh
gh api \
  -H "Accept: application/vnd.github+json" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  /rate_limit
```

before and after `yarn` in the Vega editor. Nothing changed about remaining. But I am also just downloading via my ip address so I don't see why this would be measured in the rate limits.
jsdelivr caches as far as I understand.
I think there are plenty of options for downloading so I would be surprised if we are locking ourselves into something.
> I think there are plenty of options for downloading so I would be surprised if we are locking ourselves into something.
@domoritz yeah I think we should go ahead with this route!
While this is being considered, I feel the need to bring up `pyarrow` & `.json` datasets.

I'm doing some gymnastics in that backend to account for only line-delimited JSON being natively supported:
_PyArrowReader
https://github.com/vega/altair/blob/7b3a89e5b5374eb391b7ae73ace219327069f979/altair/datasets/_readers.py#L351-L411
From a quick search, that accounts for 34 datasets.
If we were able to access these as `.parquet` - then `pyarrow` could read everything except the spatial datasets natively.
I know this changes the scope of the request, but I thought I'd ask anyway 😅 All of the backends would benefit from the performance and data types as a bonus.
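For context, a minimal sketch of the kind of workaround involved (assuming the file is a plain JSON array of records; the helper is illustrative, not the PR's `_PyArrowReader` code):

```py
import json
from pathlib import Path

import pyarrow as pa


def read_json_records(source: Path) -> pa.Table:
    """Parse a (non line-delimited) JSON array with the stdlib, then build an Arrow table."""
    records = json.loads(source.read_text())
    return pa.Table.from_pylist(records)
```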
Maybe better is something that can be explained as a "source that is expressed as a dataframe or url".
```py
from altair.datasets import data

source_pandas = data.cars.to_df(engine="pandas")
# data.cars.to_df()  # default `pandas`
# data.cars.url
source_polars = data.cars.to_df(engine="polars")
source_pd_pa = data.cars.to_df(engine="pandas[pyarrow]")  # ? not sure if this is correct
```

For spatial data this can map to:

```py
data.world.to_gdf(engine="geopandas")
```
I like the word `engine` more than `backend`, and it aligns with pandas, where this user-facing approach can still include the Loader class under the hood for all the magic.
(https://github.com/vega/altair/pull/3631#issuecomment-2484826592), (https://github.com/vega/altair/pull/3631#issuecomment-2487526226)
> I find the term `Loader` as user-facing API a bit confusing, since I feel I'm not loading urls. I like the current data-source-first approach.
@mattijn
I'm not convinced that what you've proposed solves the stated problem.

To me, two much simpler options would be:

- Renaming `Loader` to `Source`
- Adding a `url` function to `altair.datasets`
The second option would complement the changes from (https://github.com/vega/altair/pull/3631#discussion_r1847176465) and (https://github.com/vega/altair/pull/3631/commits/7ddb2a8c1e8ec6477cfc646c385e0b168f2fd330) which @jonmmease and @joelostblom are happy with.
It would also open a path to a dataframe-package-free method of url retrieval (https://github.com/vega/altair/pull/3631#discussion_r1846662053). For that part I would like to wait until we know what changes these might have to urls:
I'm reviewing as an average user of Altair, and for this use-case it is probably an associate professor who will need to update all her/his lecture materials the evening before the semester starts.

What would be great is if we could say:
```py
# old way (this is deprecated)
from vega_datasets import data

# new way (this will be awesome)
from altair.datasets import data
```

And everything else is still functioning. So this still works:

```py
source_url = data.cars.url
source_pandas = data.cars()
```
But, the awesome thing that we provide with this PR is:

```py
source_polars = data.cars(backend="polars")  # or `engine=`
```

Or polars with pyarrow dtypes:

```py
source_pl_pa = data.cars(backend="polars[pyarrow]")  # or `engine=`
```
If it is like this then I'm fine with `engine` or `backend` as the argument name. And then within this function we call the agnostic Loader using the dataset and backend choice. All in all, in my humble opinion, awesomeness.
@jonmmease I just tried updating this branch, seems to be some `vegafusion` issues?
Resolved in #3702
Related
Description
Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets.
This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:
- `metadata.parquet`
- `npm` and `github`
- `pandas`
- `"polars"` backend, the slowest I've had on a cache-hit is 0.1s to load

Examples
These all come from the docstrings of:

- `Loader`
- `Loader.from_backend`
- `Loader.__call__`
Tasks
- `altair.datasets`
- `Loader.__call__`
- `_readers._Reader`
- `_typing.Metadata` (Align with revised descriptions from (https://github.com/vega/altair/pull/3631/commits/c572180ebc7d876714a38688c53f7e4af87abd93))
- `tools.datasets`
- Application
- `models.py`
- `github.py`
- `npm.py`
- `semver.py`
- `_PyArrowReader` JSON limitation (https://github.com/vega/altair/pull/3631/commits/3fbc759233fdf0203a2f8685245152732f57276a), (https://github.com/vega/altair/pull/3631/commits/4f5b4de6d894a1297bd2edfaecb72c5eefa48bc7)

Resolved
Investigate bundling metadata
- Investigating bundling metadata (https://github.com/vega/altair/pull/3631/commits/22a50396822dc48d4ed63bae3c8837dc28dab6ad), (https://github.com/vega/altair/pull/3631/commits/17923404866003e27a510be793ab65c290d8802a)
- Depending on how well the compression scales, it might be reasonable to include this for some number of versions
- Deliberately including redundant info early on - can always chip away at it later
`npm` does not have every version available

`GitHub`

- Sources
  - [npm/vega-datasets](https://cdn.jsdelivr.net/npm/vega-datasets/)
    - **Fixed with:** https://data.jsdelivr.com/v1/packages/npm/vega-datasets
  - https://github.com/vega/vega-datasets/tags
- Known missing
  - https://github.com/vega/vega-datasets/releases/tag/v1.9.0
  - https://github.com/vega/vega-datasets/releases/tag/v1.6.0
  - https://github.com/vega/vega-datasets/releases/tag/v1.4.0
- [feat(DRAFT): Add a source for available npm versions](https://github.com/vega/altair/pull/3631/commits/0bbf2e9ec2ff2f1d79b4d4f68128625daab2d947)
- Need to add some handling to invalidate these entries returned from [list-repository-tags](https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags) once confirmed they cannot be requested from `npm`
  - Can technically request from `github`, but during testing this was much slower
  - Also, these versions **would not** have been available from https://github.com/altair-viz/vega_datasets, since that only used `npm`
Plan strategy for user-configurable dataset cache
- Everything so far has been building the tools for a compact bundled index - [1](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/vendor_datasets.py#L394), [2](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/_vega_datasets_data/metadata_full.parquet), [3](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/_vega_datasets_data/metadata_full-schema.json), [4](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/_vega_datasets_data/tags.parquet), [5](https://github.com/vega/altair/blob/686a48599f86cffb49549d72e697c88aa4440d45/tools/_vega_datasets_data/tags-schema.json)
- Refreshing the index would not be included in `altair`, each release would simply ship with changes **baked in**
- Trying to avoid bloating `altair` package size with datasets
- **User-facing**
  - Goal of requesting each unique dataset version **once**
  - The user cache would not need to be updated between `altair` versions
  - Some kind of **opt-in** config to say *store the datasets in this directory please*
  - Basic solution would be defining an env variable like `ALTAIR_DATASETS_DIR`
  - When not provided, always perform remote requests
  - User motivation would be that it would be faster to enable caching
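A minimal sketch of the env-variable opt-in described above (`ALTAIR_DATASETS_DIR` is the proposed name; the helper is illustrative):

```py
from __future__ import annotations

import os
from pathlib import Path


def _cache_dir() -> Path | None:
    """Return the user's opt-in cache directory, or None to always request remotely."""
    if value := os.environ.get("ALTAIR_DATASETS_DIR"):
        return Path(value)
    return None
```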
Deferred
Reducing cache footprint
- `.(csv|tsv|json)` files as `.parquet`
Investigate providing a decorator to add a backend
- `_name: LiteralString`
- `_read_fn: dict[Extension, Callable[..., IntoDataFrameT]]`
- `_scan_fn: dict[_ExtensionScan, Callable[..., IntoFrameT]]`
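A hedged sketch of what such a decorator could look like - the registry and decorator names are illustrative, while the attributes mirror those listed above:

```py
from typing import Any, Callable

_READERS: dict[str, type] = {}


def register_backend(cls: type) -> type:
    """Register a reader class under its `_name` so a loader can look it up later."""
    _READERS[cls._name] = cls
    return cls


@register_backend
class _PandasReader:
    _name = "pandas"
    _read_fn: dict[str, Callable[..., Any]] = {}
    _scan_fn: dict[str, Callable[..., Any]] = {}
```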
Investigate ways to utilize (https://github.com/vega/vega-datasets/blob/main/SOURCES.md)
Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`
- What does a `sha` cover?
- `nw.Expr.(first|last)`
  - `nw.Expr.(head|tail)(1)` not equivalent in a `group_by().agg(...)` context
    - `pandas` -> scalar
    - `polars` -> list
- `pl.Enum` translating to non-ordered `pd.Categorical`

`polars`-native solution

```py
from __future__ import annotations

from pathlib import Path

import polars as pl

from altair.datasets import Loader, _readers

data = Loader.from_backend("polars")

# NOTE: Enable caching, populate with some responses
data.cache_dir = Path.home() / ".altair_cache"
data("cars")
data("cars", tag="v1.5.0")
data("movies")
data("movies", tag="v1.24.0")
data("jobs")

if cache_dir := data.cache_dir:
    cached_stems: tuple[str, ...] = tuple(fp.stem for fp in cache_dir.iterdir())
else:
    msg = "Datasets cache unset"
    raise TypeError(msg)

# NOTE: Lots of redundancies, many urls point to the same data (sha)
>>> pl.read_parquet(_readers._METADATA).shape  # (2879, 9)

# NOTE: Version range per sha
tag_sort: pl.Expr = pl.col("tag").sort()
tag_range: pl.Expr = pl.concat_str(tag_sort.first(), tag_sort.last(), separator=" - ")

# NOTE: Producing a name only when the file is already in the cache
file_name: pl.Expr = pl.when(pl.col("sha").is_in(cached_stems)).then(
    pl.concat_str("sha", "suffix")
)

cache_summary: pl.DataFrame = (
    pl.scan_parquet(_readers._METADATA)
    .group_by("dataset_name", "suffix", "size", "sha")
    .agg(tag_range=tag_range)
    .select(pl.exclude("sha"), file_name=file_name)
    .sort("dataset_name", "size")
    .collect()
)

>>> cache_summary.shape  # (116, 5)

>>> cache_summary.head(10)
```
```
shape: (10, 5)
┌───────────────┬────────┬─────────┬───────────────────┬─────────────────────────────────┐
│ dataset_name  ┆ suffix ┆ size    ┆ tag_range         ┆ file_name                       │
│ ---           ┆ ---    ┆ ---     ┆ ---               ┆ ---                             │
│ str           ┆ str    ┆ i64     ┆ str               ┆ str                             │
╞═══════════════╪════════╪═════════╪═══════════════════╪═════════════════════════════════╡
│ 7zip          ┆ .png   ┆ 3969    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ airports      ┆ .csv   ┆ 210365  ┆ v1.5.0 - v2.10.0  ┆ 608ba6d51fa70584c3fa1d31eb9453… │
│ annual-precip ┆ .json  ┆ 266265  ┆ v1.29.0 - v2.10.0 ┆ null                            │
│ anscombe      ┆ .json  ┆ 1703    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ barley        ┆ .json  ┆ 8487    ┆ v1.5.0 - v2.10.0  ┆ 8dc50de2509b6e197ce95c24c98f90… │
│ birdstrikes   ┆ .csv   ┆ 1223329 ┆ v2.0.0 - v2.10.0  ┆ null                            │
│ birdstrikes   ┆ .json  ┆ 4183924 ┆ v1.5.0 - v1.31.1  ┆ null                            │
│ budget        ┆ .json  ┆ 374289  ┆ v1.5.0 - v2.8.1   ┆ null                            │
│ budget        ┆ .json  ┆ 391353  ┆ v2.9.0 - v2.10.0  ┆ null                            │
│ budgets       ┆ .json  ┆ 18079   ┆ v1.5.0 - v2.10.0  ┆ 8a909e24f698a3b0f6c637c30ec95e… │
└───────────────┴────────┴─────────┴───────────────────┴─────────────────────────────────┘
```