Open maximlt opened 4 months ago
Great write-up!
'hvsampledata' would make it a bit clearer that this is not data about any HoloViz thing, but a sample dataset to work with alongside HoloViz things. Although it's more of a mouthful, 'hvsampledata' would show up after hvplot in your editor :)
I like hvsampledata too.
I'm wondering about the API options:
df = hvplot.datasets.penguins()
df = hvplot.datasets.get_penguins()
df = hvplot.datasets.penguins # property
df = hvplot.datasets.penguins.load()
df = hvplot.datasets.load("penguins")
df = hvplot.datasets.use("penguins").load() # consider lazy load for some example use cases
I'd also like to encourage discoverability within the IDE with autocomplete, e.g. hvplot.datasets.penguins() is easier to discover than hvplot.datasets.load("penguins").
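For illustration, a minimal sketch of what the callable-style API could look like. All names here (the registry helper, the inline CSV) are assumptions for this example; the real package would ship or download actual data files:

```python
import io

import pandas as pd

# Tiny inline stand-in for a real packaged CSV file (illustrative only).
_PENGUINS_CSV = """species,island,bill_length_mm
Adelie,Torgersen,39.1
Gentoo,Biscoe,46.1
"""

_REGISTRY = {}

def _register(fn):
    """Collect dataset loaders so both tab-completion and load("name") work."""
    _REGISTRY[fn.__name__] = fn
    return fn

@_register
def penguins():
    """Palmer penguins sample (the docstring can link to the dataset page)."""
    return pd.read_csv(io.StringIO(_PENGUINS_CSV))

def load(name):
    """String-based access for programmatic use, e.g. load("penguins")."""
    return _REGISTRY[name]()
```

The callable form keeps autocompletion and gives each dataset a docstring, while the registry still allows string-based lookup for the `load("penguins")` style.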
I am 100% in favor of streamlining and simplifying how we access datasets. However, I'm not sure how the proposed datasets package would work, given the size of the datasets we use, e.g. for Datashader examples. Are you proposing that all the datasets would actually be stored within a single conda package? Seems like that would be an impractically large conda package. Maybe you could list the number and size of the datasets you'd expect to be included there?
Also, adding a new package is problematic because it would then need to go onto Anaconda Defaults for a defaults user to be able to get the examples, presumably? HoloViz tools in general are on defaults, which means anything they require also needs to be there. I don't think defaults is going to want a huge package, nor to add another package without a good reason.
Maybe we could split the difference and have a package on conda-forge but then have HoloViews have a function like the one in Bokeh, first checking if the datasets package is installed and if not, downloading that data directly from the internet?
We could use .csv.gz instead of .csv, or similar. To be honest, I actually prefer a website describing a set of datasets and their URLs instead of a package. When I was a newbie I found those packages a bit like hocus pocus, probably because in the old days there was no description of the dataset and no type annotations, i.e. you did not really know what you got.
@jbednar, the small datasets will be in the package, but larger datasets will need to be downloaded and cached.
Also, a +1 on hvsampledata
My opinion is that the package should have no dependencies at all, and let it be up to the other packages to specify what they need along with hvsampledata. Currently, pandas is a dependency of most of our libraries, but this could change in the future. An option could also be to have a backend keyword, which has a default depending on what library you have installed, with the option to specify it if needed: hvplot.datasets.penguins(backend="polars").
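A backend keyword could be dispatched lazily, so polars is only imported when explicitly requested and stays optional. A rough sketch (the function name and inline data are illustrative assumptions):

```python
import io

import pandas as pd

# Tiny inline stand-in for a real packaged dataset file.
_CSV = "x,y\n1,2\n3,4\n"

def penguins(backend=None):
    """Return the dataset with the requested DataFrame library.

    Hypothetical sketch: defaults to pandas; polars is imported
    lazily so it remains an optional dependency.
    """
    if backend in (None, "pandas"):
        return pd.read_csv(io.StringIO(_CSV))
    if backend == "polars":
        import polars as pl  # only needed when explicitly requested
        return pl.read_csv(io.StringIO(_CSV))
    raise ValueError(f"unknown backend: {backend!r}")
```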
I like hvsampledata too.
However, I'm not sure how the proposed datasets package would work, given the size of the datasets we use, e.g. for Datashader examples. Are you proposing that all the datasets would actually be stored within a single conda package? Seems like that would be an impractically large conda package.
I'm thinking about shipping the typically small datasets that are used on the website of hvPlot or Panel, like iris, penguins, gapminders, air_temperature, etc. But the package can expose larger datasets that will be downloaded on the fly, and cached.
Maybe you could list the number and size of the datasets you'd expect to be included there?
Yes this list is needed. Hm maybe each project maintainer could come up with their wishlist?
Also, adding a new package is problematic because it would then need to go onto Anaconda Defaults for a defaults user to be able to get the examples, presumably? HoloViz tools in general are on defaults, which means anything they require also needs to be there. I don't think defaults is going to want a huge package, nor to add another package without a good reason.
For what it's worth, vega_datasets is on defaults (https://anaconda.org/anaconda/vega_datasets). I hope it's not going to be too hard to convince them to add a new package that doesn't require much maintenance. Most HoloViz users are anyway getting their packages from PyPI or conda-forge; we can serve those without a problem.
For the larger datasets, hvsampledata would first try to load the large dataset from a dedicated large-data package, and fall back to downloading it from the web if that package is not installed. It would be an optional dependency that would help air-gapped users.

hvsampledata could also expose functions that return these dummy datasets, for us to avoid having to copy/paste this code in multiple places:

import numpy as np
import pandas as pd

num = 10000
np.random.seed(1)
# Five clustered 2D normal distributions with different spreads.
dists = {
    cat: pd.DataFrame({
        "x": np.random.normal(x, s, num),
        "y": np.random.normal(y, s, num),
        "val": val,
        "cat": cat,
    })
    for x, y, s, val, cat in [
        ( 2,  2, 0.03, 10, "d1"),
        ( 2, -2, 0.10, 20, "d2"),
        (-2, -2, 0.50, 30, "d3"),
        (-2,  2, 1.00, 40, "d4"),
        ( 0,  0, 3.00, 50, "d5"),
    ]
}
df = pd.concat(dists, ignore_index=True)
df["cat"] = df["cat"].astype("category")
To be honest I actually prefer a web site describing a set of datasets and their urls instead of a package.
Bokeh does a pretty good job at showing what the datasets contain: https://docs.bokeh.org/en/latest/docs/reference/sampledata.html. By making the datasets available through a callable, we can add in their docstring a link to the site so you can quickly have access to more information.
Currently, pandas is a dependency of most of our libraries, but this could change in the future. An option could also be to have a backend keyword, which has a default depending on what library you have installed, and add the option to specify it if needed: hvplot.datasets.penguins(backend="polars").
Yep, that's why I like to make the datasets available through a callable, it gives us some more flexibility.
Thanks for the excellent writeup @maximlt!
My opinion is that the packages should have no dependencies at all. And let it be up to the other packages to specify the packages needed
I agree with @Hoxbro on both counts: I think this package should be free to serve any format and any package using it should decide which ones they care about (which will probably be listed as dependencies already anyway).
I'm thinking about shipping the typically small datasets that are used on the website of hvPlot or Panel, like iris, penguins, gapminders, air_temperature, etc. But the package can expose larger datasets that will be downloaded on the fly, and cached.
Makes sense to me!
if we really want, we could have another package that ships larger datasets. hvsampledata would first try to load the large dataset from this package, and fall back to downloading it from the web if the large data package is not installed. It would be an optional dependency that would help air-gapped users.
Yes, good idea. This is an optional extension we should plan for even if we have trouble publishing such a large package. Note that maybe we don't necessarily need to publish a large data package: we could get the package to air-gapped users some other way if they need it.
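The fallback could be as simple as attempting the optional import first. A sketch, where hvsampledata_large is a hypothetical companion package name and the download fallback is passed in:

```python
import importlib

def load_large(name, fallback):
    """Prefer the optional large-data package if installed;
    otherwise call the download fallback for air-gapped-friendly behavior.

    Hypothetical sketch: hvsampledata_large and its load() are assumptions.
    """
    try:
        pkg = importlib.import_module("hvsampledata_large")
    except ImportError:
        return fallback(name)
    return pkg.load(name)
```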
As for the naming, I hate all the names so I won't get involved. :-)
I suppose a long and ugly (but descriptive) name like hvsampledata is probably going to be the best we can do.
+1 on hvsampledata
+5 on hvsampledata
I initially had a concern that this doesn't say "holoviz", so that people looking at it in their environments or in the packages included in Anaconda Distribution will have no idea what it might be (whereas vega_datasets is clearly related to vega).

But I think someone pointed out above that it would sort next to hvplot, which will nearly always be in the packages list since (a) conda defaults includes hvplot, and (b) we steer nearly everyone to install hvplot rather than holoviews. So my guess is if they see hvplot and hvsampledata they might well guess that this is to do with hvplot.
The alternatives I'd imagine are hvdatasets (maybe no better than hvsampledata) and a name that doesn't have hv in it at all, documenting it as a handy collection of sample data for any project to use, not just HoloViz.
a name that doesn't have hv in it at all, documenting it as a handy collection of sample data for any project to use, not just HoloViz.
I wouldn't be so much in favor of that, to avoid increasing our maintenance burden. Besides that, it sounds like you're not against hvsampledata, even if ideally you'd prefer if holoviz was in the name (holoviz_sampledata?). We can decide on the name in one of the next HoloViz meetings.
Next, I'll submit a list of the most used datasets currently across the HoloViz websites, to figure out which lightweight packages we should expose.
So after some not-perfect regex + pandas, here's the breakdown per package of the types/name of the datasets used:
geoviews:
type target
bokeh.sampledata airport_routes 4
pandas '../../assets/cities.csv', 4
xarray.tutorial rasm 4
pandas '../../assets/referendum.csv')\n" 2
'../assets/referendum.csv')\n" 2
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
holoviews:
type target
bokeh.sampledata iris 10
autompg 7
pandas 'http://assets.holoviews.org/macro.csv', 7
bokeh.sampledata airport_routes 6
stocks 6
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
hvplot:
type target
xarray.tutorial air_temperature 24
bokeh.sampledata penguins 12
iris 8
autompg 8
hvplot us_crime 5
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
lumen:
type target
pandas cache_path, 1
path), 1
url).sample(5).reset_index(drop=True).to_csv( 1
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
panel:
type target
bokeh.sampledata autompg 11
population 7
pandas "https://assets.holoviz.org/panel/tutorials/turbines.csv.gz" 7
bokeh.sampledata airport_routes 3
datasets.holoviz 'https://datasets.holoviz.org/penguins/v1/penguins.csv' 3
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
And the 15 most used datasets:
type target
xarray.tutorial air_temperature 26
bokeh.sampledata autompg 26
iris 21
airport_routes 14
stocks 13
penguins 12
population 7
pandas 'http://assets.holoviews.org/macro.csv', 7
"https://assets.holoviz.org/panel/tutorials/turbines.csv.gz" 7
xarray.tutorial rasm 7
bokeh.sampledata movies_data 5
unemployment 5
us_counties 5
hvplot us_crime 5
bokeh.sampledata periodic_table 4
Pretty sure the analysis isn't perfect and is somewhat missing/not grouping the files uploaded on our S3. I am sure the penguins and windturbines datasets on S3 are used enough times that they should appear in the results above. Anyway, I think it's enough to get started and discuss which datasets should be added to hvsampledata.
Supersedes https://github.com/holoviz/hvplot/issues/1274
Context
The HoloViz tools are very much data-oriented; pretty much every example across the board starts by either loading a dataset or creating one on the fly. Loading datasets is done in various ways:

- bokeh.sampledata: Bokeh ships some data and downloads some others. I don't know when we rely on the latter, but for sure we run bokeh sampledata quite often on the CI (to download these non-shipped datasets).
- xarray.tutorial.open_dataset
- hvPlot's sample_data module, which exposes an Intake catalog and its datasets. It needs the optional deps intake, intake_parquet, intake_xarray and s3fs. The catalog is shipped with hvPlot but the datasets are fetched from the internet.
- https://datasets.holoviz.org (e.g. https://datasets.holoviz.org/penguins/v1/penguins.csv)
- Files in the repositories (e.g. occupancy.csv in panel/examples/assets)

This isn't great.
There has to be a better way :) We started to discuss how to improve the situation by creating a new HoloViz package dedicated to giving a unified and easier approach to loading datasets.
How others do it
Bokeh
Bokeh has a sampledata sub-package that gives access to its datasets. They ship 24 files (CSVs mostly). More datasets are available after being downloaded via either the CLI bokeh sampledata or the Python API bokeh.sampledata.download(). Downloaded files are stored in $HOME/.bokeh/data (it can be configured). Datasets are in their sub-modules, sometimes grouped.

Plotly
Plotly has a data sub-package that gives access to 11 datasets. They are all csv.gz files and are all shipped.

Altair
The vega-datasets package contains 17 shipped datasets and gives access to many more datasets that are downloaded from vega's CDN. Downloaded datasets are not cached. vega-datasets is not a dependency of altair; it has Pandas as its unique direct dependency. Its code and API are a little more elaborate.

Matplotlib
Matplotlib ships with about 15 datasets. The API (matplotlib.cbook.get_sample_data) is quite rudimentary.
Seaborn
Seaborn doesn't ship any datasets. Instead it offers the high-level load_dataset function that fetches files (CSVs mostly) from the GitHub repository https://github.com/mwaskom/seaborn-data. Datasets are cached by default once downloaded. The cache location can be accessed via sns.get_data_home() and controlled by passing an argument to load_dataset or setting the env var SEABORN_DATA.

Xarray
Xarray has implemented an API similar to seaborn's, available in the xarray.tutorial module. The main difference with seaborn is that running open/load_dataset requires pooch, which can be installed with pip install xarray[io] along with other deps like netCDF4, zarr or fsspec. Pooch itself depends on platformdirs, packaging and requests. xarray.tutorial also has the scatter_example_dataset function that just generates a Dataset using Numpy functions.

Proposal
Name
Simon already reserved hvdata on PyPI, though I'm not sure we reached an agreement on the name of this package? I think Andrew called it hvdataset in his original issue (https://github.com/holoviz/hvplot/issues/1274). I find hvdata a little too close to hvplot in spirit and not super explicit. I prefer hvdatasets, even if I'm a bit annoyed it'll show up before hvplot in my editor.

API
Considering all the approaches above, I think my favorite API is when the datasets are available via a callable (plotly, vega-datasets) that returns a data object, as this enables autocompletion and allows us to augment their signature with optional parameters.
I also like what Plotly does by providing a unique interface to their datasets for Plotly and Plotly express.
Datasets
The package will ship a list (to be determined) of small datasets. They should be as small as possible. We need to watch the overall package size, setting a limit upfront would be a good idea.
Other datasets can be downloaded on the fly. They're cached.
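The download-and-cache logic can stay dependency-free with just the standard library. A sketch, where the cache location and env var name are assumptions (the real package might use platformdirs instead):

```python
import os
import urllib.request
from pathlib import Path

# Hypothetical cache location, overridable via an env var (names are assumptions).
CACHE_DIR = Path(os.environ.get("HVSAMPLEDATA_CACHE",
                                Path.home() / ".cache" / "hvsampledata"))

def fetch(name, url):
    """Download `url` into the cache once; later calls reuse the cached file."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / name
    if not path.exists():
        # Download to a temp name first so interrupted downloads
        # never leave a partial file under the final name.
        tmp = path.with_name(path.name + ".part")
        urllib.request.urlretrieve(url, tmp)
        tmp.rename(path)
    return path
```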
Dependencies
I'm not sure. None, or maybe just pandas? platformdirs is good for finding the right cache paths on various platforms, but I think we can do without it. The package will have optional dependencies that will be required for its test suite to pass (e.g. xarray).

If we want to increase the likelihood for users to be able to run code snippets without getting any errors, then the package should be added as a direct dependency to the HoloViz packages that use it? After all, Plotly, Bokeh and Matplotlib all ship some datasets with their package (and others like plotnine, and possibly many more, do too), so why not us?

Unfortunately, I don't think there's a way yet in the pip world to define default extras. But if there was, I'd go with that option, making a hypothetical default datasets extra, with an option for users not to install it (e.g. to reduce their environment size). If we don't make it a direct dependency of some packages, then their documentation will have to explain that it needs to be installed (either directly or via a new extra or the existing recommended one).
Documentation
The package will have a simple website built with the usual Sphinx stack. It'll list all the datasets it contains, with a description and their license.