Closed Zaf4 closed 2 months ago
Roughly something like the following...
import polars as pl
DATASETS = {
    'nation': 'https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/nation.feather',
    'supplier': 'https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/supplier.feather',
}

def load(name: str, lazy: bool = False):
    # Scan lazily or read eagerly, depending on `lazy`.
    func = pl.scan_ipc if lazy else pl.read_ipc
    # Index directly so an unknown name raises KeyError instead of passing None on.
    return func(DATASETS[name])

def available():
    # Names of all registered datasets.
    return list(DATASETS.keys())
load('nation', lazy=True)
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
Ipc SCAN [https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/nation.feather]
PROJECT */4 COLUMNS
load('supplier').head()
s_suppkey | s_name | s_address | s_nationkey | s_phone | s_acctbal | s_comment |
---|---|---|---|---|---|---|
i64 | str | str | i64 | str | f64 | str |
1 | "Supplier#000000001" | " N kD4on9OM Ipw3,gf0JBoQDd7tgr… | 17 | "27-918-335-1736" | 5755.94 | "each slyly above the careful" |
2 | "Supplier#000000002" | "89eJ5ksX3ImxJQBvxObC," | 5 | "15-679-861-2259" | 4032.68 | " slyly bold instructions. idle… |
3 | "Supplier#000000003" | "q1,G3Pj6OjIuUYfUoH18BFTKP5aU9b… | 1 | "11-383-516-1199" | 4192.4 | "blithely silent requests after… |
4 | "Supplier#000000004" | "Bk7ah4CK8SYQTepEmvMkkgMwg" | 15 | "25-843-787-7479" | 4641.08 | "riously even requests above th… |
5 | "Supplier#000000005" | "Gcdm2rJRzl5qlTVzc" | 11 | "21-151-690-3663" | -283.84 | ". slyly regular pinto bea" |
available()
['nation', 'supplier']
Description
Some built-in datasets would be very useful for documentation, tutorials, and trying out functionality.
For example:
df = pl.datasets.load('starbucks')
or, following the seaborn approach, pl.load_dataset('starbucks')
The former, however, makes it possible to have a function like pl.datasets.names()
or pl.datasets.available()
Lazy could be an optional parameter with a default value of False. Datasets could either be bundled with the library (for offline use) or loaded from the repo by simply calling
pl.read_ipc('https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/nation.feather')
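The choice between a bundled copy and a download from the repo could be sketched roughly as below. This is only an illustration under stated assumptions: `BUNDLED_DIR`, `resolve_source`, and this `load` are hypothetical names, not part of Polars; the only Polars calls used are `pl.read_ipc` and `pl.scan_ipc`, as elsewhere in this issue.

```python
from pathlib import Path

# Hypothetical registry; the URL is the one quoted in this issue.
DATASETS = {
    "nation": "https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/nation.feather",
}

# Assumed directory for files shipped with the library (hypothetical).
BUNDLED_DIR = Path("datasets")


def resolve_source(name: str, bundled_dir: Path = BUNDLED_DIR) -> str:
    """Return the path of a bundled copy if one exists, else the remote URL."""
    if name not in DATASETS:
        raise KeyError(f"unknown dataset {name!r}; available: {sorted(DATASETS)}")
    url = DATASETS[name]
    # The bundled file is assumed to keep the file name used in the URL.
    local = bundled_dir / url.rsplit("/", 1)[-1]
    return str(local) if local.is_file() else url


def load(name: str, lazy: bool = False, bundled_dir: Path = BUNDLED_DIR):
    """Load a dataset eagerly (DataFrame) or lazily (LazyFrame)."""
    # Deferred import so resolve_source can be used without Polars installed.
    import polars as pl

    reader = pl.scan_ipc if lazy else pl.read_ipc
    return reader(resolve_source(name, bundled_dir))
```

With this layout, `load('nation')` would read the bundled file when it is present and only touch the network otherwise.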
Some datasets are already available; the collection could be expanded with publicly available datasets covering different test cases.
I believe all datasets should be in one single format (to avoid having to cover every case), preferably Parquet or Feather for fast loading and small file sizes.
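One way to keep the single-format rule from drifting would be a small check over the registry, for instance in a test. The sketch below is an assumption about how that could look; `dataset_formats` and `assert_single_format` are hypothetical helpers and use only the standard library.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def dataset_formats(registry: dict[str, str]) -> set[str]:
    """Collect the file extensions used by the registered dataset URLs."""
    return {
        PurePosixPath(urlparse(url).path).suffix.lower()
        for url in registry.values()
    }


def assert_single_format(registry: dict[str, str]) -> None:
    """Raise ValueError if the registry mixes more than one file format."""
    formats = dataset_formats(registry)
    if len(formats) > 1:
        raise ValueError(f"mixed dataset formats: {sorted(formats)}")


# The two URLs quoted in this issue.
registry = {
    "nation": "https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/nation.feather",
    "supplier": "https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/supplier.feather",
}

assert_single_format(registry)
print(dataset_formats(registry))  # {'.feather'}
```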