pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Built-in datasets and a function to load them #17896

Closed Zaf4 closed 2 months ago

Zaf4 commented 2 months ago

Description

Some built-in datasets would be very useful for documentation, tutorials, and trying out functionality.

For example:

df = pl.datasets.load('starbucks'), or, following the seaborn approach, pl.load_dataset('starbucks'). The former makes it possible to add a function like pl.datasets.names() or pl.datasets.available(). Lazy loading could be an optional parameter with a default value of False.

Datasets could be bundled directly with the library (for offline use), or they could be loaded from the repo by simply calling pl.read_ipc('https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/nation.feather')
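A middle ground between bundling the files and always downloading them would be a local cache: fetch a dataset once, then read it from disk afterwards. Below is a minimal sketch of that idea, not an existing Polars feature. The names cached_path, load_cached and CACHE_DIR, and the cache location itself, are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical cache location; Polars itself does not define one.
CACHE_DIR = Path.home() / ".cache" / "polars_datasets"


def cached_path(name: str, url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Return a local copy of the file at `url`, downloading it only once."""
    suffix = Path(url).suffix  # keep the extension, e.g. ".feather"
    target = cache_dir / f"{name}{suffix}"
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, target)  # network access only on the first call
    return target


def load_cached(name: str, url: str, lazy: bool = False):
    """Read the cached file with Polars; works offline after the first call."""
    import polars as pl  # imported here so cached_path works without Polars

    path = cached_path(name, url)
    return pl.scan_ipc(path) if lazy else pl.read_ipc(path)
```

With this, something like load_cached('nation', <url of nation.feather>) would need the network only the first time it is called.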

While some datasets are already available, the collection could be expanded with publicly available datasets that cover different test cases.

I believe datasets should be consistently in one single format (avoiding having to cover every case), preferably parquet or feather for fast loading and smaller file sizes.

Zaf4 commented 2 months ago

Roughly something like the following...

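import polars as pl

# Dataset name -> URL of the Feather (Arrow IPC) file in the Polars repo.
DATASETS = {
    'nation': 'https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/nation.feather',
    'supplier': 'https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/supplier.feather',
}


def load(name: str, lazy: bool = False):
    """Load a dataset as a DataFrame, or as a LazyFrame if lazy=True."""
    if name not in DATASETS:
        raise KeyError(f"unknown dataset {name!r}; choose from {available()}")
    func = pl.scan_ipc if lazy else pl.read_ipc
    return func(DATASETS[name])


def available():
    """Return the names of all available datasets."""
    return list(DATASETS)


load('nation', lazy=True)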

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

Ipc SCAN [https://github.com/pola-rs/polars/raw/main/examples/datasets/tpc_heads/nation.feather]
PROJECT */4 COLUMNS

load('supplier').head()
shape: (5, 7)
s_suppkey | s_name | s_address | s_nationkey | s_phone | s_acctbal | s_comment
i64 | str | str | i64 | str | f64 | str
1 | "Supplier#000000001" | " N kD4on9OM Ipw3,gf0JBoQDd7tgr… | 17 | "27-918-335-1736" | 5755.94 | "each slyly above the careful"
2 | "Supplier#000000002" | "89eJ5ksX3ImxJQBvxObC," | 5 | "15-679-861-2259" | 4032.68 | " slyly bold instructions. idle…
3 | "Supplier#000000003" | "q1,G3Pj6OjIuUYfUoH18BFTKP5aU9b… | 1 | "11-383-516-1199" | 4192.4 | "blithely silent requests after…
4 | "Supplier#000000004" | "Bk7ah4CK8SYQTepEmvMkkgMwg" | 15 | "25-843-787-7479" | 4641.08 | "riously even requests above th…
5 | "Supplier#000000005" | "Gcdm2rJRzl5qlTVzc" | 11 | "21-151-690-3663" | -283.84 | ". slyly regular pinto bea"

available()
['nation', 'supplier']