pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

split/apply/combine capability #229

Closed orioles79 closed 3 years ago

orioles79 commented 3 years ago

I have cases/fatalities COVID data from Johns Hopkins that I've been able to work through easily with R data.table, pandas DataFrames, and Julia DataFrames. I've started with pypolars but seem to have hit a snag with the split/apply/combine functionality. The "key" to the dataframe I'm working with is the combination (uid, date); the data that JHU provides for that key are cumulative cases and fatalities. I need to split the data on uid, then for each grouping compute daily cases/fatalities as the difference of successive cumulatives in date order, returning all records. I would simply like to define a function, much like ones I've written with pandas/Julia, that looks like this:

def mkdiff(cumcases):
    return(np.hstack([np.array(np.nan),np.diff(np.array(cumcases),1)]))

Is there an apply function that can accept a Python routine like the one above and allow me to write something like this:

jhudata.apply({"cumcases":["mkdiff"],"cumdeaths":["mkdiff"]})

Also, are DataFrame sorts and joins limited to a single-column key?

Finally, are there functions to easily rename and reorder columns in a pypolars DataFrame?

ritchie46 commented 3 years ago

A Polars Series has an apply method which allows you to run a Python function over a Series. Read more in the book. However, the mkdiff function could be replaced by something like this:

s = pl.Series("foo", [1, 5, 10])
s.shift(-1) - s
Series: 'foo' [i64]
[
    4
    5
    null
]
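
For the Series.apply route itself, here is a minimal sketch (an illustration only; depending on the version, apply may also take an output dtype argument, and the lambda here is just a stand-in for your own function):

s = pl.Series("foo", [1, 5, 10])
# apply runs the Python function on each element of the Series
doubled = s.apply(lambda x: x * 2)

The shift-based expression above stays in native code and avoids the per-element Python calls, so it is generally the faster option.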

Finally, are there functions to easily rename and reorder columns in a pypolars DataFrame?

You can reorder a DataFrame by indexing the new column order.

Reorder

df = pl.DataFrame({"a": [1, 2], "b": [2, 3]})
df_new = df[["b", "a"]]

Rename

Renaming can be done by overwriting the columns attribute:

df.columns = ["new_a", "new_b"]

Also, are DataFrame sorts and joins limited to a single-column key?

Sorts are limited to a single column (what would be the order of two column values?). Joins are limited to a single column for the moment, but could easily be extended to multiple columns. For now, you could try a hack by creating a dummy column of utf8 values (make sure you've got py-polars >= 0.1.1):

df_a = pl.DataFrame({"a": [1,  2,  1,  1], "b": ["a", "b", "c", "c"],  "c": np.arange(4)})
df_b = pl.DataFrame({"foo": [1, 1, 1], "bar": ["a", "c", "c"], "ham": ["let", "var", "const"]})

df_a["dummy"] = df_a["a"].cast(str) + df_a["b"].cast(str)
df_b["dummy"] = df_b["foo"].cast(str) + df_b["bar"].cast(str)

df_a.join(df_b, on="dummy", how="left")
orioles79 commented 3 years ago

Thanks a lot for the response. Very helpful.

Much of my difficulty may be in versions of the library and docs.

So here's my dataframe, jhucombinem (shape, variable names, and variable types). The data are sorted by uid, then by date within uid. I had created the dummy key by concatenating the two. I wish to group by uid, then compute the difference in cumcases and cumdeaths for each date. Though I'm interested in the pypolars functions, I'd really like to do these calculations using python/numpy.

(1132260, 5) ["key---<class 'pypolars.datatypes.Utf8'>", "uid---<class 'pypolars.datatypes.Int64'>", "date---<class 'pypolars.datatypes.Date32'>", "cumdeaths---<class 'pypolars.datatypes.Int64'>", "cumcases---<class 'pypolars.datatypes.Int64'>"]


pp.__version__

AttributeError                            Traceback (most recent call last)
----> 1 pp.__version__
AttributeError: module 'pypolars' has no attribute '__version__'

Here are my functions:

def mkdiff(s):
    return(np.hstack([np.array(np.nan),np.diff(np.array(s),1)]))

mymkdiff = pp.udf(mkdiff, output_type=pp.datatypes.Float64)

gvars = ['uid']
slug = jhucombinem.groupby(gvars).apply(mymkdiff)

AttributeError                            Traceback (most recent call last)
      1 gvars = ['uid']
----> 2 slug = jhucombinem.groupby(gvars).apply(mymkdiff)
AttributeError: 'GroupBy' object has no attribute 'apply'

Hope this is helpful. I like what I'm seeing with pypolars and will write a blog on it if things work out. These data provide a nice test which I'd be happy to share with you for the docs: both the cases and fatalities files must be downloaded from the web, melted, joined, and grouped. Thanks for the help.
ritchie46 commented 3 years ago

I've updated py-polars to better accommodate your problem for applying custom functions after a groupby operation.

I don't know if you are using the lazy or eager API, so I will give you examples of both.

Lazy

df = pl.DataFrame({"a": [1,  2,  1,  1], 
                   "b": ["a", "b", "c", "c"]})

def mkdiff(cumcases):
    out = np.hstack([np.array(np.nan),np.diff(np.array(cumcases),1)])
    return pl.Series(out)  # make sure to return a Series!

(df.lazy()
     .groupby("b")
     .agg([col("a").apply_groups(mkdiff)])
     .sort("b")
     .collect()
)

Eager

df.groupby("b").select("a").apply(mkdiff).sort("b")

Both output

shape: (3, 2)
╭─────┬──────────────╮
│ b   ┆ a            │
│ --- ┆ ---          │
│ str ┆ list [null]  │
╞═════╪══════════════╡
│ a   ┆ "[NaN]"      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ "[NaN]"      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c   ┆ "[NaN, 0.0]" │
╰─────┴──────────────╯

If you want to group by, apply a function on the groups, and join the result back to the main DataFrame, I would recommend using the window functions.

df = pl.DataFrame(
    {
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

def concat_fruits(x: pl.Series) -> str:
    concat = ""
    for fruit in x:
        concat += fruit
    return concat

windows = df.lazy().with_columns(
    [
        col("fruits").apply_groups(concat_fruits, dtype_out=str).over("cars").alias("custom")
    ]
)

print(windows.sort("fruits").collect())

Outputs

shape: (5, 3)
╭────────┬────────┬────────────────────────╮
│ fruits ┆ cars   ┆ custom                 │
│ ---    ┆ ---    ┆ ---                    │
│ str    ┆ str    ┆ str                    │
╞════════╪════════╪════════════════════════╡
│ apple  ┆ beetle ┆ bananaappleapplebanana │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ beetle ┆ bananaappleapplebanana │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ beetle ┆ bananaappleapplebanana │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ audi   ┆ banana                 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ beetle ┆ bananaappleapplebanana │
╰────────┴────────┴────────────────────────╯

Various

Custom functions

I'd recommend always setting the dtype_out value when dealing with custom group functions. Polars is not based on numpy and tries to determine upfront what the type of the output Series is. This makes custom functions faster, but can go wrong if the type is ill-defined.
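
For instance, a sketch of what that looks like with the mkdiff group function from the examples above (passing float as dtype_out is an assumption here; the window example above used str):

# sketch: declare the output type of the custom group function up front
col("cumcases").apply_groups(mkdiff, dtype_out=float).alias("diff_cases")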

Next, I'd recommend using Polars methods over numpy when performance is important. Polars (especially the lazy API) tries to parallelize operations, but with custom Python functions it cannot, because of the Python GIL. Furthermore, for numpy interop the values need to be copied from numpy to Polars, which may be costly.
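
As an illustration, here is the same daily-difference computation kept entirely in Polars versus routed through numpy (a sketch using a small hand-made Series):

s = pl.Series("cumcases", [20, 40, 67, 80])

# native: no copy to numpy and no Python-level loop
daily = s - s.shift(1)

# numpy round trip: the data is copied out of and back into Polars
daily_np = np.hstack([np.array(np.nan), np.diff(np.array(s), 1)])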

version

I added a __version__ attribute.

import pypolars as pl
pl.__version__
'0.2.1'

multiple columns

I've added support for joining on multiple columns. The hack should not be needed anymore.
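
A sketch of the multi-column form, mirroring the on=["uid", "date"] usage that appears later in this thread (the two small frames here are made up for illustration):

df_a = pl.DataFrame({"uid": [1, 1, 2], "date": ["d1", "d2", "d1"], "cases": [10, 20, 5]})
df_b = pl.DataFrame({"uid": [1, 2], "date": ["d2", "d1"], "deaths": [3, 1]})

# no dummy key needed: pass both key columns to `on`
df_a.join(df_b, on=["uid", "date"], how="left")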

orioles79 commented 3 years ago

Thanks much for 0.2.0 and your code examples! They are most helpful.

Still struggling with my requirement, though. In that case, the number of rows returned is the same as the number in the input dataframe: rather than aggregating, I'm simply grabbing the groupby records a group at a time, operating on each group, and returning all its rows. Thought I'd provide an input dataframe along with what the output column/series would look like; 'daycases' at the end is the computation I'd like to perform with the groupby/apply capability. Here it is:

import pypolars as pl
pl.__version__

uid = list(flatten([6*[r] for r in range(3)]))
date = ["2020-12-20","2020-12-21","2020-12-22","2020-12-23","2020-12-24","2020-12-25"]
cumcases = [20,40,67,80,80,100]

jhudata = pl.DataFrame(
    {
        "uid": uid,
        "date": np.hstack([date, date, date]),
        "cumcases": np.hstack([cumcases, [10*c for c in cumcases], [100*c for c in cumcases]]),
    }
)

jhudata["date"] = jhudata["date"].str_parse_date(pl.datatypes.Date32, "%Y-%m-%d")

jhudata['key'] = [str(x)+"-"+str(y) for (x,y) in zip(jhudata.uid, jhudata.date)]

jhudata.sort(by_column='key', in_place=True)

jhudata['daycases'] = np.hstack([
    np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[:6]), 1)]),
    np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[6:12]), 1)]),
    np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[12:]), 1)]),
])

(18, 5) ["uid---<class 'pypolars.datatypes.Int32'>", "cumcases---<class 'pypolars.datatypes.Int32'>", "date---<class 'pypolars.datatypes.Date32'>", "key---<class 'pypolars.datatypes.Utf8'>", "daycases---<class 'pypolars.datatypes.Float64'>"]

thx again for your help.

orioles79 commented 3 years ago

Updated: I have cumcases and cumdeaths and would like to efficiently compute daycases and daydeaths. It appears that 0.2.0 is missing the to_pandas() and to_csv() functions.

def metapl(df):
    return([str(x)+"---"+str(y) for (x,y) in zip(df.columns, df.dtypes)])

import pypolars as pl
pl.__version__

uid = list(flatten([6*[r] for r in range(3)]))
date = ["2020-12-20","2020-12-21","2020-12-22","2020-12-23","2020-12-24","2020-12-25"]
cumcases = [20,40,67,80,80,100]

jhudata = pl.DataFrame(
    {
        "uid": uid,
        "date": np.hstack([date, date, date]),
        "cumcases": np.hstack([cumcases, [10*c for c in cumcases], [100*c for c in cumcases]]),
    }
)

jhudata['cumdeaths'] = (.25*np.array(jhudata.cumcases)).round()

jhudata["date"] = jhudata["date"].str_parse_date(pl.datatypes.Date32, "%Y-%m-%d")

jhudata['key'] = [str(x)+"-"+str(y) for (x,y) in zip(jhudata.uid, jhudata.date)]

jhudata.sort(by_column='key', in_place=True)

jhudata['daycases'] = np.hstack([
    np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[:6]), 1)]),
    np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[6:12]), 1)]),
    np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[12:]), 1)]),
])

print(jhudata.shape)
metapl(jhudata)

(18, 6) ["uid---<class 'pypolars.datatypes.Int32'>", "cumcases---<class 'pypolars.datatypes.Int32'>", "cumdeaths---<class 'pypolars.datatypes.Float64'>", "date---<class 'pypolars.datatypes.Date32'>", "key---<class 'pypolars.datatypes.Utf8'>", "daycases---<class 'pypolars.datatypes.Float64'>"]

thx for the support.

orioles79 commented 3 years ago

Installed 0.2.1 and the to_pandas() function is included.

Here's what the exercise might look like in pandas:

uid = list(flatten([6*[r] for r in range(3)]))
date = ["2020-12-20","2020-12-21","2020-12-22","2020-12-23","2020-12-24","2020-12-25"]
cumcases = [20,40,67,80,80,100]

jhudatapd = pd.DataFrame(
    {
        "uid": uid,
        "date": np.hstack([date, date, date]),
        "cumcases": np.hstack([cumcases, [10*c for c in cumcases], [100*c for c in cumcases]]),
    }
)

jhudatapd['date'] = pd.to_datetime(jhudatapd.date, format='%Y-%m-%d')
jhudatapd['cumdeaths'] = (.25*np.array(jhudatapd.cumcases)).round()

jhudatapd['key'] = [str(x)+"-"+str(y) for (x,y) in zip(jhudatapd.uid, jhudatapd.date)]

jhudatapd.sort_values(by=['uid', 'date'], inplace=True)

gvars = ['uid','cumcases','cumdeaths']
jhugrp = jhudatapd[gvars].groupby(['uid'])

def mkdiff(x):
    return(np.hstack([np.array(np.nan),np.diff(np.array(x),1)]))

slug = jhugrp.transform(mkdiff)
slug.columns = ['daycases','daydeaths']

jhufinal = pd.concat([jhudatapd, slug], axis=1, ignore_index=True)
jhufinal.columns = list(jhudatapd.columns)+list(slug.columns)

pvars = ['uid','date','cumcases','daycases','cumdeaths','daydeaths']
print(jhufinal[pvars])

    uid       date  cumcases  daycases  cumdeaths  daydeaths
0     0 2020-12-20        20       NaN        5.0        NaN
1     0 2020-12-21        40      20.0       10.0        5.0
2     0 2020-12-22        67      27.0       17.0        7.0
3     0 2020-12-23        80      13.0       20.0        3.0
4     0 2020-12-24        80       0.0       20.0        0.0
5     0 2020-12-25       100      20.0       25.0        5.0
6     1 2020-12-20       200       NaN       50.0        NaN
7     1 2020-12-21       400     200.0      100.0       50.0
8     1 2020-12-22       670     270.0      168.0       68.0
9     1 2020-12-23       800     130.0      200.0       32.0
10    1 2020-12-24       800       0.0      200.0        0.0
11    1 2020-12-25      1000     200.0      250.0       50.0
12    2 2020-12-20      2000       NaN      500.0        NaN
13    2 2020-12-21      4000    2000.0     1000.0      500.0
14    2 2020-12-22      6700    2700.0     1675.0      675.0
15    2 2020-12-23      8000    1300.0     2000.0      325.0
16    2 2020-12-24      8000       0.0     2000.0        0.0
17    2 2020-12-25     10000    2000.0     2500.0      500.0

regards.

ritchie46 commented 3 years ago

I think I understand what you are trying to do. If you update to 0.2.2 you can achieve this as follows:

Imports

import pypolars as pl
from pypolars.lazy import *
import numpy as np
pl.__version__
'0.2.2'

Setup

uid = [item for sublist in [6 * [r] for r in range(3)] for item in sublist]
date = [
    "2020-12-20",
    "2020-12-21",
    "2020-12-22",
    "2020-12-23",
    "2020-12-24",
    "2020-12-25",
]
cumcases = [20, 40, 67, 80, 80, 100]
jhudata = pl.DataFrame(
    {
        "uid": uid,
        "date": np.hstack([date, date, date]),
        "cumcases": np.hstack([cumcases, [c for c in cumcases], [c for c in cumcases]]),
    }
)
jhudata["cumdeaths"] = (0.25 * np.array(jhudata.cumcases)).round()

print(jhudata)
shape: (18, 4)
╭─────┬────────────┬──────────┬───────────╮
│ uid ┆ date       ┆ cumcases ┆ cumdeaths │
│ --- ┆ ---        ┆ ---      ┆ ---       │
│ i64 ┆ str        ┆ i64      ┆ f64       │
╞═════╪════════════╪══════════╪═══════════╡
│ 0   ┆ 2020-12-20 ┆ 20       ┆ 5         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 2020-12-21 ┆ 40       ┆ 10        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 2020-12-22 ┆ 67       ┆ 17        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 2020-12-23 ┆ 80       ┆ 20        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ...        ┆ ...      ┆ ...       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-21 ┆ 40       ┆ 10        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-22 ┆ 67       ┆ 17        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-23 ┆ 80       ┆ 20        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-24 ┆ 80       ┆ 20        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-25 ┆ 100      ┆ 25        │
╰─────┴────────────┴──────────┴───────────╯

Determine difference per group.

def mkdiff(cumcases):
    return cumcases - cumcases.shift(1)

base_df = (
    jhudata.lazy()
    # first parse column as date32
    .with_column(col("date").str_parse_date(pl.Date32))
    # next create a sorting key defined by the group uid + date_integer
    .with_column(
        (col("uid").cast(str) + lit("-") + col("date").cast(int)).alias("sort_key")
    )
    # sort all values on the sorting key so that
    # the mkdiff function gets sorted values on date per group
    .sort("sort_key")
)

# Next we group by uid and aggregate to different
# Series lists that we later explode and join back on the main DataFrame
(
    base_df.groupby("uid")
    .agg(
        [
            col("date").list().alias("date"),
            col("cumcases").apply_groups(mkdiff).alias("diff_cases"),
            col("cumdeaths").apply_groups(mkdiff).alias("diff_deaths"),
        ]
    )
    .explode(["date", "diff_cases", "diff_deaths"])
    .join(base_df, on=["uid", "date"])
).collect()
shape: (18, 7)
╭─────┬──────────────┬────────────┬─────────────┬──────────┬───────────┬──────────╮
│ uid ┆ date         ┆ diff_cases ┆ diff_deaths ┆ cumcases ┆ cumdeaths ┆ sort_key │
│ --- ┆ ---          ┆ ---        ┆ ---         ┆ ---      ┆ ---       ┆ ---      │
│ i64 ┆ date32(days) ┆ i64        ┆ f64         ┆ i64      ┆ f64       ┆ str      │
╞═════╪══════════════╪════════════╪═════════════╪══════════╪═══════════╪══════════╡
│ 0   ┆ 2020-12-20   ┆ null       ┆ null        ┆ 20       ┆ 5         ┆ 0-18616  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 2020-12-21   ┆ 20         ┆ 5           ┆ 40       ┆ 10        ┆ 0-18617  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 2020-12-22   ┆ 27         ┆ 7           ┆ 67       ┆ 17        ┆ 0-18618  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 2020-12-23   ┆ 13         ┆ 3           ┆ 80       ┆ 20        ┆ 0-18619  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ...          ┆ ...        ┆ ...         ┆ ...      ┆ ...       ┆ ...      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-21   ┆ 20         ┆ 5           ┆ 40       ┆ 10        ┆ 2-18617  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-22   ┆ 27         ┆ 7           ┆ 67       ┆ 17        ┆ 2-18618  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-23   ┆ 13         ┆ 3           ┆ 80       ┆ 20        ┆ 2-18619  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-24   ┆ 0          ┆ 0.0         ┆ 80       ┆ 20        ┆ 2-18620  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2020-12-25   ┆ 20         ┆ 5           ┆ 100      ┆ 25        ┆ 2-18621  │
╰─────┴──────────────┴────────────┴─────────────┴──────────┴───────────┴──────────╯
orioles79 commented 3 years ago

Thanks. I updated to 0.2.3 and got your code to work.

My efforts with my 1M+ record dataframe have not gone well, however: the performance is very poor, seemingly exponential in the number of records. I suspect the problem may well be my own ignorance of the details of pypolars programming.

Here's some info. I'd be happy to get you a copy of the input dataframe.

import pypolars as pl
from pypolars.lazy import *
import numpy as np
pl.__version__

'0.2.3'

print(jhucombinem.shape)
print(jhucombinem.uid.unique().len())
metapp(jhucombinem)

(1142280, 6) 3340 ["key---<class 'pypolars.datatypes.Utf8'>", "uid---<class 'pypolars.datatypes.Int64'>", "state---<class 'pypolars.datatypes.Utf8'>", "date---<class 'pypolars.datatypes.Date32'>", "cumdeaths---<class 'pypolars.datatypes.Int64'>", "cumcases---<class 'pypolars.datatypes.Int64'>"]

start = time.time()

def mkdiff(cumcases):
    return cumcases - cumcases.shift(1)

slug = (
    # jhucombinem.lazy()
    jhucombinem[:250000].lazy()
    .sort("key")
)

(
    slug.groupby("uid")
    .agg(
        [
            col("date").list().alias("date"),
            col("cumcases").apply_groups(mkdiff).alias("daycases"),
            col("cumdeaths").apply_groups(mkdiff).alias("daydeaths"),
        ]
    )
    .explode(["date", "daycases", "daydeaths"])
    .join(slug, on=["uid", "date"])
).collect()

end = time.time()
print(end - start)

blanks(2)

166.34725785255432

How do you create a DataFrame from a pypolars.lazy.LazyFrame?

thx.

ritchie46 commented 3 years ago

Yes. If you could send me your dataset, I could take a look.

P.S. You can use markdown to format the code you paste in here. That would make it more readable :wink:

ritchie46 commented 3 years ago

The explode implementation was ridiculously slow. It turns out that the data is already stored in memory in exploded format, so the new implementation is insanely fast. The whole query on your 1M+ rows now runs in ~1.3 seconds on my machine. A new release is now being built under tag 0.2.4. Let me know if it succeeds on your side.

orioles79 commented 3 years ago

Ritchie —

Great! I look forward to additional experimentation. Is there a from_pandas() function to convert from pandas to pypolars? Also, I want to continue examining the agg function, which appears pretty flexible, capable of performing like pandas's transform and apply.

the JHU cases and deaths data make for a good test — from read to melt to join to apply. I’ve successfully put the data through its paces with R data.table, Python pandas, and Julia dataframe. Python datatable is promising but not yet ready. I’m liking what I see of pypolars.

Regards, Steve


ritchie46 commented 3 years ago

Great to hear. Yes, you can convert to pandas with the to_pandas method of DataFrame.
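
A minimal sketch (assuming pandas is installed alongside pypolars):

import pypolars as pl

df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
pdf = df.to_pandas()  # pypolars DataFrame -> pandas DataFrame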