Closed: orioles79 closed this issue 3 years ago
A Polars Series has an apply method which allows you to run a Python function over a Series. Read more in the book. However, the mkdiff function could be replaced by something like this:
s = pl.Series("foo", [1, 5, 10])
s.shift(-1) - s
Series: 'foo' [i64]
[
4
5
null
]
finally, are there functions to easily rename and reorder columns in a pypolars dataframe?

You can reorder a DataFrame by indexing the new column order.
Reorder
df = pl.DataFrame({"a": [1, 2], "b": [2, 3]})
df_new = df[["b", "a"]]
Rename

Renaming can be done by overwriting the columns attribute:
df.columns = ["new_a", "new_b"]
also, are dataframe sorts and joins limited to a single column key?

Sorts are currently limited to a single column (what would the ordering of two column values be?). Joins are limited to a single key column for the moment, but could easily be extended to multiple columns. For now, you could try a hack by creating a dummy column of utf8 values (make sure you've got py-polars >= 0.1.1):
df_a = pl.DataFrame({"a": [1, 2, 1, 1], "b": ["a", "b", "c", "c"], "c": np.arange(4)})
df_b = pl.DataFrame({"foo": [1, 1, 1], "bar": ["a", "c", "c"], "ham": ["let", "var", "const"]})
df_a["dummy"] = df_a["a"].cast(str) + df_a["b"].cast(str)
df_b["dummy"] = df_b["foo"].cast(str) + df_b["bar"].cast(str)
df_a.join(df_b, on="dummy", how="left")
thanks a lot for the response. very helpful.
much of my difficulty may be in versions of the library and doc.
so here's my dataframe, jhucombinem (shape, var names, and var types). the data are sorted by uid, then date within uid. i had created the dummy key concatenating the two. I wish to group by uid, then compute the difference in cumcases and cumdeaths for each date. Though I'm interested in the pypolars functions, I'd really like to do these calculations using python/numpy.
(1132260, 5) ["key---<class 'pypolars.datatypes.Utf8'>", "uid---<class 'pypolars.datatypes.Int64'>", "date---<class 'pypolars.datatypes.Date32'>", "cumdeaths---<class 'pypolars.datatypes.Int64'>", "cumcases---<class 'pypolars.datatypes.Int64'>"]
AttributeError Traceback (most recent call last)
I've updated py-polars to better accommodate your problem of applying custom functions after a groupby operation.
I don't know if you are using the lazy or eager API, so I will give you examples of both.
df = pl.DataFrame({"a": [1, 2, 1, 1],
"b": ["a", "b", "c", "c"]})
def mkdiff(cumcases):
    out = np.hstack([np.array(np.nan), np.diff(np.array(cumcases), 1)])
    return pl.Series(out)  # make sure to return a Series!
(df.lazy()
    .groupby("b")
    .agg([col("a").apply_groups(mkdiff)])
    .sort("b")
    .collect()
)
df.groupby("b").select("a").apply(mkdiff).sort("b")
Both output
shape: (3, 2)
╭─────┬──────────────╮
│ b ┆ a │
│ --- ┆ --- │
│ str ┆ list [null] │
╞═════╪══════════════╡
│ a ┆ "[NaN]" │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ "[NaN]" │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ "[NaN, 0.0]" │
╰─────┴──────────────╯
If you want to groupby, apply a function on the groups and join back to the main DataFrame I would recommend using the window functions.
df = pl.DataFrame(
{
"fruits": ["banana", "banana", "apple", "apple", "banana"],
"cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
}
)
def concat_fruits(x: pl.Series) -> str:
    concat = ""
    for fruit in x:
        concat += fruit
    return concat
windows = df.lazy().with_columns(
    [
        col("fruits").apply_groups(concat_fruits, dtype_out=str).over("cars").alias("custom")
    ]
)
print(windows.sort("fruits").collect())
Outputs
shape: (5, 3)
╭────────┬────────┬────────────────────────╮
│ fruits ┆ cars ┆ custom │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════╪════════╪════════════════════════╡
│ apple ┆ beetle ┆ bananaappleapplebanana │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ beetle ┆ bananaappleapplebanana │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ beetle ┆ bananaappleapplebanana │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ audi ┆ banana │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ beetle ┆ bananaappleapplebanana │
╰────────┴────────┴────────────────────────╯
I'd recommend always setting the dtype_out value when dealing with custom group functions. Polars is not based on numpy and tries to determine the type of the output Series upfront. This makes custom functions faster, but it can go wrong if the type is ill-defined.
Next, I'd recommend using polars methods in favor of numpy when performance is important. Polars (especially the lazy API) tries to parallelize operations, but it cannot do so for custom Python functions because of the Python GIL. Furthermore, for numpy interop the values need to be copied from numpy to polars, which may be costly.
I added a version.
import pypolars as pl
pl.__version__
'0.2.1'
I've added support for joining on multiple columns. The hack should not be needed anymore.
thanks much for 0.2.0 and your code examples! they are most helpful.
still struggling with my requirement though. in my case, the number of rows returned is the same as the number in the input dataframe. rather than aggregating, i'm simply grabbing a group's records at a time, operating on each, and returning them. thought i'd provide you an input dataframe along with what the output column/series would look like. 'daycases' at the end is the computation i'd like to perform with the groupby/apply capability. here it is:
import pypolars as pl
pl.__version__

uid = list(flatten([6*[r] for r in range(3)]))
date = ["2020-12-20","2020-12-21","2020-12-22","2020-12-23","2020-12-24","2020-12-25"]
cumcases = [20, 40, 67, 80, 80, 100]

jhudata = pl.DataFrame(
    {
        "uid": uid,
        "date": np.hstack([date, date, date]),
        "cumcases": np.hstack([cumcases, [10*c for c in cumcases], [100*c for c in cumcases]]),
    }
)
jhudata["date"] = jhudata["date"].str_parse_date(pl.datatypes.Date32, "%Y-%m-%d")
jhudata["key"] = [str(x) + "-" + str(y) for (x, y) in zip(jhudata.uid, jhudata.date)]
jhudata.sort(by_column="key", in_place=True)
jhudata["daycases"] = np.hstack(
    [
        np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[:6]), 1)]),
        np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[6:12]), 1)]),
        np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[12:]), 1)]),
    ]
)
(18, 5) ["uid---<class 'pypolars.datatypes.Int32'>", "cumcases---<class 'pypolars.datatypes.Int32'>", "date---<class 'pypolars.datatypes.Date32'>", "key---<class 'pypolars.datatypes.Utf8'>", "daycases---<class 'pypolars.datatypes.Float64'>"]
thx again for your help.
updated. i have cumcases and cumdeaths and would like to efficiently compute daycases and daydeaths. it appears that 0.2.0 is missing to_pandas() and to_csv() functions.
def metapl(df):
    return [str(x) + "---" + str(y) for (x, y) in zip(df.columns, df.dtypes)]

import pypolars as pl
pl.__version__

uid = list(flatten([6*[r] for r in range(3)]))
date = ["2020-12-20","2020-12-21","2020-12-22","2020-12-23","2020-12-24","2020-12-25"]
cumcases = [20, 40, 67, 80, 80, 100]

jhudata = pl.DataFrame(
    {
        "uid": uid,
        "date": np.hstack([date, date, date]),
        "cumcases": np.hstack([cumcases, [10*c for c in cumcases], [100*c for c in cumcases]]),
    }
)
jhudata["cumdeaths"] = (0.25 * np.array(jhudata.cumcases)).round()
jhudata["date"] = jhudata["date"].str_parse_date(pl.datatypes.Date32, "%Y-%m-%d")
jhudata["key"] = [str(x) + "-" + str(y) for (x, y) in zip(jhudata.uid, jhudata.date)]
jhudata.sort(by_column="key", in_place=True)
jhudata["daycases"] = np.hstack(
    [
        np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[:6]), 1)]),
        np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[6:12]), 1)]),
        np.hstack([np.array(np.nan), np.diff(np.array(jhudata.cumcases[12:]), 1)]),
    ]
)

print(jhudata.shape)
metapl(jhudata)
(18, 6) ["uid---<class 'pypolars.datatypes.Int32'>", "cumcases---<class 'pypolars.datatypes.Int32'>", "cumdeaths---<class 'pypolars.datatypes.Float64'>", "date---<class 'pypolars.datatypes.Date32'>", "key---<class 'pypolars.datatypes.Utf8'>", "daycases---<class 'pypolars.datatypes.Float64'>"]
thx for the support.
installed 0.2.1 and the to_pandas() function is included.
here's what the exercise might look like in pandas:
uid = list(flatten([6*[r] for r in range(3)]))
date = ["2020-12-20","2020-12-21","2020-12-22","2020-12-23","2020-12-24","2020-12-25"]
cumcases = [20, 40, 67, 80, 80, 100]

jhudatapd = pd.DataFrame(
    {
        "uid": uid,
        "date": np.hstack([date, date, date]),
        "cumcases": np.hstack([cumcases, [10*c for c in cumcases], [100*c for c in cumcases]]),
    }
)
jhudatapd["date"] = pd.to_datetime(jhudatapd.date, format="%Y-%m-%d")
jhudatapd["cumdeaths"] = (0.25 * np.array(jhudatapd.cumcases)).round()
jhudatapd["key"] = [str(x) + "-" + str(y) for (x, y) in zip(jhudatapd.uid, jhudatapd.date)]
jhudatapd.sort_values(by=["uid", "date"], inplace=True)

gvars = ["uid", "cumcases", "cumdeaths"]
jhugrp = jhudatapd[gvars].groupby(["uid"])

def mkdiff(x):
    return np.hstack([np.array(np.nan), np.diff(np.array(x), 1)])

slug = jhugrp.transform(mkdiff)
slug.columns = ["daycases", "daydeaths"]

jhufinal = pd.concat([jhudatapd, slug], axis=1, ignore_index=True)
jhufinal.columns = list(jhudatapd.columns) + list(slug.columns)

pvars = ["uid", "date", "cumcases", "daycases", "cumdeaths", "daydeaths"]
print(jhufinal[pvars])
    uid       date  cumcases  daycases  cumdeaths  daydeaths
0     0 2020-12-20        20       NaN        5.0        NaN
1     0 2020-12-21        40      20.0       10.0        5.0
2     0 2020-12-22        67      27.0       17.0        7.0
3     0 2020-12-23        80      13.0       20.0        3.0
4     0 2020-12-24        80       0.0       20.0        0.0
5     0 2020-12-25       100      20.0       25.0        5.0
6     1 2020-12-20       200       NaN       50.0        NaN
7     1 2020-12-21       400     200.0      100.0       50.0
8     1 2020-12-22       670     270.0      168.0       68.0
9     1 2020-12-23       800     130.0      200.0       32.0
10    1 2020-12-24       800       0.0      200.0        0.0
11    1 2020-12-25      1000     200.0      250.0       50.0
12    2 2020-12-20      2000       NaN      500.0        NaN
13    2 2020-12-21      4000    2000.0     1000.0      500.0
14    2 2020-12-22      6700    2700.0     1675.0      675.0
15    2 2020-12-23      8000    1300.0     2000.0      325.0
16    2 2020-12-24      8000       0.0     2000.0        0.0
17    2 2020-12-25     10000    2000.0     2500.0      500.0
regards.
I think I understand what you are trying to do. If you update to 0.2.2
you can achieve this as follows:
import pypolars as pl
from pypolars.lazy import *
import numpy as np
pl.__version__
'0.2.2'
uid = [item for sublist in [6 * [r] for r in range(3)] for item in sublist]
date = [
"2020-12-20",
"2020-12-21",
"2020-12-22",
"2020-12-23",
"2020-12-24",
"2020-12-25",
]
cumcases = [20, 40, 67, 80, 80, 100]
jhudata = pl.DataFrame(
{
"uid": uid,
"date": np.hstack([date, date, date]),
"cumcases": np.hstack([cumcases, [c for c in cumcases], [c for c in cumcases]]),
}
)
jhudata["cumdeaths"] = (0.25 * np.array(jhudata.cumcases)).round()
print(jhudata)
shape: (18, 4)
╭─────┬────────────┬──────────┬───────────╮
│ uid ┆ date ┆ cumcases ┆ cumdeaths │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞═════╪════════════╪══════════╪═══════════╡
│ 0 ┆ 2020-12-20 ┆ 20 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 2020-12-21 ┆ 40 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 2020-12-22 ┆ 67 ┆ 17 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 2020-12-23 ┆ 80 ┆ 20 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-21 ┆ 40 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-22 ┆ 67 ┆ 17 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-23 ┆ 80 ┆ 20 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-24 ┆ 80 ┆ 20 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-25 ┆ 100 ┆ 25 │
╰─────┴────────────┴──────────┴───────────╯
def mkdiff(cumcases):
    return cumcases - cumcases.shift(1)
base_df = (
jhudata.lazy()
# first parse column as date32
.with_column(col("date").str_parse_date(pl.Date32))
# next create a sorting key defined by the group uid + date_integer
.with_column(
(col("uid").cast(str) + lit("-") + col("date").cast(int)).alias("sort_key")
)
# sort all values on the sorting key so that
    # the mkdiff function gets sorted values on date per group
.sort("sort_key")
)
# Next we group by uid and aggregate to different
# Series lists that we later explode and join back on the main DataFrame
(
base_df.groupby("uid")
.agg(
[
col("date").list().alias("date"),
col("cumcases").apply_groups(mkdiff).alias("diff_cases"),
col("cumdeaths").apply_groups(mkdiff).alias("diff_deaths"),
]
)
.explode(["date", "diff_cases", "diff_deaths"])
.join(base_df, on=["uid", "date"])
).collect()
shape: (18, 7)
╭─────┬──────────────┬────────────┬─────────────┬──────────┬───────────┬──────────╮
│ uid ┆ date ┆ diff_cases ┆ diff_deaths ┆ cumcases ┆ cumdeaths ┆ sort_key │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date32(days) ┆ i64 ┆ f64 ┆ i64 ┆ f64 ┆ str │
╞═════╪══════════════╪════════════╪═════════════╪══════════╪═══════════╪══════════╡
│ 0 ┆ 2020-12-20 ┆ null ┆ null ┆ 20 ┆ 5 ┆ 0-18616 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 2020-12-21 ┆ 20 ┆ 5 ┆ 40 ┆ 10 ┆ 0-18617 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 2020-12-22 ┆ 27 ┆ 7 ┆ 67 ┆ 17 ┆ 0-18618 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 2020-12-23 ┆ 13 ┆ 3 ┆ 80 ┆ 20 ┆ 0-18619 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-21 ┆ 20 ┆ 5 ┆ 40 ┆ 10 ┆ 2-18617 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-22 ┆ 27 ┆ 7 ┆ 67 ┆ 17 ┆ 2-18618 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-23 ┆ 13 ┆ 3 ┆ 80 ┆ 20 ┆ 2-18619 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-24 ┆ 0 ┆ 0.0 ┆ 80 ┆ 20 ┆ 2-18620 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-12-25 ┆ 20 ┆ 5 ┆ 100 ┆ 25 ┆ 2-18621 │
╰─────┴──────────────┴────────────┴─────────────┴──────────┴───────────┴──────────╯
thx. I updated to 0.2.3 and got your code to work.
my efforts with my 1M+ record dataframe have not gone well, however: the performance is very poor, seemingly exponential in the number of records. I suspect the problem may well be my own ignorance of the details of pypolars programming.
here's some info. I'd be happy to get you a copy of the input dataframe.
import pypolars as pl
from pypolars.lazy import *
import numpy as np
pl.__version__
'0.2.3'

print(jhucombinem.shape)
print(jhucombinem.uid.unique().len())
metapp(jhucombinem)
(1142280, 6)
3340
["key---<class 'pypolars.datatypes.Utf8'>", "uid---<class 'pypolars.datatypes.Int64'>", "state---<class 'pypolars.datatypes.Utf8'>", "date---<class 'pypolars.datatypes.Date32'>", "cumdeaths---<class 'pypolars.datatypes.Int64'>", "cumcases---<class 'pypolars.datatypes.Int64'>"]

start = time.time()

def mkdiff(cumcases):
    return cumcases - cumcases.shift(1)

slug = (
    jhucombinem[:250000].lazy()
    .sort("key")
)

(
    slug.groupby("uid")
    .agg(
        [
            col("date").list().alias("date"),
            col("cumcases").apply_groups(mkdiff).alias("daycases"),
            col("cumdeaths").apply_groups(mkdiff).alias("daydeaths"),
        ]
    )
    .explode(["date", "daycases", "daydeaths"])
    .join(slug, on=["uid", "date"])
).collect()

end = time.time()
print(end - start)
blanks(2)
166.34725785255432
how do you create a dataframe from a pypolars.lazy.LazyFrame?
thx.
Yes.. If you could send me your dataset I could take a look.
P.S. You can use markdown to format the code you paste in here. That would make it more readable :wink:
The explode implementation was ridiculously slow. It turns out that the data is already stored in memory in exploded format, so the new implementation is insanely fast. The whole query on your 1M+ rows now runs in ~1.3 seconds on my machine. A new release is being built under tag 0.2.4. Let me know if it succeeds on your side.
Ritchie —
great! I look forward to additional experimentation. is there a from_pandas() function to convert from pandas to pypolars? also, I want to continue examining the agg function, which appears pretty flexible, capable of performing like pandas’s transform and apply.
the JHU cases and deaths data make for a good test — from read to melt to join to apply. I’ve successfully put the data through its paces with R data.table, Python pandas, and Julia dataframe. Python datatable is promising but not yet ready. I’m liking what I see of pypolars.
Regards, Steve
i have cases/fatalities covid data from johns hopkins that i've been able to easily work through with R data.table, pandas dataframe, and julia dataframe. I've started with pypolars but seem to have hit a snag with the split/apply/combine functionality. the "key" to the dataframe i'm working with is the combination (uid,date). the data that JHU provides for that key are cumulative cases and fatalities. i need to split the data on uid, then for each grouping, compute daily cases/fatalities as the difference in successive cumulatives in date order, returning all records. i would simply like to define a function much like ones i've written with pandas/julia that look like this:
is there an apply function that can accept a python routine like the above and allow me to write something like this:
also, are dataframe sorts and joins limited to a single column key?
finally, are there functions to easily rename and reorder columns in a pypolars dataframe?