Before asking here, I checked the issues in the Polars repository, and it seems like nobody is aware of such a problem, even five days after the release of the update.
Hi @maxim-uvarov I have released a new version; the build on my box takes 150ms to run your query against your dataset:
time dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()'
shape: (21, 3)
┌──────┬────────┬──────────┐
│ year ┆ count ┆ sum │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪════════╪══════════╡
│ 2000 ┆ 240730 ┆ 9107793 │
│ 2001 ┆ 239301 ┆ 9034416 │
│ 2002 ┆ 240243 ┆ 9128109 │
│ 2003 ┆ 243470 ┆ 9458073 │
│ 2004 ┆ 252073 ┆ 10273596 │
│ … ┆ … ┆ … │
│ 2016 ┆ 266958 ┆ 12551199 │
│ 2017 ┆ 268989 ┆ 12894723 │
│ 2018 ┆ 270072 ┆ 13031973 │
│ 2019 ┆ 272229 ┆ 13297923 │
│ 2020 ┆ 273265 ┆ 13517976 │
└──────┴────────┴──────────┘
dply -c 0.62s user 0.09s system 471% cpu 0.151 total
Give it a try and let me know how it goes.
@vincev Thank you for such a swift release!
I installed the new version, but I haven't noticed any changes. I'm wondering what your performance was like before the new release. Did you experience the same slowness?
> time dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()'
shape: (21, 3)
┌──────┬────────┬──────────┐
│ year ┆ count ┆ sum │
...
│ 2020 ┆ 273265 ┆ 13517976 │
└──────┴────────┴──────────┘
dply -c 0.48s user 25.10s system 365% cpu 7.001 total
> dply -V
dply 0.3.1
@maxim-uvarov I have tried version 0.3.0, which used Polars 0.38.3, and the performance is about the same:
$ sw_vers
ProductName: macOS
ProductVersion: 14.5
BuildVersion: 23F79
$ dply --version
dply 0.3.0
$ time dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()'
shape: (21, 3)
┌──────┬────────┬──────────┐
│ year ┆ count ┆ sum │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪════════╪══════════╡
│ 2000 ┆ 240730 ┆ 9107793 │
│ 2001 ┆ 239301 ┆ 9034416 │
│ 2002 ┆ 240243 ┆ 9128109 │
│ 2003 ┆ 243470 ┆ 9458073 │
│ 2004 ┆ 252073 ┆ 10273596 │
│ … ┆ … ┆ … │
│ 2016 ┆ 266958 ┆ 12551199 │
│ 2017 ┆ 268989 ┆ 12894723 │
│ 2018 ┆ 270072 ┆ 13031973 │
│ 2019 ┆ 272229 ┆ 13297923 │
│ 2020 ┆ 273265 ┆ 13517976 │
└──────┴────────┴──────────┘
dply -c 0.65s user 0.09s system 511% cpu 0.144 total
so it looks like the drop in performance you have observed is not related to Polars.
One thing I have noticed from your timings is that the system time is very high; could it be that you are low on memory?
Maybe you can try using /usr/bin/time -l, which gives more info. This is what I get:
/usr/bin/time -l dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()'
shape: (21, 3)
┌──────┬────────┬──────────┐
│ year ┆ count ┆ sum │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪════════╪══════════╡
│ 2000 ┆ 240730 ┆ 9107793 │
│ 2001 ┆ 239301 ┆ 9034416 │
│ 2002 ┆ 240243 ┆ 9128109 │
│ 2003 ┆ 243470 ┆ 9458073 │
│ 2004 ┆ 252073 ┆ 10273596 │
│ … ┆ … ┆ … │
│ 2016 ┆ 266958 ┆ 12551199 │
│ 2017 ┆ 268989 ┆ 12894723 │
│ 2018 ┆ 270072 ┆ 13031973 │
│ 2019 ┆ 272229 ┆ 13297923 │
│ 2020 ┆ 273265 ┆ 13517976 │
└──────┴────────┴──────────┘
0.14 real 0.62 user 0.08 sys
473284608 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
39063 page reclaims
29 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
750 involuntary context switches
6084745642 instructions retired
1553002484 cycles elapsed
384780992 peak memory footprint
@vincev, yep, it seems like something is messing with dply on my system. I have 32 GB of RAM and have never had problems because of it.
I have 18888 involuntary context switches versus your 750.
I'll try to debug it today. I will keep you informed about my findings.
@vincev Thank you so much for your attention!
/usr/bin/time -l dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()'
shape: (21, 3)
┌──────┬────────┬──────────┐
│ year ┆ count ┆ sum │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪════════╪══════════╡
│ 2000 ┆ 240730 ┆ 9107793 │
│ 2001 ┆ 239301 ┆ 9034416 │
│ 2002 ┆ 240243 ┆ 9128109 │
│ 2003 ┆ 243470 ┆ 9458073 │
│ 2004 ┆ 252073 ┆ 10273596 │
│ … ┆ … ┆ … │
│ 2016 ┆ 266958 ┆ 12551199 │
│ 2017 ┆ 268989 ┆ 12894723 │
│ 2018 ┆ 270072 ┆ 13031973 │
│ 2019 ┆ 272229 ┆ 13297923 │
│ 2020 ┆ 273265 ┆ 13517976 │
└──────┴────────┴──────────┘
27.91 real 0.43 user 168.87 sys
523173888 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
39232 page reclaims
515 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
18888 involuntary context switches
108996644731 instructions retired
514845525269 cycles elapsed
370100992 peak memory footprint
I installed a VM with macOS 14.5 on a Mac running macOS 14.5. Inside the virtual machine, dply runs very fast. So I can conclude that the problem is not with 14.5 itself, but with my host operating system.
Also, I disabled my login applications and restarted, but no luck. I started macOS in safe mode, and again, no luck.
This is very strange. One thing you could try, to see if it is disk related, is to convert the data to Parquet and see if there is any improvement:
dply -c 'csv("./Data7602DescendingYearOrder.csv") | parquet("data.parquet")'
Or you could run a profiler like Instruments to see what is taking all the time, but that is a lot of work.
@vincev 👏👏👏 indeed! converting to parquet helped! How did you know that??
> 1..100 | each {timeit {dply -c 'parquet("./data.parquet") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬──────────────────╮
│ max │ 73ms 401µs 917ns │
│ min │ 61ms 454µs 209ns │
│ median │ 65ms 562µs 166ns │
╰────────┴──────────────────╯
> 1..10 | each {timeit {dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬─────────────────────────╮
│ max │ 23sec 630ms 989µs 542ns │
│ min │ 1sec 57ms 889µs 667ns │
│ median │ 3sec 658ms 593µs 437ns │
╰────────┴─────────────────────────╯
So the bug might be in the CSV reader, right? I have been working with 250MB CSV files using dataframes in nushell without any issues for a long time. If they had become slow, I would have noticed.
So I'm guessing that something might have changed in the Polars CSV reader.
@vincev, what do you think? Or maybe you have other ideas?
These fluctuations with the CSV file are a bit weird. Here are two tests with Parquet and two with CSV on my MBP M1 Max, 32 GB RAM, 1 TB SSD.
Parquet
❯ 1..100 | each {timeit {polars open data.parquet | polars group-by year | polars agg (polars col geo_count | polars sum) | polars collect| null}} | | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬──────────────────╮
│ max │ 43ms 129µs 709ns │
│ min │ 20ms 727µs 166ns │
│ median │ 21ms 661µs 562ns │
╰────────┴──────────────────╯
❯ 1..100 | each {timeit {polars open data.parquet | polars group-by year | polars agg (polars col geo_count | polars sum) | polars collect| null}} | | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬──────────────────╮
│ max │ 37ms 550µs 542ns │
│ min │ 20ms 334µs 292ns │
│ median │ 21ms 293µs 896ns │
╰────────┴──────────────────╯
CSV
❯ 1..10 | each {timeit {dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬────────────────────────╮
│ max │ 21sec 75ms 635µs 958ns │
│ min │ 538ms 485µs 291ns │
│ median │ 5sec 547ms 601µs 520ns │
╰────────┴────────────────────────╯
❯ 1..10 | each {timeit {dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬──────────────────────────────╮
│ max │ 1min 16sec 684ms 528µs 250ns │
│ min │ 553ms 545µs 500ns │
│ median │ 1sec 784ms 562µs 375ns │
╰────────┴──────────────────────────────╯
I have a 4 TB hard drive with lots of small text files on it. Maybe some indexing process that was initiated after the upgrade and hasn't finished yet is misbehaving.
@fdncred do you have any deviations from the average user in terms of hard drive size or the number of files, especially text files?
Not really. I don't keep a lot of small text files around on purpose, other than Rust code in repos.
> converting to parquet helped! How did you know that??
@maxim-uvarov I didn't really know, it was just a guess: when the system time is very high, it may be that the process is doing a lot of IO (networking/disk), so trying a more efficient file format may help. The Parquet file is 9x smaller, with the data encoded in binary.
Regarding the Polars CSV reader, I have tried to run the tests you ran above on my box and I get consistent timings:
> 1..100 | each {timeit {dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 153ms 753µs 250ns │
│ min │ 112ms 824µs 459ns │
│ median │ 117ms 658µs 729ns │
╰────────┴───────────────────╯
And when you ran it in the VM you got good results too, so it looks like the high variation in timings may be due to some external factor, but it is hard to say without spending time profiling.
@fdncred the polars plugin looks very nice.
> @fdncred the polars plugin looks very nice.
Thanks. We're working on it, but we're seeing some inconsistencies; that's when we decided to try dply, since it's somewhat related. We wanted to see whether we observed the same behavior, specifically with CSV files. I'm still not sure what's going on specifically with macOS.
These timings are from an old desktop system on Windows and seem somewhat reasonable with both Parquet and CSV.
dply with csv
❯ 1..100 | each {timeit {dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 388ms 375µs 700ns │
│ min │ 264ms 181µs 400ns │
│ median │ 286ms 849µs 200ns │
╰────────┴───────────────────╯
dply with parquet
❯ 1..100 | each {timeit {dply -c 'parquet("./data.parquet") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 315ms 194µs 500ns │
│ min │ 175ms 592µs 100ns │
│ median │ 193ms 747µs 300ns │
╰────────┴───────────────────╯
nushell with csv
❯ 1..100 | each {timeit {polars open Data7602DescendingYearOrder.csv | polars group-by year | polars agg (polars col geo_count | polars sum) | polars collect| null}} | | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 221ms 192µs 800ns │
│ min │ 147ms 270µs 200ns │
│ median │ 163ms 65µs 750ns │
╰────────┴───────────────────╯
nushell with parquet (the max and min are pretty far apart here for some reason)
❯ 1..100 | each {timeit {polars open data.parquet | polars group-by year | polars agg (polars col geo_count | polars sum) | polars collect| null}} | | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 491ms 423µs 200ns │
│ min │ 75ms 194µs │
│ median │ 91ms 373µs 200ns │
╰────────┴───────────────────╯
@vincev Also, I've definitely noticed that my battery drains while the Mac is idle, so perhaps there is a lot of IO happening. And I really believe it is related to the 14.5 update.
@maxim-uvarov another test you can do is to try DuckDB, which uses a completely different code base and language, and see if you notice the same variance:
1..100 | each {timeit {duckdb -c "select year, count(*) n, sum(geo_count) s from './Data7602DescendingYearOrder.csv' group by year order by 1" | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 266ms 566µs 625ns │
│ min │ 132ms 364µs 958ns │
│ median │ 138ms 10µs 667ns │
╰────────┴───────────────────╯
Maybe you can use Activity Monitor to see if it shows processes doing lots of IO following the upgrade.
@vincev Many thanks for keeping us in mind!
No problem with DuckDB, and no problem with Python Polars either. Everything runs very smoothly.
> 1..100 | each {timeit {duckdb -c "select year, count(*) n, sum(geo_count) s from './Data7602DescendingYearOrder.csv' group by year order by 1" | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 119ms 656µs 708ns │
│ min │ 108ms 837µs 41ns │
│ median │ 113ms 817µs 229ns │
╰────────┴───────────────────╯
I'm going to learn how to use the Instruments Mac profiler for my case. But before that, I want to complete some other computational tasks that are occupying my mind.
polars-cli is also slow with CSV.
P.S. I renamed polars to polars-cli, as that name is already used in nushell for the plugin.
> timeit {polars-cli -c "select year, count(*) n, sum(geo_count) s from read_csv('Data7602DescendingYearOrder.csv') group by year;"}
┌──────┬────────┬──────────┐
│ year ┆ n ┆ s │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪════════╪══════════╡
│ 2007 ┆ 259065 ┆ 11348772 │
│ 2005 ┆ 255917 ┆ 10724928 │
│ 2010 ┆ 259891 ┆ 11513931 │
│ 2011 ┆ 260366 ┆ 11523552 │
│ 2015 ┆ 265129 ┆ 12305010 │
│ … ┆ … ┆ … │
│ 2014 ┆ 262630 ┆ 12004401 │
│ 2012 ┆ 260584 ┆ 11510928 │
│ 2003 ┆ 243470 ┆ 9458073 │
│ 2013 ┆ 260208 ┆ 11588178 │
│ 2001 ┆ 239301 ┆ 9034416 │
└──────┴────────┴──────────┘
13sec 69ms 280µs 83ns
@maxim-uvarov that's interesting. The Polars team just released version 0.40 and I have updated the dply dependencies on main. I am not sure if it solves your problem, but maybe you can give it a try; if it doesn't, the best thing is probably to open an issue with the Polars team that shows the difference between duckdb and polars-cli.
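If it comes to filing that upstream issue, a minimal standalone reproduction that calls the polars crate directly (taking dply and polars-cli out of the picture) usually helps. Here is a rough sketch; the crate version, feature names, and reader methods are assumptions based on the 0.38-era API and may need small adjustments for newer releases:

```rust
// Cargo.toml (assumed):
// polars = { version = "0.38", features = ["lazy", "csv", "parquet"] }
use polars::prelude::*;
use std::fs::File;

fn main() -> PolarsResult<()> {
    // Eagerly parse the CSV once (0.38-era builder methods; newer releases
    // moved these options onto CsvReadOptions).
    let mut df = CsvReader::from_path("Data7602DescendingYearOrder.csv")?
        .has_header(true)
        .finish()?;

    // Same aggregation as the dply pipeline: group by year, count rows and
    // sum geo_count.
    let out = df
        .clone()
        .lazy()
        .group_by([col("year")])
        .agg([
            col("geo_count").count().alias("count"),
            col("geo_count").sum().alias("sum"),
        ])
        .collect()?;
    println!("{out}");

    // Also write a Parquet copy so the fast path can be compared side by side.
    ParquetWriter::new(File::create("data.parquet")?).finish(&mut df)?;
    Ok(())
}
```

Timing a few runs of this binary with /usr/bin/time -l should show whether the huge system time follows the plain polars CSV path on its own.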
@vincev I updated to dply 0.3.2, yet it didn't help :( Thanks for the suggestion. I will create an issue in the polars-cli repository. But before that, I want to try Instruments to gather more details. It seems like a handy skill and an interesting activity.
I just found this: https://github.com/cmyr/cargo-instruments?tab=readme-ov-file
I ran this:
sudo flamegraph ...[-o my_flamegraph2.svg] -- dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()'
and got this flamegraph:
I ran it on my box; I rebuilt dply using:
[profile.release]
debug = true
I got the following:
It looks like in your case it is spending much more time doing memory work, but I am not sure what is causing that.
I also built it with
[profile.release]
debug = true
Here is the picture.
The timing variance is quite big:
> 1..10 | each {timeit {dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬────────────────────────╮
│ max │ 8sec 971ms 429µs 833ns │
│ min │ 153ms 956µs 42ns │
│ median │ 3sec 9ms 338µs 500ns │
╰────────┴────────────────────────╯
It looks like it is spending lots of time doing memory work. I don't really know what is causing that; maybe ask on the Polars discussions, someone may have experienced the same problem. It may be some macOS configuration setting that changed after the upgrade on your box.
Yesterday I collected a 'sys' template trace and loaded it into Instruments. The file is 120 MB and full of data. Though it's really difficult to understand where to look for insights, I'm not giving up: I'm chatting with GPT and comparing the profiles against ones from loading Parquet.
Also, I contacted Apple support but was rather disappointed by their answers. I sent the issue via Feedback Assistant, yet I don't expect that they will investigate it properly.
Today I'm going to ask in the Polars repository or in their chat.
I just got my hands on a used min-spec Mac Mini M1 (8 GB RAM, 256 GB SSD). It's still running Sonoma 14.4 and I haven't upgraded it.
I will hold off on upgrading in case there is any more repro to be done. But so far it seems like 14.4 probably doesn't have this problem:
dply with CSV:
~/polars-slow-repro> 1..10 | each {timeit {dply -c 'csv("./Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 141ms 195µs 208ns │
│ min │ 110ms 584µs 334ns │
│ median │ 113ms 445µs 791ns │
╰────────┴───────────────────╯
dply with parquet:
~/polars-slow-repro> 1..100 | each {timeit {dply -c 'parquet("./data.parquet") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬───────────────────╮
│ max │ 113ms 848µs 542ns │
│ min │ 67ms 554µs 584ns │
│ median │ 71ms 265µs 938ns │
╰────────┴───────────────────╯
Nushell with CSV:
~/polars-slow-repro> 1..100 | each {timeit {polars open Data7602DescendingYearOrder.csv | polars group-by year | polars agg (polars col geo_count | polars sum) | polars collect| null}} | | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬──────────────────╮
│ max │ 80ms 650µs 500ns │
│ min │ 58ms 476µs 458ns │
│ median │ 61ms 510µs 500ns │
╰────────┴──────────────────╯
Nushell with Parquet:
~/polars-slow-repro> 1..100 | each {timeit {polars open data.parquet | polars group-by year | polars agg (polars col geo_count | polars sum) | polars collect| null}} | | {max: ($in | math max), min: ($in | math min), median: ($in | math median)}
╭────────┬──────────────────╮
│ max │ 48ms 886µs 584ns │
│ min │ 24ms 792µs 459ns │
│ median │ 26ms 442µs 521ns │
╰────────┴──────────────────╯
@vincev Question for you: are you on Apple Silicon or Intel?
My pet hypothesis at the moment is that this has something to do with GoFetch vulnerability mitigations, but I'm not sure. The 'involuntary context switches' kind of lines up with that idea too.
Devyn and I discussed that it might be useful to use Xcode Instruments to obtain profiles of a system before it gets upgraded to a version that might cause the problem.
Here is a quick set of instructions to get started.
First, you need to install Xcode (it can be done via the App Store, but the download was very slow for me):
https://download.developer.apple.com/Developer_Tools/Xcode_15.4/Xcode_15.4.xip
In Xcode, choose:
After installation, if it refuses to work, you can try these commands (found in my history while I was troubleshooting):
/usr/bin/xcodebuild -version
xcrun xctrace
Then install Cargo Instruments:
https://github.com/cmyr/cargo-instruments
In your clone directory of dply, add this to Cargo.toml:
[profile.release]
debug = true
After that, from that folder, you can execute commands like:
cargo instruments -t 'Activity Monitor' -- dply -c 'csv("/Users/user/git/nushell.github.io/book/test/Data7602DescendingYearOrder.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()'
> cargo instruments -l
Xcode Instruments templates:
built-in abbrev
--------------------------
Activity Monitor
Allocations (alloc)
Animation Hitches
App Launch
Audio System Trace
CPU Counters
CPU Profiler
Core ML
Data Persistence
File Activity (io)
Game Memory
Game Performance
Leaks
Logging
Metal System Trace
Network
RealityKit Trace
SceneKit
Swift Concurrency
SwiftUI
System Trace (sys)
Tailspin
Time Profiler (time)
Zombies
I found those templates interesting:
Activity Monitor
io
sys
I can share my profiles. The largest one is over 100 MB. Apple says there might be some sensitive information in them, so sharing them publicly is discouraged, but I have nothing to hide. Therefore, I can easily share them with the participants of this conversation if they are interested.
I converted all the data to JSON. Reading from the JSON file, which is five times bigger than the CSV, takes 100 times less time.
> open Data7602DescendingYearOrder.csv | each {to json -r} | str join (char nl) | save data.json -f
> 1..50 | each {timeit {dply -c 'json("./data.json") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {median: ($in | math median), min: ($in | math min), max: ($in | math max), stddev: ($in | into int | into float | math stddev | into int | $'($in).ns' | into duration --unit sec)}
╭────────┬───────────────────╮
│ median │ 662ms 454µs 521ns │
│ min │ 540ms 567µs 916ns │
│ max │ 841ms 515µs 291ns │
│ stddev │ 69ms 853µs 152ns │
╰────────┴───────────────────╯
> 1..10 | each {timeit {dply -c 'csv("./data.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {median: ($in | math median), min: ($in | math min), max: ($in | math max), stddev: ($in | into int | into float | math stddev | into int | $'($in).ns' | into duration --unit sec)}
╭────────┬─────────────────────────╮
│ median │ 7sec 317ms 54µs 541ns │
│ min │ 226ms 195µs 708ns │
│ max │ 21sec 938ms 538µs 750ns │
│ stddev │ 5sec 845ms 369µs 36ns │
╰────────┴─────────────────────────╯
Whoa! If I remove the calculation of sum, everything flies, even with CSV!
> 1..10 | each {timeit {dply -c 'csv("./data.csv") | group_by(year) | summarize(count = n()) | show()' | complete | null}} | {median: ($in | math median), min: ($in | math min), max: ($in | math max), stddev: ($in | into int | into float | math stddev | into int | $'($in).ns' | into duration --unit sec)}
╭────────┬──────────────────╮
│ median │ 91ms 958µs 833ns │
│ min │ 89ms 849µs 917ns │
│ max │ 97ms 538µs 292ns │
│ stddev │ 2ms 284µs 986ns │
╰────────┴──────────────────╯
I guess I can compare profiles with the sum operation and without it.
With sum
❯ 1..50 | each {timeit {dply -c 'csv("./data.csv") | group_by(year) | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {median: ($in | math median), min: ($in | math min), max: ($in | math max), stddev: ($in | into int | into float | math stddev | into int | $'($in).ns' | into duration --unit sec)}
╭────────┬────────────────────────╮
│ median │ 1sec 337ms 902µs 979ns │
│ min │ 220ms 778µs 209ns │
│ max │ 17sec 860ms 1µs 584ns │
│ stddev │ 2sec 662ms 457µs 129ns │
╰────────┴────────────────────────╯
Without sum
❯ 1..50 | each {timeit {dply -c 'csv("./data.csv") | group_by(year) | summarize(count = n()) | show()' | complete | null}} | {median: ($in | math median), min: ($in | math min), max: ($in | math max), stddev: ($in | into int | into float | math stddev | into int | $'($in).ns' | into duration --unit sec)}
╭────────┬───────────────────────╮
│ median │ 94ms 813µs 416ns │
│ min │ 89ms 422µs 42ns │
│ max │ 1sec 12ms 978µs 375ns │
│ stddev │ 174ms 418µs 404ns │
╰────────┴───────────────────────╯
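One more experiment for anyone with a Rust toolchain handy: parse the CSV once and then time only the grouped sum in a loop, to see whether the variance sits in the parsing, in the aggregation, or only in their combination. A sketch with the same polars API-version caveats as the earlier one:

```rust
use polars::prelude::*;
use std::time::Instant;

fn main() -> PolarsResult<()> {
    // Parse once, outside the timed loop (0.38-era eager reader).
    let df = CsvReader::from_path("Data7602DescendingYearOrder.csv")?
        .has_header(true)
        .finish()?;

    // Time only the grouped sum on the already-parsed frame.
    for _ in 0..10 {
        let start = Instant::now();
        let out = df
            .clone()
            .lazy()
            .group_by([col("year")])
            .agg([col("geo_count").sum().alias("sum")])
            .collect()?;
        println!("{} groups in {:?}", out.height(), start.elapsed());
    }
    Ok(())
}
```

If these timings are stable while the end-to-end CSV runs are not, that would point at the parse-plus-sum combination (and its memory behavior) rather than the aggregation itself.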
polars-cli doesn't like sum either:
> 1..10 | each {timeit {polars-cli -c "select year, count(*) n from read_csv('Data7602DescendingYearOrder.csv') group by year;" | complete | null}} | {median: ($in | math median), min: ($in | math min), max: ($in | math max), stddev: ($in | into int | into float | math stddev | into int | $'($in).ns' | into duration --unit sec)}
╭────────┬──────────────────╮
│ median │ 45ms 243µs 354ns │
│ min │ 43ms 239µs 875ns │
│ max │ 49ms 414µs 167ns │
│ stddev │ 2ms 12µs 745ns │
╰────────┴──────────────────╯
> 1..10 | each {timeit {polars-cli -c "select year, count(*) n, sum(geo_count) s from read_csv('Data7602DescendingYearOrder.csv') group by year;" | complete | null}} | {median: ($in | math median), min: ($in | math min), max: ($in | math max), stddev: ($in | into int | into float | math stddev | into int | $'($in).ns' | into duration --unit sec)}
╭────────┬─────────────────────────╮
│ median │ 8sec 258ms 594µs 312ns │
│ min │ 3sec 202ms 705µs 958ns │
│ max │ 20sec 641ms 701µs 583ns │
│ stddev │ 5sec 21ms 84µs 752ns │
╰────────┴─────────────────────────╯
I inspected the geo_count column visually.
I converted the file to CSV once again in case it had artifacts:
open data.csv | save data1.csv
Still no luck.
For the sake of synchronization: these are screenshots from the nushell Discord with @devyn.
> @vincev Question for you: are you on Apple Silicon or Intel?
@devyn I am running on a 16GB Apple M1.
Thanks, I guess that potentially eliminates that. Hmm...
@vincev (or any other Rust-knowing participants): the great Ritchie from Polars advised trying some things to improve performance here.
Namely:
> I see they run on the default allocator. Setting jemalloc on macos might resolve it. Further if performance is important they should consider activating simd, performant and maybe compiling binaries on a nightly compiler with avx features activated.
If you ever have time for this issue again, would you please help me incorporate the proposed settings into dply so I can test them?
P.S. I'm sorry for occupying your attention.
Our polars master is on vacation, so I can't reach him with this request.
Well, I think I can do that myself. At least I'm trying now.
@maxim-uvarov okay, let me know how it goes. In the meantime I can add jemalloc and activate some of the other features, but you'll have to test it as I cannot reproduce it on my box.
Just pushed a branch with jemalloc in case it helps: https://github.com/vincev/dply-rs/tree/perf
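For anyone following along, wiring jemalloc into a Rust binary is usually just a global-allocator swap. A minimal sketch using the tikv-jemallocator crate (which may or may not be the jemalloc binding the perf branch actually uses):

```rust
// Cargo.toml (assumed): tikv-jemallocator = "0.5"
#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

// Route every heap allocation through jemalloc instead of the system
// allocator; nothing else in the binary has to change.
#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // ...the existing dply entry point runs unchanged on top of the new allocator.
}
```

If the regression lives in the macOS 14.5 system allocator's behavior under this workload, swapping the allocator sidesteps it entirely.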
@vincev thank you so much! Helped a lot! And yes, it solved the problem!!!
> 1..100 | each {timeit {dply -c 'csv("./data.csv") | summarize(count = n(), sum = sum(geo_count)) | show()' | complete | null}} | {median: ($in | math median), min: ($in | math min), max: ($in | math max), stddev: ($in | into int | into float | math stddev | into int | $'($in).ns' | into duration --unit sec)}
╭────────┬──────────────────╮
│ median │ 75ms 471µs 521ns │
│ min │ 71ms 384µs 250ns │
│ max │ 85ms 949µs 625ns │
│ stddev │ 2ms 635µs 56ns │
╰────────┴──────────────────╯
@vincev @devyn @fdncred, especially @vincev Thank you so much for your time, help and participation! I'm super grateful!!!
@maxim-uvarov you are welcome, I'll do a release with jemalloc then.
Hi! I was testing the polars plugin for nushell and found that it became super slow after upgrading to macOS 14.5. I checked the same benchmarks with dply and found that it also runs way too slow! Also, another person reproduced the slowness on macOS 14.5, and I tested the polars nushell plugin under macOS 14.2, where it runs quickly.
Dear @vincev, I'm wondering if you have a macOS machine nearby to test on, and whether you have any thoughts on the problem below.
I downloaded and unpacked this dataset.
I executed this command in the terminal:
So we can see that it took 5 seconds, while it should run in about 0.5s (based on my intuition).
Plus, I found that the timings are really unstable - sometimes much lower, sometimes much higher.
Though without sum = sum(geo_count) the timings are normal:
P.S. I also ran the Python script, and it runs with no problem whatsoever - very fast as usual: