pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/
Other
405 stars 35 forks source link

how to config max number of threads? #241

Closed tdhock closed 4 months ago

tdhock commented 11 months ago

hi @sorhawell is there a function that I can call to tell polars the max number of threads I want it to use? something like data.table::setDTthreads(4) if I only want to use 4 threads max. Thanks! I searched the documentation for "thread" and got a few hits but nothing that helped.

sorhawell commented 11 months ago

Good question

The default worker pool size is likely the double of number of cores. I did not know it was possible to modify, but I found this function in py-polars docs and learned you can use Sys.setenv("POLARS_MAX_THREADS"=3) before loading library polars

polars:::test_get_pool_size() is not published yet but it shows that envvar option seems to work

> polars:::test_get_pool_size()
[1] 8

Restarting R session...

> library(polars)

Restarting R session...

> Sys.setenv("POLARS_MAX_THREADS"=3)
> library(polars)
> polars:::test_get_pool_size()
[1] 3

The thread pool could be used I guess to limit cores, but it is not completely exact:

on top of this r-polars always reserves one extra thread for running the R interpreter. If not injecting R code into polars with $map() or $apply(), that thread will mostly be sleeping.

extendr also uses a thread for catching rust panics , but that should be very light weight

rust-polars could potentially here and there spawns a few threads for special tasks

tdhock commented 11 months ago

Hi again, thanks for the quick response. That is helpful to know that we can set an environment variable to set the max number of threads before polars is loaded, but is there a way to change it after it has already been loaded/attached via library(polars)? (similar to data.table::setDTthreads) Also that is beneficial to know that a thread is reserved for running the R interpreter. So does that mean the minimal number for POLARS_MAX_THREADS is 0? or is it 1? Would be nice to see some documentation about this topic at some point. For context I am trying to do some benchmarking of single threaded vs multi-threaded csv reading code. Also, is there a polars function for csv writing? I did not see docs about that.

sorhawell commented 11 months ago

So does that mean the minimal number for POLARS_MAX_THREADS is 0? or is it 1?

POLARS_MAX_THREADS could have been named POLARS_RAYON_POOL_SIZE. I don't know what would happen on 0. Either 0 is not allowed or rayon hang I guess. Rayon is a rust crate which implements a thread pool. 1 would be lowest meaningful value I guess.

in the rust-polars for the csv reader and other readers it is possible to specifically decide on threads. Likely because these readers come from external crates with their own threadding implementation. These are likely not bounded by POLARS_MAX_THREADS. I will look into if r-polars could easily expose these reader thread settings.

Let's also keep this issue open until these thread findings have been added to docs.

eitsupi commented 7 months ago

This issue may be necessary for the CRAN release.

The CRAN policy requires that the number of logical threads used on a CRAN machine be limited to 2. (For example, see this thread on the R package devel mailing list https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009454.html)

When submitting polars to CRAN, we may need to limit the number of threads, e.g. for tests.

eitsupi commented 5 months ago

We'll probably have to insert hidden code within every example to limit the number of threads. It's impractical to do it manually, so I think you'll need to rewrite the Rd file with a script like https://github.com/pola-rs/r-polars/issues/560#issuecomment-1837412864.

eitsupi commented 5 months ago

We should add this function. https://github.com/pola-rs/polars/blob/11531247bc824d2c0fe8c5ed5e8afc0b93e6db10/py-polars/src/functions/meta.rs#L21-L23

eitsupi commented 5 months ago

Useful resources:

It's really surprising that CRAN is forcing package developers to set the default to 2. Polars currently does not have a way to change the number of threads after loading, so we may have to opt in to something other than 2 any time use r-polars if we submit to CRAN.

eitsupi commented 5 months ago

I came up with a solution of the default thread number: introduce a new feature flag not_set_max_threads in the Rust library, which automatically sets the POLARS_MAX_THREADS=2 environment variable if that flag is not set and the POLARS_MAX_THREADS environment variable is empty via the .onLoad() function. If this feature is explicitly selected and built, POLARS_MAX_THREADS is not set and the maximum number of cores is defaulted.

Something like:

if (!check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == "") Sys.setenv(POLARS_MAX_THREADS = 2) 

@sorhawell @etiennebacher What do you think about this?

etiennebacher commented 5 months ago

Shouldn't the if condition be check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == ""?

We want to set the number of threads to 2 if not_set_max_threads is TRUE, right?

eitsupi commented 5 months ago

In my proposal not_set_max_threads is not included in the default feature. This means the following:

[features]
default = []
not_set_max_threads = []

In other words, for builds that do not explicitly include not_set_max_threads, the number of threads is automatically set to 2 unless the number of threads is specified by POLARS_MAX_THREADS. Conversely, for builds that include not_set_max_threads (GitHub build), the number of threads is automatically determined by Polars (current behave).

if (!check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == "") Sys.setenv(POLARS_MAX_THREADS = 2) 
sorhawell commented 5 months ago

Hi I was little confused of "not_set_max_threads", I think of it as "do_not_limit_max_threads". I hope I got it right.

"not_set_max_threads" on R-universe channel?

As I read the email exchange, the CRAN team was given plenty of good reasons from smart people, why not to enforce such a limit - but it is what it is. I recommend also running with "not_set_max_threads" for R-universe, as that is our main recommended channel to install from. I guess data.table would neither limit thread for any non CRAN release channel.

There are some other threads, we may need to deal with eventually

The POLARS_MAX_THREADS likely only apply to the polars thread pool size. Our own rust code also spawns threads. When ever using collect() an extra thread will be spawned to interface polars multiple threads with the single R session. It will sleep a lot though. Our in_background can spawn extra threads too. Maybe our interface code could be rewritten to be limited by the polars threadpool also. Not sure. Maybe it could cause some deadlock situation, because polars workers-thread must wait and sleep for the R-handler-thread and vice versa. Other rust crates may freely spawn threads too.

suggestion how about not CRAN is the general default?

I understand CRAN refuses to have any "IS_CRAN" environment variable, and therefore everybody else just have to hardcode CRAN policies to the default and revert backwards with environment variables. This opens up for some user/dev foot guns. Just a suggestion. How about using an automated CRAN release script/macro which alters the hardcoded defaults of any code bundle submitted to CRAN? Then CRAN can have it in one way, alongside a repository that retain the most sensible defaults. Of course there are also many wise / non-contoversial CRAN policies, which should just be adapted immediately everywhere.

sorhawell commented 5 months ago

@eitsupi it is you who put a lot of hard work and thought into this. If you see this as the best way, then I applaud that too.

eitsupi commented 5 months ago

Thank you for your reply.

I strongly oppose patching source code submitted to CRAN. That's because people who want that feature won't necessarily install it from CRAN, and it's very likely to mislead people who read the source code on GitHub.

In other words, the issue of whether the default number of threads is unlimited or 2 is irrelevant to CRAN, and I think users should be able to choose between them no matter where they install from. Since R packages don't have a mechanism like feature flags, I just suggested that it could be easily implemented using Rust's feature flags.

eitsupi commented 4 months ago

Once #693 is merged, users will be able to freely select features during installation, making it easier to add features.

eitsupi commented 4 months ago

Implemented in #720. The other crate had a function called "disable_*", so I named the feature as disable_auto_limit_max_threads. Of course this is enabled on binaries built on GitHub (includes R-universe).

tdhock commented 4 months ago

great, thanks! related: can you please document some way to query the number of threads that polars is using? I guess it is possible using polars:::test_get_pool_size() ?

tdhock commented 4 months ago

btw, gsoc'24 is coming up, so if you want to participate again with R project, please add an idea to our wiki, https://github.com/rstats-gsoc/gsoc2024/wiki/table%20of%20proposed%20coding%20projects

etiennebacher commented 4 months ago

I guess it is possible using polars:::test_get_pool_size() ?

pl$threadpool_size() is what you're looking for (polars:::test_get_pool_size() was removed)

btw, gsoc'24 is coming up

I won't be able to participate, maybe @eitsupi ?

sorhawell commented 4 months ago

@tdhock I was blessed with one more child this year, and somehow any spare time went up in smoke for me :) It was pleasure to mentor @Sicheng-Pan last year!

tdhock commented 4 months ago

Sicheng may consider mentoring this year

eitsupi commented 4 months ago

I don't think I'm good enough to be a mentor here.

Sicheng-Pan commented 4 months ago

I would like to help, but I'm sure that I am way less experienced than anyone here

tdhock commented 4 months ago

well I'm sure you know more about polars than I do! I could co-mentor again, if you need somebody with basic R expertise.