Closed tdhock closed 4 months ago
Good question
The default worker pool size is likely the double of number of cores. I did not know it was possible to modify, but I found this function in py-polars docs and learned you can use Sys.setenv("POLARS_MAX_THREADS"=3)
before loading library polars
polars:::test_get_pool_size()
is not published yet but it shows that envvar option seems to work
> polars:::test_get_pool_size()
[1] 8
Restarting R session...
> library(polars)
Restarting R session...
> Sys.setenv("POLARS_MAX_THREADS"=3)
> library(polars)
> polars:::test_get_pool_size()
[1] 3
The thread pool could be used I guess to limit cores, but it is not completely exact:
on top of this r-polars always reserves one extra thread for running the R interpreter. If not injecting R code into polars with $map() or $apply(), that thread will mostly be sleeping.
extendr also uses a thread for catching rust panics , but that should be very light weight
rust-polars could potentially here and there spawns a few threads for special tasks
Hi again, thanks for the quick response. That is helpful to know that we can set an environment variable to set the max number of threads before polars is loaded, but is there a way to change it after it has already been loaded/attached via library(polars)? (similar to data.table::setDTthreads) Also that is beneficial to know that a thread is reserved for running the R interpreter. So does that mean the minimal number for POLARS_MAX_THREADS is 0? or is it 1? Would be nice to see some documentation about this topic at some point. For context I am trying to do some benchmarking of single threaded vs multi-threaded csv reading code. Also, is there a polars function for csv writing? I did not see docs about that.
So does that mean the minimal number for POLARS_MAX_THREADS is 0? or is it 1?
POLARS_MAX_THREADS could have been named POLARS_RAYON_POOL_SIZE. I don't know what would happen on 0. Either 0 is not allowed or rayon hang I guess. Rayon is a rust crate which implements a thread pool. 1 would be lowest meaningful value I guess.
in the rust-polars for the csv reader and other readers it is possible to specifically decide on threads. Likely because these readers come from external crates with their own threadding implementation. These are likely not bounded by POLARS_MAX_THREADS. I will look into if r-polars could easily expose these reader thread settings.
Let's also keep this issue open until these thread findings have been added to docs.
This issue may be necessary for the CRAN release.
The CRAN policy requires that the number of logical threads used on a CRAN machine be limited to 2. (For example, see this thread on the R package devel mailing list https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009454.html)
When submitting polars to CRAN, we may need to limit the number of threads, e.g. for tests.
We'll probably have to insert hidden code within every example to limit the number of threads. It's impractical to do it manually, so I think you'll need to rewrite the Rd file with a script like https://github.com/pola-rs/r-polars/issues/560#issuecomment-1837412864.
Useful resources:
It's really surprising that CRAN is forcing package developers to set the default to 2. Polars currently does not have a way to change the number of threads after loading, so we may have to opt in to something other than 2 any time use r-polars if we submit to CRAN.
I came up with a solution of the default thread number: introduce a new feature flag not_set_max_threads
in the Rust library, which automatically sets the POLARS_MAX_THREADS=2
environment variable if that flag is not set and the POLARS_MAX_THREADS
environment variable is empty via the .onLoad()
function.
If this feature is explicitly selected and built, POLARS_MAX_THREADS
is not set and the maximum number of cores is defaulted.
Something like:
if (!check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == "") Sys.setenv(POLARS_MAX_THREADS = 2)
@sorhawell @etiennebacher What do you think about this?
Shouldn't the if
condition be check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == ""
?
We want to set the number of threads to 2 if not_set_max_threads
is TRUE
, right?
In my proposal not_set_max_threads
is not included in the default feature.
This means the following:
[features]
default = []
not_set_max_threads = []
In other words, for builds that do not explicitly include not_set_max_threads
, the number of threads is automatically set to 2 unless the number of threads is specified by POLARS_MAX_THREADS
.
Conversely, for builds that include not_set_max_threads
(GitHub build), the number of threads is automatically determined by Polars (current behave).
if (!check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == "") Sys.setenv(POLARS_MAX_THREADS = 2)
Hi I was little confused of "not_set_max_threads", I think of it as "do_not_limit_max_threads". I hope I got it right.
As I read the email exchange, the CRAN team was given plenty of good reasons from smart people, why not to enforce such a limit - but it is what it is. I recommend also running with "not_set_max_threads" for R-universe, as that is our main recommended channel to install from. I guess data.table would neither limit thread for any non CRAN release channel.
The POLARS_MAX_THREADS likely only apply to the polars thread pool size. Our own rust code also spawns threads. When ever using collect()
an extra thread will be spawned to interface polars multiple threads with the single R session. It will sleep a lot though. Our in_background
can spawn extra threads too.
Maybe our interface code could be rewritten to be limited by the polars threadpool also. Not sure. Maybe it could cause some deadlock situation, because polars workers-thread must wait and sleep for the R-handler-thread and vice versa. Other rust crates may freely spawn threads too.
I understand CRAN refuses to have any "IS_CRAN" environment variable, and therefore everybody else just have to hardcode CRAN policies to the default and revert backwards with environment variables. This opens up for some user/dev foot guns. Just a suggestion. How about using an automated CRAN release script/macro which alters the hardcoded defaults of any code bundle submitted to CRAN? Then CRAN can have it in one way, alongside a repository that retain the most sensible defaults. Of course there are also many wise / non-contoversial CRAN policies, which should just be adapted immediately everywhere.
@eitsupi it is you who put a lot of hard work and thought into this. If you see this as the best way, then I applaud that too.
Thank you for your reply.
I strongly oppose patching source code submitted to CRAN. That's because people who want that feature won't necessarily install it from CRAN, and it's very likely to mislead people who read the source code on GitHub.
In other words, the issue of whether the default number of threads is unlimited or 2 is irrelevant to CRAN, and I think users should be able to choose between them no matter where they install from. Since R packages don't have a mechanism like feature flags, I just suggested that it could be easily implemented using Rust's feature flags.
Once #693 is merged, users will be able to freely select features during installation, making it easier to add features.
Implemented in #720.
The other crate had a function called "disable_*", so I named the feature as disable_auto_limit_max_threads
.
Of course this is enabled on binaries built on GitHub (includes R-universe).
great, thanks!
related: can you please document some way to query the number of threads that polars is using?
I guess it is possible using polars:::test_get_pool_size()
?
btw, gsoc'24 is coming up, so if you want to participate again with R project, please add an idea to our wiki, https://github.com/rstats-gsoc/gsoc2024/wiki/table%20of%20proposed%20coding%20projects
I guess it is possible using
polars:::test_get_pool_size()
?
pl$threadpool_size()
is what you're looking for (polars:::test_get_pool_size()
was removed)
btw, gsoc'24 is coming up
I won't be able to participate, maybe @eitsupi ?
@tdhock I was blessed with one more child this year, and somehow any spare time went up in smoke for me :) It was pleasure to mentor @Sicheng-Pan last year!
Sicheng may consider mentoring this year
I don't think I'm good enough to be a mentor here.
I would like to help, but I'm sure that I am way less experienced than anyone here
well I'm sure you know more about polars than I do! I could co-mentor again, if you need somebody with basic R expertise.
hi @sorhawell is there a function that I can call to tell polars the max number of threads I want it to use? something like data.table::setDTthreads(4) if I only want to use 4 threads max. Thanks! I searched the documentation for "thread" and got a few hits but nothing that helped.