pola-rs / r-polars

Polars R binding
https://pola-rs.github.io/r-polars/
Other
465 stars 36 forks source link

CRAN Release? #80

Closed eitsupi closed 3 months ago

eitsupi commented 1 year ago

The Rust version of Debian testing will soon be 1.65, which may allow us to build rpolars tuned to build in the R-universe. (Note that the macOS builder's Rust version may be older than Debian's. eitsupi/prqlr#94) https://tracker.debian.org/pkg/rustc https://qa.debian.org/excuses.php?experimental=1&package=rustc

@sorhawell Could you potentially do a CRAN release? Probably need to modify build scripts (like #29) and documentation (Of course I would like to contribute to that.).

sorhawell commented 1 year ago

@mrwunderbar666 many thanks to take a stab at this!! <3 I'm currently on vacation and will return in a few days.

mrwunderbar666 commented 1 year ago

After doing some research, I saw that the hellorust example comes with a lot of the required solutions for CRAN packaging. Therefore, I will ping you when I implemented a cleaner solution

eitsupi commented 1 year ago

Do you know of any packages on CRAN that download tarballs from the web? In other words, all existing packages on CRAN that use Rust (gifski, string2path, hellorust, rshift, tok) vendor dependent crates. But that is not possible here (> 10MB), and it is unclear how to work around it. I think that is the only problem here.

The big question is where and who should manage the tarballs anyway (I don't want to do that).

sorhawell commented 1 year ago

I think this won't be necessary, because cargo vendor includes all files for offiline compilation

Oh nice I did not think of that.

if this requires a larger tarball a modestly increased limit can be requested at submission.

We can hope for a particle physicist reviewing if 340% is a modest increase :)

eitsupi commented 1 year ago

We can hope for a particle physicist reviewing if 340% is a modest increase :)

Note that the 12MB prqlr was rejected.

yutannihilation commented 1 year ago

On prqlr's case, it might be worth trying to get rid of winapi, which I struggled a while ago. It's huge.

https://github.com/extendr/libR-sys/issues/107

winapi is a dependency of stacker, which is a dependency of chumsky. The dev version of stacker already switched it to windows-sys. I'm not sure how lighter windows-sys is than winapi, though.

https://github.com/rust-lang/stacker/pull/76

On polars' case, it's a crossterm's dependency, and it seems it won't be replaced very soon, unfortunately (Honestly, I'm not sure why the Rust code dedicated for an R package needs to consider Windows terminals).

https://github.com/crossterm-rs/crossterm/pull/718

mrwunderbar666 commented 1 year ago

Do you know of any packages on CRAN that download tarballs from the web? In other words, all existing packages on CRAN that use Rust (gifski, string2path, hellorust, rshift, tok) vendor dependent crates. But that is not possible here (> 10MB), and it is unclear how to work around it. I think that is the only problem here.

The big question is where and who should manage the tarballs anyway (I don't want to do that).

Ok, good point. I would have a unprecedented solution (at least I think it would be). One could use Zenodo to upload the vendored tarball, and then the package installer would retrieve the tarball from there

From their website

Why use Zenodo? Safe — your research is stored safely for the future in CERN’s Data Centre for as long as CERN exists.

sorhawell commented 1 year ago

Just learned about the zenodo project. It looks cool.

eitsupi commented 1 year ago

@yutannihilation Thank you for the detailed look at dependencies!

On prqlr's case, it might be worth trying to get rid of winapi, which I struggled a while ago. It's huge.

extendr/libR-sys#107

winapi is a dependency of stacker, which is a dependency of chumsky. The dev version of stacker already switched it to windows-sys. I'm not sure how lighter windows-sys is than winapi, though.

rust-lang/stacker#76

I didn't know about this. thank you. However, I think removing this dependency is beyond my capabilities.

On polars' case, it's a crossterm's dependency, and it seems it won't be replaced very soon, unfortunately (Honestly, I'm not sure why the Rust code dedicated for an R package needs to consider Windows terminals).

crossterm-rs/crossterm#718

I think r-polars has pulled in unnecessary functionality of rust-polars (it looks quite different compared to python-polars).

I think removing these unnecessary features would definitely reduce the size. That said, I still don't think we can get around the 10MB limit.


I really don't want to use Zenodo. Every time we update the Cargo.lock file we have to upload a huge tarball to there?

mrwunderbar666 commented 1 year ago

I really don't want to use Zenodo. Every time we update the Cargo.lock file we have to upload a huge tarball to there?

I know it's not the most elegant solution. But I my idea is to only update the tarball with vendored source code when there is a CRAN release. This would also imply that the CRAN releases are rare. Users who want to use the latest features would use R-Universe instead

sorhawell commented 1 year ago

I really don't want to use Zenodo. Every time we update the Cargo.lock file we have to upload a huge tarball to there?

I'm neutral, if it works and accepted, it seems ok. I see there is some github integration, I guess it could be incorporated in a CRAN release workflow.

But I my idea is to only update the tarball with vendored source code when there is a CRAN release.

I suppose it would be a new unique upload and any cran release would just be hardcoded* in the Makevars to point to the tarball it needs for installation.

(*) hardcoded url should be set by a release script or in some config file.

I think r-polars has pulled in unnecessary functionality of rust-polars (it looks quite different compared to python-polars).

py-polars and r-polars implementations are quite different when bridging multi-threading to the interpreter. py-polars can rely much more on PyO3 and python native threads, where r-polars e.g. uses crate flume and custom code. r-polars now also supports background queries which does not block the R interpreter and also supports having a pool of R interpreters to process R code in parallel. The ipc-channel crate is used to send R objects across processes via shared memory.

Also the dependency stack of extendr-api and PyO3 is not the same.

My impression is that pulling in a handful low-level crates does not change the final binary much. It is the ~220 crates pulled in total which makes a big binary.

eitsupi commented 1 year ago

I know it's not the most elegant solution. But I my idea is to only update the tarball with vendored source code when there is a CRAN release. This would also imply that the CRAN releases are rare. Users who want to use the latest features would use R-Universe instead

As far as I know, there is no way to tell when a package is built whether it is on R-universe or on CRAN. (R-universe builders are configured to intentionally remove all environment variables specific to GitHub Actions to avoid being distinguished from CRAN.) In other words, on R-universe, builds and tests run the same as on CRAN.

eitsupi commented 1 year ago

@sorhawell Thanks for correcting my lack of understanding. However, I believe that currently (or a while ago) r-polars was also building apparently unrelated crates (polars features) such as polars-sql. Do you think those have been completely removed?

sorhawell commented 1 year ago

As far as I know, there is no way to tell when a package is built whether it is on R-universe or on CRAN.

I have no proof of concept but since we get to choose what we submit for CRAN, we could have a github workflow something like this:

etiennebacher commented 1 year ago

Sorry if I missed something, but I don't understand how using Zenodo helps for CRAN submission. I understood that CRAN wants the package build to be in total isolation so that there's no download of external dependencies when the user installs the package, which is why they said that even downloading crates from crates.io was not an option.

Why is Zenodo a solution in this case? Is it considered as a "trusted source" by CRAN?

sorhawell commented 1 year ago

It seems from prsqlr that a tarball size of 12MB is unacceptably above a CRAN limit 10MB. We are allegedly at ~30MB? We cannot slim down to 10MB I think.

@etiennebacher Was there a line saying "in total isolation" in new rules? I think I saw "no downloading of binaries."

sorhawell commented 1 year ago

such as polars-sql. Do you think those have been completely removed?

it seems rust-polars features ipc, parquet, csv and json trigger polars-sql dependency. I don't know if we can cherry-pick sub features within rust-polars. I tried naively to replace e.g. "parquet" with the sub features ["polars-io", "polars-core/parquet", "polars-lazy/parquet", "polars-io/parquet", "polars-sql/parquet"], but the compiler said I could not do that, I think.

mrwunderbar666 commented 1 year ago

I suggested using zenodo, after reading the CRAN submission policies in detail. Here are the relevant sections:

If the sources are too large, it is acceptable to download them as part of installation, but do ensure that the download is of a fixed version rather than the latest. Only as a last resort and with the agreement of the CRAN team should a package download pre-compiled software.

Specifically for rust it says:

Packages wishing to use Rust code should either

include the code in the package, or download a specific version from a secure and reliable site and check that the download is the expected code by some sort of checksum. The expected checksum needs to be embedded in the source package.

etiennebacher commented 1 year ago

@etiennebacher Was there a line saying "in total isolation" in new rules? I think I saw "no downloading of binaries."

You're right @sorhawell, I misread the guidelines for Rust. However, as @mrwunderbar666 said above, the CRAN policies say

Packages wishing to use Rust code should either include the code in the package, or download a specific version from a secure and reliable site

but AFAICT nowhere in CRAN policies do they mention Zenodo. So I think before anything we should ask them directly if this would be acceptable to them (and if they agree with this submitting strategy overall).

eitsupi commented 1 year ago

I believe this is where Zenodo's name came up. (It is truly surprising that they equate Crates.io with GitHub.)

https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009332.html

Yutani,

I'm not quite sure your reading fully matches the intent of the policy. Cargo.lock is not sufficient, it is expected that the package will provide all the sources, it is not expected to use cargo to resolve them from random (possibly inaccessible) places. So the package author is expected to either include the sources in the package or (if prohibitive due to extreme size) have a release tar ball available at a fixed, secure, reliable location (I was recommending Zenodo.org for that reason - GitHub is neither fixed nor reliable by definition).

Based on that, I'm not sure I fully understand the scope of your proposal for improvement. Carlo.lock is certainly the first step that the package author should take in creating the distribution tar ball so you can fix the versions, but it is not sufficient as the next step involves collecting the related sources. We don't want R users to be involved in that can of worms (especially since the lock file itself provides no guarantees of accessibility of the components and we don't want to have to manually inspect it), the package should be ready to be used which is why it has to do that step first. Does that explain the intent better? (In general, the downloading at install time is actually a problem, because it's not uncommon to use R in environments that have no Internet access, but the download is a concession for extreme cases where the tar balls may be too big to make it part of the package, but it's yet another can of worms...).

Cheers, Simon

sorhawell commented 1 year ago

@mrwunderbar666 I'm sorry to say when bump r-polars dependency to rust-polars to 0.32.1 the minimal required version of rustc is now 1.70 for without SIMD and rust nightly-2023-07-27 for with. CRAN only supports 1.65 or 1.66 or something like that.

I think we have hit another hard wall. rust-polars have made no promise of only using the about 2 years older rustc versions released via debian as CRAN uses.

Thank you for looking into alternatives, but I think this is show stopper for now.

eitsupi commented 1 year ago

rust-polars have made no promise of only using the about 2 years older rustc versions released via debian as CRAN uses.

To Debian's honor: Rust 1.66 was released in December 2022, which is still less than a year old. https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1660-2022-12-15

mrwunderbar666 commented 1 year ago

Thanks for looking into this, I haven't considered the version of rustc :/

I'll be looking around for other ways of getting polars to CRAN, but I guess for now this should not be a priority for package development. As a Linux user, I could install the package from r-universe without problems, so it would suffice for now

EDIT: should not be a priority

eitsupi commented 1 year ago

@mrwunderbar666

I guess for now this should be a priority for package development.

I would like to agree with that, but I think we did our best (see #308 for example) and as a result it looks impossible.

CRAN says https://github.com/pola-rs/r-polars/issues/80#issuecomment-1646735125

This was seen using 1500% CPU on the Fedora check system during installation. As in

  • checking whether package ‘polars’ can be installed ... [6192s/525s] ERROR

which has also exceeded the allowed CPU time limit on that system.

Checking in R-universe, the polars package does indeed have a longer build time than arrow and duckdb. In other words, it is very likely that it cannot be built in the time allowed by CRAN.

eitsupi commented 1 year ago

For example, the arrow package turns off multiple flags in the build on CRAN (I believe), so it may be possible to reduce build time by reducing the rust-polars features in the same way here. So someone may want to work on that.

Edit: I think the arrow package is installed almost completely by default now.

mrwunderbar666 commented 1 year ago

@eitsupi Sorry, mistyped! I meant it should not be a priority

eitsupi commented 1 year ago

In other words, all existing packages on CRAN that use Rust (gifski, string2path, hellorust, rshift, tok) vendor dependent crates. But that is not possible here (> 10MB), and it is unclear how to work around it.

CRAN has approved 12MB of prqlr source code. Likewise in polars, source package size may no longer be an issue.

Since Debian may soon reach Rust 1.70, it may be worthwhile to try again to submit to CRAN in a few months.

My concern is that the build time is too long... Checking on R-universe, duckdb was built in 30 minutes while polars took over 50 minutes. https://github.com/r-universe/hannes/actions/runs/6105495163 https://github.com/r-universe/pola-rs/actions/runs/6137342310

eitsupi commented 11 months ago

It seems that the initial arrow package on CRAN did not include libarrow. https://github.com/cran/arrow/tree/7d575c8eacb1757dcb38cc4115bf0ece39690b79

In fact, we can confirm that nothing works when this version is installed.

> remotes::install_version("arrow", version = "0.14.1")
Downloading package from url: https://packagemanager.posit.co/cran/__linux__/jammy/latest/src/contrib/Archive/arrow/arrow_0.14.1.tar.gz
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

1: All
2: CRAN packages only
3: None
4: withr (2.5.0 -> 2.5.1) [CRAN]

Enter one or more numbers, or an empty line to skip updates: 1
withr      (2.5.0 -> 2.5.1) [CRAN]
bit        (NA    -> 4.0.5) [CRAN]
tidyselect (NA    -> 1.2.0) [CRAN]
bit64      (NA    -> 4.0.5) [CRAN]
assertthat (NA    -> 0.2.1) [CRAN]
Installing 5 packages: withr, bit, tidyselect, bit64, assertthat
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://packagemanager.posit.co/cran/__linux__/jammy/latest/src/contrib/withr_2.5.1.tar.gz'
Content type 'binary/octet-stream' length 228097 bytes (222 KB)
==================================================
downloaded 222 KB

trying URL 'https://packagemanager.posit.co/cran/__linux__/jammy/latest/src/contrib/bit_4.0.5.tar.gz'
Content type 'binary/octet-stream' length 1130781 bytes (1.1 MB)
==================================================
downloaded 1.1 MB

trying URL 'https://packagemanager.posit.co/cran/__linux__/jammy/latest/src/contrib/tidyselect_1.2.0.tar.gz'
Content type 'binary/octet-stream' length 219931 bytes (214 KB)
==================================================
downloaded 214 KB

trying URL 'https://packagemanager.posit.co/cran/__linux__/jammy/latest/src/contrib/bit64_4.0.5.tar.gz'
Content type 'binary/octet-stream' length 475413 bytes (464 KB)
==================================================
downloaded 464 KB

trying URL 'https://packagemanager.posit.co/cran/__linux__/jammy/latest/src/contrib/assertthat_0.2.1.tar.gz'
Content type 'binary/octet-stream' length 52457 bytes (51 KB)
==================================================
downloaded 51 KB

* installing *binary* package ‘withr’ ...
* DONE (withr)
* installing *binary* package ‘bit’ ...
* DONE (bit)
* installing *binary* package ‘assertthat’ ...
* DONE (assertthat)
* installing *binary* package ‘tidyselect’ ...
* DONE (tidyselect)
* installing *binary* package ‘bit64’ ...
* DONE (bit64)

The downloaded source packages are in
        ‘/tmp/RtmpcEbl7u/downloaded_packages’
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
* installing *binary* package ‘arrow’ ...
* DONE (arrow)
> library(arrow)
Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

The following objects are masked from ‘package:base’:

    array, table

> set.seed(24)
> tab <- arrow::table(x = 1:10, y = rnorm(10))
Error in Table__from_dots(dots, schema) :
  Cannot call Table__from_dots(). Please use arrow::install_arrow() to install required runtime libraries.

Similarly, it may be worthwhile here to reduce functionality from the default feature to the bare minimum to reduce build time on CRAN......

eitsupi commented 10 months ago

I'm still considering the possibility of a CRAN release, but given the frequency of breaking changes with frequent breaking changes upstream, maybe it's better to avoid CRAN releases for a while?

etiennebacher commented 10 months ago

I don't see why this is a problem as long as we advertise well that there are still a lot of breaking changes upstream. The way I see it, releasing on CRAN is just a way to make it easier for people to install the package. We should emphasize that it shouldn't be used in other packages for now, but I don't see the number of breaking changes as a blocker.

If you still prefer to wait for upstream polars to be more stable, they plan to release 1.0 by the end of the year: https://github.com/pola-rs/polars/issues/6616#issuecomment-1736901318

eitsupi commented 10 months ago

I don't see why this is a problem as long as we advertise well that there are still a lot of breaking changes upstream.

My intention was that it would take a lot of effort to fix with reverse dependency checking.

eitsupi commented 8 months ago

I think almost all the issues have been resolved, so if Rust on CRAN catches up to MSRV of polars, I would like to submit to CRAN again.

However, Rust on CRAN has been stagnant for more than half a year due to the outdated Fedora server becoming a bottleneck, so we don't know when that will happen. https://stat.ethz.ch/pipermail/r-package-devel/2023q4/010147.html

etiennebacher commented 8 months ago

Supposing that the Rust version of CRAN matches the MSRV of polars, would we actually be ready to submit? If I remember correctly one of the main issues was also the package size, do we have a clear strategy for this?

eitsupi commented 8 months ago

I think the package size issue has now been resolved. The size of prqlr 0.7.0 is currently 15MB, making it the third largest package on CRAN, but it was released quickly.

etiennebacher commented 8 months ago

Wasn't polars much larger than that? This comment mentioned 34MB

eitsupi commented 8 months ago

Wasn't polars much larger than that? This comment mentioned 34MB

The point here is that packages larger than 5MB are usually not allowed. I don't think there is any difference between 15MB and 34MB in the sense that it exceeds 5MB. (Maybe there is, but we won't know until we ask CRAN)

david-cortes commented 6 months ago

Wasn't polars much larger than that? This comment mentioned 34MB

The point here is that packages larger than 5MB are usually not allowed. I don't think there is any difference between 15MB and 34MB in the sense that it exceeds 5MB. (Maybe there is, but we won't know until we ask CRAN)

Not sure if this is applicable to rust packages, but if e.g. header files are required from other libraries and they make up substantial space, one potential workaround could be to create multiple packages containing only those headers under inst, and then add those packages as dependencies under LinkingTo in DESCRIPTION.

eitsupi commented 6 months ago

Thank you for your comment. Cargo supports vendoring very well and the CRAN team is aware of the package size issue, so it's probably OK for Rust to just include all the source code in the package.

eitsupi commented 6 months ago

Given the low update frequency of Rust on CRAN (they are currently using Fedora 36 from EOL for almost a year now, and they are at Rust 1.69 with no plans to be updated in the future) and the high update frequency of MSRV on (Rust) Polars, I don't know if we should submit polars to CRAN even if they temporarily meet the MSRV requirements.

One option would be to start CRAN releases after the infrastructure on CRAN has improved and there is a guarantee that Rust will be regularly updated. (See https://github.com/RConsortium/r-repositories-wg)

etiennebacher commented 6 months ago

(See RConsortium/r-repositories-wg)

I don't see any discussion on Rust on this repo, am I missing something or do you mean that we should initiate the discussion there?

eitsupi commented 6 months ago

@etiennebacher This is a problem with the current CRAN infrastructure, not with Rust support. In other words, the current CRAN may be stuck with the Rust version 1.69 forever because it does not know if Fedora 36 will be used until a week, a year, or 10 years from now. I suspect this situation will continue until the current infrastructure, which volunteers have spent a great deal of effort to maintain, is completely renewed.

If updated to container-based IaC, at least it will not happen that Linux distributions that have reached EOL will remain in use for years to come.

etiennebacher commented 3 months ago

I think we should close this issue. The main blocker is the Rust version on the current CRAN infrastructure and there's nothing we can do about this. We can reopen when there's some news about this. @eitsupi do you agree?