meztez / bigrquerystorage

R Client for BigQuery Storage API
Apache License 2.0
19 stars 3 forks source link

Submit to CRAN #9

Closed meztez closed 8 months ago

meztez commented 4 years ago

At some point in the future, not now, now is not the time.

jeroen commented 11 months ago

Before you can submit to CRAN you will have to request them to add the prerequisites on their Debian servers, otherwise you can't pass the incoming checks.

I think the best way is to send an email to CRAN@R-project.org requesting to add the libgrpc++-dev and protobuf-compiler-grpc packages to their server. The other deps are already there, see this list.

meztez commented 11 months ago

Sent

gaborcsardi commented 11 months ago

I suspect the we'll need to fix this as well: https://github.com/meztez/bigrquerystorage/actions/runs/7168946042/job/19518216883#step:8:125

Files which contain pragma(s) suppressing diagnostics:
  'src/google/api/annotations.pb.h' 'src/google/api/client.pb.h'
  'src/google/api/field_behavior.pb.h' 'src/google/api/http.pb.h'
  'src/google/api/launch_stage.pb.h' 'src/google/api/resource.pb.h'
  'src/google/cloud/bigquery/storage/v1/arrow.pb.h'
  'src/google/cloud/bigquery/storage/v1/avro.pb.h'
  'src/google/cloud/bigquery/storage/v1/protobuf.pb.h'
  'src/google/cloud/bigquery/storage/v1/storage.pb.h'
  'src/google/cloud/bigquery/storage/v1/stream.pb.h'
  'src/google/cloud/bigquery/storage/v1/table.pb.h'
  'src/google/rpc/status.pb.h'

E.g. by porting the fix from the old configure.R before Jeroen's PR.

jeroen commented 11 months ago

I didn't see these warnings on winbuilder so they may not be checked on Windows: https://win-builder.r-project.org/0GJ2L4z4Zm9u/

jeroen commented 11 months ago

CRAN has added libgrpc++-dev and protobuf-compiler-grpc: https://github.com/r-devel/rcheckserver/commit/380349aa02c76a0c3565aaf4f696e74413eeabea

I think your package is ready to be submitted.

Note that it may still take a while, the first submission will be scheduled for manual review by a cran assistent which takes a while and sometimes requires a few rounds of resubmitting to some requests.

meztez commented 11 months ago

@jeroen @gaborcsardi

Got this answer from CRAN. I'll look into it after the holidays.


Prof Brian Ripley ripley@stats.ox.ac.uk 06 h 10 (il y a 2 heures) à Bruno, CRAN, Tomas

This downloads pre-compiled software on macOS, contrary to the CRAN policy (and Dr Ligges confirms permission was not given).

The README says

'Windows

The package will automatically download a static build of the system requirements during installation.'

although that is not clear from the check log. That too would be a policy violation.

The package has been removed from CRAN.


jeroen commented 11 months ago

Ugh, I guess you have to ask them for permission to download libs (grpc is not available on their mac/win servers)

meztez commented 11 months ago

@jeroen Oh, well I guess I'll ask for permission then. Thanks for the input, it gives me confidence to follow up with CRAN. I don't know if I would have taken that step by myself.

jeroen commented 11 months ago

Well it is not certain they will agree. Uwe Ligges is usually helpful and but Brian Ripley can be very unreasonable. So the best path is probably to ask Uwe and explain him the grpc library is not available on their Mac or Windows servers.

gaborcsardi commented 11 months ago

the grpc library is available on their Mac or Windows servers.

not available, you mean?

Btw. I just opened https://bugs.r-project.org/show_bug.cgi?id=18641

jeroen commented 11 months ago

Yes sorry typo 🤦

By the way, in the mean while users can install binary packages from: https://meztez.r-universe.dev/bigrquerystorage

meztez commented 11 months ago

I sent this:

Hi Dr Ligges,

In your opinion, what would be the best way forward to respect CRAN policy and get the package published?

Happy holidays to you and the people in your life.

Bruno

Context :

Some members of the R community have communicated to me their wish to see package bigrquerystorage (https://github.com/meztez/bigrquerystorage) published to CRAN. Github users -jeroen, -gaborcsardi and -hadley have helped me with compatibility on all platforms. (https://github.com/r-dbi/bigrquery/pull/425, https://github.com/r-dbi/bigrquery/issues/381)

It is a companion package to bigrquery that uses the BigQuery Storage API based on their gRPC protocol instead of their regular REST API. It provides better performance for larger datasets.

Unfortunately, it has a dependency (gRPC) that is not yet available on Windows and Mac CRAN servers, as Jeroen tells me. To remedy that, I understand the configure script was modified to install precompiled binaries for Mac and Windows. I was advised to ask for explicit permission after receiving Dr Ripley's communication.

On Mac https://github.com/meztez/bigrquerystorage/blob/b377ea879efb33b8a33a3e276667a40ae207e81d/configure#L45):

On Windows https://github.com/meztez/bigrquerystorage/blob/main/tools/winlibs.R

  url <- if(grepl("aarch", R.version$platform)){
    "https://github.com/r-windows/bundles/releases/download/grpc-1.51.1/grpc-1.51.1-clang-aarch64.tar.xz"
  } else if(grepl("clang", Sys.getenv('R_COMPILED_BY'))){
    "https://github.com/r-windows/bundles/releases/download/grpc-1.51.1/grpc-1.51.1-clang-x86_64.tar.xz"
  } else if(getRversion() >= "4.3") {
    "https://github.com/r-windows/bundles/releases/download/grpc-1.51.1/grpc-1.51.1-ucrt-x86_64.tar.xz"
  } else {
    "https://github.com/rwinlib/grpc/archive/refs/tags/1.33.2-2.tar.gz"
  }
  download.file(url, basename(url), quiet = TRUE)
gaborcsardi commented 9 months ago

@meztez https://bugs.r-project.org/show_bug.cgi?id=18641 has been fixed and Rtools43 has grpc now. So we could try to give this another go, if you can gather the strength for another submission.

We would need to update the build process on Windows slightly, and then re-submit. I can submit a PR for the changes.

Also, if you don't have strength to do it, I completely understand. In this case I can do the first submission, and then once it is safely on CRAN, pass the package to you in the next submission.

eitsupi commented 9 months ago

Note that arrow may be removed from CRAN soon (apache/arrow#39806). Perhaps you would prefer to make your submission after the arrow issue is resolved.......

It might also be an interesting improvement to remove arrow from the required dependencies and replace it with nanoarrow.

meztez commented 9 months ago

@eitsupi

Note that arrow may be removed from CRAN soon (apache/arrow#39806). Perhaps you would prefer to make your submission after the arrow issue is resolved.......

It might also be an interesting improvement to remove arrow from the required dependencies and replace it with nanoarrow.

The arrow dependencies that would have to be replaced are: https://github.com/meztez/bigrquerystorage/blob/d2f5c0dd8067560a9e80dd110cc9af2d0b5aec16/R/bqs_download.R#L112, https://github.com/meztez/bigrquerystorage/blob/d2f5c0dd8067560a9e80dd110cc9af2d0b5aec16/R/bqs_download.R#L123, https://github.com/meztez/bigrquerystorage/blob/d2f5c0dd8067560a9e80dd110cc9af2d0b5aec16/R/bqs_download.R#L116

I've taken a look at nanoarrow and could not find drop in replacement at first glance. Would you have further suggestion on how this could be done?

What would be the benefits of using nanoarrow in this instance?

Thank you

eitsupi commented 9 months ago

What would be the benefits of using nanoarrow in this instance?

The biggest advantage is that there is no need to have the huge and time-consuming to build arrow package as a dependency. The R ADBC-related packages and the Arrow-related methods of the DBI package are all based on nanoarrow. https://arrow.apache.org/adbc/current/r/index.html https://dbi.r-dbi.org/articles/DBI-arrow.html

I've taken a look at nanoarrow and could not find drop in replacement at first glance. Would you have further suggestion on how this could be done?

The correspondence between the objects of the arrow package and those of the nanoarrow package may be understood from this file. https://github.com/apache/arrow-nanoarrow/blob/a41fb2be59f44b6bd6bf5a014aef253041fdc10f/r/R/pkg-arrow.R

Since the Table equivalent object exists only in Apache Arrow's C++ implementation, I believe nanoarrow uses the struct type nanoarrow_array_stream to exchange data in Table format.

meztez commented 9 months ago

@eitsupi I've tried to work out how to use a two elements raws list (schema, data) and turn into a data.frame using nanoarrow. I cannot figure it out at the moment, I'm sorry.

> dt <- as_nanoarrow_array(raws[[2]], schema = infer_nanoarrow_schema(raws[[1]])) |> as.data.frame()
Error in as.data.frame.nanoarrow_array(as_nanoarrow_array(raws[[2]], schema = infer_nanoarrow_schema(raws[[1]]))) : 
  Can't convert array with type uint8 to data.frame()
> 
thisisnic commented 9 months ago

@paleolimbot may be able to help here

paleolimbot commented 9 months ago

I'm guessing here that raws[[1]] is the serialized record batch and raws[[2]] is the serialized schema? nanoarrow has an IPC reader but it hasn't been wired up in the R package yet. Until it has, I'm not sure it can help you here!

gaborcsardi commented 9 months ago

@paleolimbot What exactly do you mean by "IPC reader"? Does that basically mean reading the data from a socket?

Maybe we could do something simple, like writing all the data out to disk, and then reading it from there?

meztez commented 9 months ago

Could we push directly to a nanoarrow object instead of doing this C++ vector push to R raws to arrow table?

I'm thinking pushing directly into a nanoarrow object could save some time?

Do we need arrow to convert back into a R data.frame?

meztez commented 9 months ago

Until arrow is back on CRAN, I think we will have to hold. https://win-builder.r-project.org/incoming_pretest/bigrquerystorage_1.0.0_20240217_202902/Debian/00check.log https://github.com/apache/arrow/issues/39806

paleolimbot commented 9 months ago

Sorry I missed the ping!

What exactly do you mean by "IPC reader"?

I believe that "IPC" (which can be used for inter-process communication, like via a socket, but works anywhere you might want to stream a table from a server to a client) is the format that you are currently using the arrow package to read (because when I browsed the code I saw a RecordBatchStreamReader). In order to use nanoarrow to replace that bit of code, I (or somebody) would need to add a binding for nanoarrow (C)'s "ArrowIpcArrayStreamReader". The Python version of that is here: https://github.com/apache/arrow-nanoarrow/pull/388 .

Could we push directly to a nanoarrow object

It seems like your input here is a fully downloaded raw vector of the entire stream. When the IPC reader has R bindings, that would give you a nanoarrow_array_stream, on which you could call as.data.frame(). Technically you wouldn't have to materialize the bytes in advance (you could stream the input).

Until arrow is back on CRAN, I think we will have to hold.

I also had trouble submitting a small nanoarrow patch because it had arrow in Suggests 🙁

paleolimbot commented 9 months ago

I gave the implementation in R a go, which should make it into the next release in the next two months or so ( https://github.com/apache/arrow-nanoarrow/pull/390 ). You should be able to nanoarrow::read_nanoarrow(bqs_ipc_stream(...)) |> as.data.frame() (no pressure).

gaborcsardi commented 8 months ago

In the meanwhile, it seems like arrow is back on CRAN [1], so we can also submit the current version, and then later switch to nanoarrow.

[1] https://fosstodon.org/deck/@nic_crane@mastodon.social/111981942640349340

meztez commented 8 months ago

In the meanwhile, it seems like arrow is back on CRAN [1], so we can also submit the current version, and then later switch to nanoarrow.

[1] https://fosstodon.org/deck/@nic_crane@mastodon.social/111981942640349340

Let's hope third time is the charm.

meztez commented 8 months ago

Thanks, on its way to CRAN.

Best, Benjamin Altmann