pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/
Other
398 stars 35 forks source link

feat: Adds support for scanning parquet from GCP #1056

Closed andyquinterom closed 3 weeks ago

andyquinterom commented 3 weeks ago

Related to #1054

This should support GCP parquet scanning.

I tested this snippet and it works perfectly.

library(polars)
library(dplyr)
library(tidypolars)

iris_parquet <- pl$scan_parquet(
  "gs://path/to/*.parquet",
  cloud_options = c("service_account" = "./service_account.json")
)

iris_parquet |>
  select(starts_with(c("Sep", "Pet"))) |> 
  mutate(
    petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
  ) |>
  filter(between(Sepal.Length, 4.5, 5.5)) |> 
  head() |>
  compute()
andyquinterom commented 3 weeks ago

Thanks @andyquinterom, I don't have any experience with interacting directly with cloud providers from R but this looks nice. I also can't compile on Windows for now, I have the same errors as in some of the CI. I'll explore a bit why.

In the meantime, I added a few minor comments. I would also like to know how we can test this. Is there a way to mock the connection to GCP and other services?

Thanks for taking the time to read my submission.

I'll fix some details you commented, it was late last night and things could be improved.

I'll check the compilation errors in windows as well to see if I find anything.

On the mocking side, this functionality is fully on the Rust polars-io crate. I don't know if we could truly mock it or rely on the rust testing implementation. I will see how the Python layer is tested for an idea.

andyquinterom commented 3 weeks ago

@etiennebacher I just pushed some commits that should fix the issues you mentioned

andyquinterom commented 3 weeks ago

@etiennebacher It seems the CI fail algo happens on main. https://github.com/pola-rs/r-polars/actions/runs/8803923620/job/24163176648

etiennebacher commented 3 weeks ago

One failure is due to extendr and the R-devel changes upstream, but the others are specific to this PR, e.g https://github.com/pola-rs/r-polars/actions/runs/8803957782/job/24163425862?pr=1056

andyquinterom commented 3 weeks ago

One failure is due to extendr and the R-devel changes upstream, but the others are specific to this PR, e.g https://github.com/pola-rs/r-polars/actions/runs/8803957782/job/24163425862?pr=1056

@etiennebacher Just pushed a change adding -lcrypt32 to the windows linker arguments. Let's see if this fixes the issue

andyquinterom commented 3 weeks ago

@etiennebacher Seems like the CI issues are solved.

etiennebacher commented 3 weeks ago

I still see a similar error here: https://github.com/pola-rs/r-polars/actions/runs/8805357520/job/24168950149?pr=1056#step:9:335

On the mocking side, this functionality is fully on the Rust polars-io crate. I don't know if we could truly mock it or rely on the rust testing implementation. I will see how the Python layer is tested for an idea.

Ideally we want to test our implementation as well. They take care of all the details but we want to ensure that our R binding is correct. I don't know whether they can mock this in python but if they do, it would be great to mimic it.

andyquinterom commented 3 weeks ago

@etiennebacher

I'm looking into the tests it seems like during testing an S3 bucket is created.

https://github.com/pola-rs/polars/blob/ba85175a830601ba622f0cf3615e2297c70b9e96/py-polars/tests/unit/io/cloud/test_aws.py

This would make it pretty difficult to test unless we get some credential for creating and uploading files to S3.

andyquinterom commented 3 weeks ago

@etiennebacher it seems like I finally got it to link https://github.com/pola-rs/r-polars/actions/runs/8808439596/job/24177961258

etiennebacher commented 3 weeks ago

Thanks, I can compile too. For testing, I think we can use some S3/GCS buckets that are publicly available, like those in this arrow vignette. This is not ideal because we rely on something external but I don't have an alternative for now. There might be more ideas in the arrow test suite.

However, the following doesn't work for me and I don't know why:

library(arrow)
arrow_with_gcs()
#> [1] TRUE

arrow::read_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet") |> 
  head()
#>   carat color clarity depth table price    x    y    z
#> 1  0.23     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 2  0.31     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 3  0.30     J     SI1  64.0    55   339 4.25 4.28 2.73
#> 4  0.30     J     SI1  63.4    54   351 4.23 4.29 2.70
#> 5  0.30     J     SI1  63.8    56   351 4.23 4.26 2.71
#> 6  0.30     I     SI2  63.3    56   351 4.26 4.30 2.71

pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")

#> Error: Execution halted with the following contexts
#>    0: In R: in pl$scan_parquet():
#>    0: During function call [pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")]
#>    1: Encountered the following error in Rust-Polars:
#>          Generic GCS error: Error performing token request: Error after 2 retries in 3.3478343s, max_retries:2, retry_timeout:10s, 
#>            source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken): 
#>            error trying to connect: tcp connect error: deadline has elapsed

@eitsupi do you have some ideas about this?

eitsupi commented 3 weeks ago

I think Polars does not support such addresses, because I get an error when I try that string in Python Polars.

andyquinterom commented 3 weeks ago

Thanks, I can compile too. For testing, I think we can use some S3/GCS buckets that are publicly available, like those in this arrow vignette. This is not ideal because we rely on something external but I don't have an alternative for now. There might be more ideas in the arrow test suite.

However, the following doesn't work for me and I don't know why:


library(arrow)

arrow_with_gcs()

#> [1] TRUE

arrow::read_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet") |> 

  head()

#>   carat color clarity depth table price    x    y    z

#> 1  0.23     E     VS1  56.9    65   327 4.05 4.07 2.31

#> 2  0.31     J     SI2  63.3    58   335 4.34 4.35 2.75

#> 3  0.30     J     SI1  64.0    55   339 4.25 4.28 2.73

#> 4  0.30     J     SI1  63.4    54   351 4.23 4.29 2.70

#> 5  0.30     J     SI1  63.8    56   351 4.23 4.26 2.71

#> 6  0.30     I     SI2  63.3    56   351 4.26 4.30 2.71

pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")

#> Error: Execution halted with the following contexts

#>    0: In R: in pl$scan_parquet():

#>    0: During function call [pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")]

#>    1: Encountered the following error in Rust-Polars:

#>        Generic GCS error: Error performing token request: Error after 2 retries in 3.3478343s, max_retries:2, retry_timeout:10s, 

#>            source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken): 

#>            error trying to connect: tcp connect error: deadline has elapsed

@eitsupi do you have some ideas about this?

This also happens on the Rust side, if we don't have a Service Account given it will error out since it doesn't know where to look for the metadata of the bucket. GCP buckets seem to not be tested in the Python side.

The only way to fix this "issue" would be to run the GCP test suite on GCP servers or use a service account.

eitsupi commented 3 weeks ago

I know of a server product that simulates such a GCS, but I don't know if polars can connect to this (and if it is even necessary to test with it). https://github.com/fsouza/fake-gcs-server

At the moment I do not have the time to test this.

andyquinterom commented 3 weeks ago

If I provide a service account it works with the dataset in the arrow vignette.

library(polars)
library(dplyr)
library(tidypolars)

pl$scan_parquet(
  "gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet",
  cloud_options = c(
    service_account = "./service_account.json"
  )
)$collect()
# shape: (4_906, 10)
# ┌───────┬───────┬─────────┬───────┬───┬──────┬──────┬──────┬──────┐
# │ carat ┆ color ┆ clarity ┆ depth ┆ … ┆ x    ┆ y    ┆ z    ┆ cut  │
# │ ---   ┆ ---   ┆ ---     ┆ ---   ┆   ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
# │ f64   ┆ cat   ┆ cat     ┆ f64   ┆   ┆ f64  ┆ f64  ┆ f64  ┆ str  │
# ╞═══════╪═══════╪═════════╪═══════╪═══╪══════╪══════╪══════╪══════╡
# │ 0.23  ┆ E     ┆ VS1     ┆ 56.9  ┆ … ┆ 4.05 ┆ 4.07 ┆ 2.31 ┆ Good │
# │ 0.31  ┆ J     ┆ SI2     ┆ 63.3  ┆ … ┆ 4.34 ┆ 4.35 ┆ 2.75 ┆ Good │
# │ 0.3   ┆ J     ┆ SI1     ┆ 64.0  ┆ … ┆ 4.25 ┆ 4.28 ┆ 2.73 ┆ Good │
# │ 0.3   ┆ J     ┆ SI1     ┆ 63.4  ┆ … ┆ 4.23 ┆ 4.29 ┆ 2.7  ┆ Good │
# │ 0.3   ┆ J     ┆ SI1     ┆ 63.8  ┆ … ┆ 4.23 ┆ 4.26 ┆ 2.71 ┆ Good │
# │ …     ┆ …     ┆ …       ┆ …     ┆ … ┆ …    ┆ …    ┆ …    ┆ …    │
# │ 0.8   ┆ G     ┆ VS2     ┆ 64.2  ┆ … ┆ 5.84 ┆ 5.81 ┆ 3.74 ┆ Good │
# │ 0.84  ┆ I     ┆ VS1     ┆ 63.7  ┆ … ┆ 5.94 ┆ 5.9  ┆ 3.77 ┆ Good │
# │ 0.74  ┆ D     ┆ SI1     ┆ 63.1  ┆ … ┆ 5.71 ┆ 5.74 ┆ 3.61 ┆ Good │
# │ 0.79  ┆ F     ┆ SI1     ┆ 58.1  ┆ … ┆ 6.06 ┆ 6.13 ┆ 3.54 ┆ Good │
# │ 0.72  ┆ D     ┆ SI1     ┆ 63.1  ┆ … ┆ 5.69 ┆ 5.75 ┆ 3.61 ┆ Good │
# └───────┴───────┴─────────┴───────┴───┴──────┴──────┴──────┴──────┘

I guess I could try to create a fake service account that is enough to have the Rust side of things to the request as it needs. I will try to look into the Rust code to see if there are improvements to be made however, that would take quite some time to dig into.

andyquinterom commented 3 weeks ago

@etiennebacher I would also be willing to create a dummy service account, but I am not very confident that's outside of what the project would be comfortable doing due to possible security concerns.

etiennebacher commented 3 weeks ago

The only way to fix this "issue" would be to run the GCP test suite on GCP servers or use a service account.

How does the arrow package deal with that? They have tests for GCP if I'm not mistaken.

I would also be willing to create a dummy service account

I don't think that's the way to go. I suppose the ideal solution would be for rust-polars to support this kind of address but I don't know if they plan to do it (and I won't request this there because I never use this type of service).

For now, let's just mention that this argument is experimental but eventually we'll need some way to test this. I'll review more properly tomorrow but you can already rename this arg from cloud_options to storage_options

andyquinterom commented 3 weeks ago

@etiennebacher Just pushed the changes. I tried to stay almost as similar as possible to the python documentation. Tell me if you need anything more done.

andyquinterom commented 3 weeks ago

Thanks, I just added a news item. Actually I see this PR adds this arg only for pl$scan_parquet(), but it should also be in pl$read_parquet(). Can you add it too? Sorry, I forgot that earlier

@etiennebacher I did not change the read_parquet function. I have only validated functionality on scan_parquet. I see the implementation is a simple collect over a scan so I will get to it once I have some time

etiennebacher commented 3 weeks ago

@andyquinterom I added the arg in read_parquet so that I can merge once the ci passes. Many thanks for your contribution!

andyquinterom commented 3 weeks ago

@etiennebacher I just noticed your comment. I had just pushed the changes as well :/

Tell me if you want me to revert the changes or if they are ok as they are?

etiennebacher commented 3 weeks ago

I'll wrap up the PR, you can let it as is (the details section is missing in the docs of read_parquet)