meztez / bigrquerystorage

R Client for BigQuery Storage API
Apache License 2.0
19 stars · 3 forks

bigrquery integration #46

Closed · hadley closed this 1 month ago

hadley commented 8 months ago

If bigquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (with some argument/option to opt-out, if needed).
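For illustration only, such a default-with-opt-out could be a simple guard in bigrquery's download path. This is a hypothetical sketch; the option name `bigrquery.use_storage_api` is invented here and is not an existing setting:

```r
# Hypothetical sketch: prefer the Storage API when the package is installed,
# unless the user has explicitly opted out via an option.
use_storage_api <- requireNamespace("bigrquerystorage", quietly = TRUE) &&
  isTRUE(getOption("bigrquery.use_storage_api", default = TRUE))
```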

meztez commented 8 months ago

If bigquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (with some argument/option to opt-out, if needed).

I've taken notice of your question.

I will do some further testing to validate my hypothesis and come back with a comprehensive response in a timely manner.

Thank you (Posit) for the coffee mugs. I will use them to top up during the experiments.

meztez commented 8 months ago

Here is what I confirmed. I see two possible issues.

https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#readrowsrequest
https://github.com/tensorflow/io/issues/876
https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/R/bqs_download.R#L267

hadley commented 8 months ago

I'm not too concerned about your first point; we can just make sure that start_index gives a clear error if you use it with the arrow endpoint.
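A minimal sketch of such a guard (the names and wording are illustrative, not actual bigrquery code):

```r
# Hypothetical check: fail early and clearly rather than silently ignoring
# start_index, which the Storage API read path does not support.
if (!is.null(start_index)) {
  stop("`start_index` is not supported when downloading via the ",
       "BigQuery Storage API; omit it or opt out of the Storage API path.")
}
```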

The second point is interesting, but I think it will make itself clear once it's in bigrquery itself, because all the tests use tiny datasets (I always think it's hilarious that I'm using BigQuery to query mtcars 😆).

hadley commented 7 months ago

When I try and download a public dataset, I see:

Error:
! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.

Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?

hadley commented 7 months ago

Do you expect that this would take 30+ seconds?

system.time(x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.natality",
  parent = bigrquery::bq_test_project(),
  n_max = 100000
))

The initial download progress bar seems to flash by very quickly and then nothing happens for a while.

I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages, but if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.

Also, if I omit n_max, I see a progress bar with an ETA of 5 minutes; if I then press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?

meztez commented 7 months ago

When I try and download a public dataset, I see:

Error:
! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.

Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?

It needs a billable project with access to publicdata, as mentioned in the README: https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/README.Rmd#L159
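For example (a sketch; the project id is a placeholder you must replace with your own billable GCP project):

```r
# Reading from a public dataset still requires a billable parent project,
# because the read session is created and billed under that project.
x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.shakespeare",
  parent = "my-billing-project"  # your billable GCP project id
)
```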

meztez commented 7 months ago

Do you expect that this would take 30+ seconds?

system.time(x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.natality",
  parent = bigrquery::bq_test_project(),
  n_max = 100000
))

The initial download progress bar seems to flash by very quickly and then nothing happens for a while.

I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages, but if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.

Also, if I omit n_max, I see a progress bar with an ETA of 5 minutes; if I then press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?

The loop only checks for interrupts when in quiet mode; I vaguely remember experimenting with the interrupt/performance tradeoff: https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/src/bqs.cpp#L177

I will test performance on a similar dataset. I wonder if it's memory/swap related.

hadley commented 7 months ago

One more data point: if I download publicdata.samples.shakespeare, it only takes 1s.

meztez commented 7 months ago

OK, there is something going on with the n_max / truncation method. This is not something we use a lot internally, so it was not detected. Fixing it.

meztez commented 7 months ago

Really sorry you had to experience these issues. I believe I identified the problem and corrected the relevant code to check for interrupts more frequently; checking every 100 messages should have only a marginal performance impact.

I've also reworked the n_max / truncation logic to use the recommended ClientContext TryCancel method, which cancels the stream cleanly when n_max is hit.

It should take between 1 and 2 seconds now.

hadley commented 7 months ago

No problems and thanks so much for fixing this so quickly!

meztez commented 1 month ago

One more thing that needs to be addressed is how circular imports can be managed. bigrquerystorage uses the following functions from bigrquery:

  • bq_table_fields

  • bq_has_token

  • the .auth$cred environment value

Should not be an issue now. Let me know if you need anything else. Have a good weekend.

meztez commented 1 month ago

It seems like @gaborcsardi fixed support back to R 4.0. Looking forward to this being integrated into bigrquery.

hadley commented 1 month ago

@meztez do you think you could do a CRAN release in the near future? Then I'll advertise dev bigrquery to get some people actually using it and then do a CRAN release of my own.

meztez commented 1 month ago

Submitted as 1.2.0; will release on GitHub when CRAN publishes.

hadley commented 1 month ago

Awesome, thank you!

meztez commented 1 month ago

@hadley released bigrquerystorage 1.2.0: https://cran.r-project.org/web/packages/bigrquerystorage/index.html