If bigrquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (With some argument/option to opt out, if needed.)
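For concreteness, here's roughly what I have in mind; the option name and helper below are just a sketch, not a committed API:

```r
# Hypothetical sketch of "on by default, with an opt-out"; the option name
# and helper are invented for illustration, not bigrquery's actual API.
use_storage_api <- function() {
  isTRUE(getOption("bigrquery.use_storage_api", default = TRUE)) &&
    requireNamespace("bigrquerystorage", quietly = TRUE)
}
```

A download function could then branch on `use_storage_api()`, falling back to the REST endpoint when the package is missing or the user has opted out.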
I've taken notice of your question.
I will do some further testing to validate my hypothesis and come back with a comprehensive response in a timely manner.
Thank you (Posit) for the coffee mugs. I will use them to top up during the experiments.
Here is what I confirmed. I see two possible issues.
https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#readrowsrequest
https://github.com/tensorflow/io/issues/876
https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/R/bqs_download.R#L267
I'm not too concerned about your first point; we can just make sure that start_index gives a clear error if you use it with the arrow endpoint.
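A minimal sketch of the kind of guard I mean (the argument and flag names are illustrative, not actual bigrquery internals):

```r
# Illustrative guard only -- not bigrquery's real code.
check_start_index <- function(start_index, uses_storage_api) {
  if (uses_storage_api && !is.null(start_index)) {
    stop(
      "`start_index` is not supported by the BigQuery Storage API endpoint; ",
      "omit it or opt out of the storage endpoint.",
      call. = FALSE
    )
  }
  invisible(start_index)
}
```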
The second point is interesting, but I think it will only become apparent in bigrquery itself, because all the tests use tiny datasets (I always think it's hilarious that I'm using bigquery to query mtcars 😆).
When I try and download a public dataset, I see:
Error:
! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.
Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?
Do you expect that this would take 30+ seconds?
```r
system.time(x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.natality",
  parent = bigrquery::bq_test_project(),
  n_max = 100000
))
```
The initial download progress bar seems to flash by very quickly and then nothing happens for a while.
I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages. But if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.
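Doing the arithmetic, both runs stream exactly 960 rows per message, so this looks like much more than just rounding n_max up to a whole message:

```r
# Rows per message in the two runs above:
c(142080 / 148, 232320 / 242)
#> [1] 960 960

# Rounding n_max = 10000 up to whole 960-row messages would need only
ceiling(10000 / 960)
#> [1] 11
# messages, not the 148 observed.
```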
Also, if I omit n_max I see a progress bar with an ETA of 5 minutes, and when I press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?
> When I try and download a public dataset, I see:
> Error: ! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.
> Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?
It needs a billable project with access to publicdata, as mentioned in the README: https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/README.Rmd#L159
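For example (the project id here is a placeholder for any project you can bill to):

```r
# `parent` is the billable project the read session is created under;
# "my-billing-project" is a placeholder, not a real project id.
x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.shakespeare",
  parent = "my-billing-project"
)
```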
> Do you expect that this would take 30+ seconds?
> system.time(x <- bigrquerystorage::bqs_table_download(x = "publicdata.samples.natality", parent = bigrquery::bq_test_project(), n_max = 100000))
> The initial download progress bar seems to flash by very quickly and then nothing happens for a while.
> I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages. But if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.
> Also, if I omit n_max I see a progress bar with an ETA of 5 minutes, and when I press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?
I only see an interrupt check in the loop when in quiet mode; I vaguely remember experimenting with the interrupt/performance tradeoff: https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/src/bqs.cpp#L177
I will test performance on a similar dataset. I wonder if it's memory/swap related.
One more data point: if I download publicdata.samples.shakespeare, it only takes 1s.
Ok, there is something going on with the n_max / truncation method. This is not something we use a lot internally, so it was not detected. Fixing it.
Really sorry you had to experience these issues. I believe I identified the issue and corrected the relevant code to check for interrupts more frequently; checking every 100 messages should have a marginal performance impact.
I've also reworked the n_max / truncation to use the recommended TryCancel ClientContext method to cancel streaming cleanly when n_max is hit.
It should take between 1 and 2 seconds now.
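Roughly, in R pseudocode (the real loop lives in the package's C++ sources, so every name below is a stand-in):

```r
# Illustrative sketch of the fixed streaming loop; the real implementation
# is in src/bqs.cpp, so these function names are invented.
consume_stream <- function(next_message, n_max = Inf) {
  rows <- 0L
  messages <- 0L
  repeat {
    msg <- next_message()      # one gRPC ReadRows message, or NULL at end
    if (is.null(msg)) break
    messages <- messages + 1L
    rows <- rows + nrow(msg)
    if (messages %% 100L == 0L) {
      # check for a user interrupt every 100 messages (marginal overhead);
      # the C++ loop calls Rcpp::checkUserInterrupt() here
    }
    if (rows >= n_max) {
      # stop streaming cleanly once n_max is reached;
      # the C++ loop calls ClientContext::TryCancel() here
      break
    }
  }
  list(rows = rows, messages = messages)
}
```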
No problems and thanks so much for fixing this so quickly!
One more thing that needs to be addressed is how circular imports can be managed. bigrquerystorage uses the following functions from bigrquery:
bq_table_fields
bq_has_token
the .auth$cred environment value
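One common way to keep this kind of dependency out of Imports is a run-time lookup; a hypothetical sketch, not necessarily what bigrquerystorage ended up doing:

```r
# Hypothetical wrapper: resolve bigrquery at run time so it can stay in
# Suggests rather than Imports; bq_table_fields() is a real bigrquery export.
bq_fields_or_stop <- function(table) {
  if (!requireNamespace("bigrquery", quietly = TRUE)) {
    stop("the bigrquery package is required here", call. = FALSE)
  }
  bigrquery::bq_table_fields(table)
}
```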
Should not be an issue now. Let me know if you need anything else. Have a good weekend.
It seems like @gaborcsardi fixed support back to R 4.0. Looking forward to this being integrated into bigrquery.
@meztez do you think you could do a CRAN release in the near future? Then I'll advertise dev bigrquery to get some people actually using it and then do a CRAN release of my own.
Submitted as 1.2.0; will release on GitHub when CRAN publishes.
Awesome, thank you!
@hadley released bigrquerystorage 1.2.0 : https://cran.r-project.org/web/packages/bigrquerystorage/index.html