meztez / bigrquerystorage

R Client for BigQuery Storage API
Apache License 2.0
19 stars · 3 forks

bigrquery integration #46

Closed · hadley closed this 1 month ago

hadley commented 8 months ago

If bigquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (with some argument/option to opt-out, if needed).
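For illustration only, such a default-with-opt-out could be a simple guard in bigrquery's download path. This is a hypothetical sketch; the option name `bigrquery.use_storage_api` is invented here and is not an existing setting:

```r
# Hypothetical sketch: prefer the Storage API when the package is installed,
# unless the user has explicitly opted out via an option.
use_storage_api <- requireNamespace("bigrquerystorage", quietly = TRUE) &&
  isTRUE(getOption("bigrquery.use_storage_api", default = TRUE))
```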

meztez commented 8 months ago

If bigquerystorage is available, can you think of any reason that bigrquery shouldn't use it by default? (with some argument/option to opt-out, if needed).

I've taken notice of your question.

I will do some further testing to validate my hypothesis and come back with a comprehensive response in a timely manner.

Thank you (Posit) for the coffee mugs. I will use them to top up during the experiments.

meztez commented 8 months ago

Here is what I confirmed. I see two possible issues.

https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#readrowsrequest
https://github.com/tensorflow/io/issues/876
https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/R/bqs_download.R#L267

hadley commented 8 months ago

I'm not too concerned about your first point; we can just make sure that start_index gives a clear error if you use it with the arrow endpoint.
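A minimal sketch of such a guard (the names and wording are illustrative, not actual bigrquery code):

```r
# Hypothetical check: fail early and clearly rather than silently ignoring
# start_index, which the Storage API read path does not support.
if (!is.null(start_index)) {
  stop("`start_index` is not supported when downloading via the ",
       "BigQuery Storage API; omit it or opt out of the Storage API path.")
}
```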

The second point is interesting, but I think it will make itself clear once it's in bigrquery itself, because all the tests use tiny datasets (I always think it's hilarious that I'm using BigQuery to query mtcars 😆).

hadley commented 7 months ago

When I try and download a public dataset, I see:

Error:
! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.

Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?

hadley commented 7 months ago

Do you expect that this would take 30+ seconds?

system.time(x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.natality",
  parent = bigrquery::bq_test_project(),
  n_max = 100000
))

The initial download progress bar seems to flash by very quickly and then nothing happens for a while.

I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages, but if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.

Also, if I omit n_max, I see a progress bar with an ETA of 5 minutes; if I then press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?

meztez commented 7 months ago

When I try and download a public dataset, I see:

Error:
! gRPC method CreateReadSession error -> Permission denied on resource project publicdata.

Is that expected? Is that a case where I need to pass in parent, which I guess corresponds to the bigrquery idea of a billing project?

It needs a billable project with access to publicdata, as mentioned in the README: https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/README.Rmd#L159
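For example (a sketch; the project id is a placeholder you must replace with your own billable GCP project):

```r
# Reading from a public dataset still requires a billable parent project,
# because the read session is created and billed under that project.
x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.shakespeare",
  parent = "my-billing-project"  # your billable GCP project id
)
```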

meztez commented 7 months ago

Do you expect that this would take 30+ seconds?

system.time(x <- bigrquerystorage::bqs_table_download(
  x = "publicdata.samples.natality",
  parent = bigrquery::bq_test_project(),
  n_max = 100000
))

The initial download progress bar seems to flash by very quickly and then nothing happens for a while.

I'm also surprised that if I set n_max = 10000 I see Streamed 142080 rows in 148 messages, but if I set n_max = 100000 I see Streamed 232320 rows in 242 messages.

Also, if I omit n_max, I see a progress bar with an ETA of 5 minutes; if I then press Ctrl + C to cancel, it takes at least 5 seconds before it actually cancels. Maybe there's a loop that needs to check for interrupts a bit more frequently?

The loop only checks for interrupts when in quiet mode; I vaguely remember experimenting with the interrupt/performance tradeoff: https://github.com/meztez/bigrquerystorage/blob/3f8f95644f238dc8881d74553ebc18d4e6cca230/src/bqs.cpp#L177

I will test performance on a similar dataset. I wonder if it's memory/swap related.

hadley commented 7 months ago

One more data point: if I download publicdata.samples.shakespeare, it only takes 1s.

meztez commented 7 months ago

OK, there is something going on with the n_max / truncation method. This is not something we use a lot internally, so it was not detected. Fixing it.

meztez commented 7 months ago

Really sorry you had to experience these issues. I believe I identified the problem and corrected the relevant code to check for interrupts more frequently; checking every 100 messages should have only a marginal performance impact.

I've also reworked the n_max / truncation logic to use the recommended ClientContext TryCancel method, which cancels the stream cleanly when n_max is hit.

It should take between 1 and 2 seconds now.

hadley commented 7 months ago

No problems and thanks so much for fixing this so quickly!

meztez commented 1 month ago

One more thing that needs to be addressed is how circular imports can be managed. bigrquerystorage uses the following functions from bigrquery:

  • bq_table_fields

  • bq_has_token

  • the .auth$cred environment value

Should not be an issue now. Let me know if you need anything else. Have a good weekend.

meztez commented 1 month ago

It seems like @gaborcsardi fixed support back to R 4.0. Looking forward to this being integrated into bigrquery.

hadley commented 1 month ago

@meztez do you think you could do a CRAN release in the near future? Then I'll advertise dev bigrquery to get some people actually using it and then do a CRAN release of my own.

meztez commented 1 month ago

Submitted as 1.2.0; will release on GitHub when CRAN publishes.

hadley commented 1 month ago

Awesome, thank you!

meztez commented 1 month ago

@hadley released bigrquerystorage 1.2.0: https://cran.r-project.org/web/packages/bigrquerystorage/index.html