meztez / bigrquerystorage

R Client for BigQuery Storage API
Apache License 2.0

BigQuery Storage Read API ignores `selected_fields` order and returns the original table field ordering #72

Closed botan closed 3 hours ago

botan commented 4 hours ago

It seems that the BigQuery Storage Read API doesn't respect the order of fields specified by the user and instead returns the fields in the original table order. I was wondering if it would be a better user experience if bigrquerystorage reordered the columns according to the user-supplied selected_fields order.

botan commented 4 hours ago

The example below requires the bigquery-public-data.usa_names.usa_1910_current table to be copied to your project.

library(bigrquerystorage)
library(glue)

billing <- Sys.getenv("GCP_BILLING_PROJECT_ID")

fields <- c("name", "number", "state")

bigquery_storage_api_rows <-
  bqs_table_download(
    x = glue("{billing}.usa_names.usa_1910_current"),
    selected_fields = fields,
    row_restriction = 'state = "WA"'
  )

fields == colnames(bigquery_storage_api_rows)
#> [1] FALSE FALSE FALSE
colnames(bigquery_storage_api_rows)
#> [1] "state"  "name"   "number"

meztez commented 3 hours ago

This package is supposed to mimic the BigQuery Storage Read API, and that API documents this behavior. From the TableReadOptions reference (https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#tablereadoptions):

"The order of the fields in the read session schema is derived from the table schema and does not correspond to the order in which the fields are specified in this list."

Reordering would have to come from a SELECT statement or from the method that calls this API.

One way from the example above:

library(bigrquerystorage)
library(glue)

billing <- Sys.getenv("GCP_BILLING_PROJECT_ID")

fields <- c("name", "number", "state")

bigquery_storage_api_rows <-
  bqs_table_download(
    x = glue("{billing}.usa_names.usa_1910_current"),
    selected_fields = fields,
    row_restriction = 'state = "WA"'
  )[fields]

fields == colnames(bigquery_storage_api_rows)
#> [1] TRUE TRUE TRUE
colnames(bigquery_storage_api_rows)
#> [1] "name"   "number" "state"
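
If the reordering is needed in several places, the `[fields]` subsetting above could be wrapped in a small helper. The sketch below is only an illustration: `reorder_to_selected()` is a hypothetical name and not part of bigrquerystorage, the data frame is a hand-made stand-in for a `bqs_table_download()` result, and it assumes `selected_fields` contains only top-level column names.

```r
# Hypothetical helper (not part of bigrquerystorage): put the columns of a
# downloaded table in the order the caller listed in `selected_fields`.
reorder_to_selected <- function(df, selected_fields) {
  # Fail early if a requested column is absent from the result.
  missing_cols <- setdiff(selected_fields, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found in result: ", paste(missing_cols, collapse = ", "))
  }
  df[, selected_fields, drop = FALSE]
}

# Stand-in for a bqs_table_download() result, with columns in table schema order.
rows <- data.frame(
  state = c("WA", "WA"),
  name = c("Mary", "John"),
  number = c(10L, 12L),
  stringsAsFactors = FALSE
)

fields <- c("name", "number", "state")
reordered <- reorder_to_selected(rows, fields)
colnames(reordered)
#> [1] "name"   "number" "state"
```

The explicit check turns a misspelled field into a readable error rather than the generic "undefined columns selected" message from base subsetting.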

botan commented 3 hours ago

That sounds reasonable, thanks!