r-dbi / bigrquery

An interface to Google's BigQuery from R.
https://bigrquery.r-dbi.org

Allow for cross-dataset interactions in the dplyr backend #101

Closed: mpancia closed this issue 6 years ago

mpancia commented 8 years ago

The src_bigquery function in bigrquery takes a dataset argument, which specifies the dataset from which tables are accessed.

While this is, of course, required in order to specify a table, there is no reason that joining or otherwise comparing tables across datasets in the same project should require copying one table into the other's dataset.

I suggest that src_bigquery ought to accept either a particular dataset or an entire project, in which case tbl could source any table in any dataset within the project. Provided that subsequent queries use fully-qualified table names (e.g. publicdata.natality as opposed to just natality), there should not be any problems.
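
A hypothetical sketch of the proposed usage (nothing here exists today; the project-only call and the dataset-qualified lookup are the suggestion):

# Hypothetical: a src scoped to a whole project rather than a single
# dataset, with tbl() resolving dataset-qualified table names.
src <- src_bigquery("project")
natality <- tbl(src, "publicdata.natality")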

This is a current issue for me: I have a project with two datasets, dataset1 and dataset2, containing tables tablea and tableb respectively.

If I want to access these using dplyr and join them, I'd do something like:

ds.1 <- src_bigquery("project", "dataset1")
ds.2 <- src_bigquery("project", "dataset2")
tb.A <- tbl(ds.1, "tablea")
tb.B <- tbl(ds.2, "tableb")
tb.joined <- tb.A %>% inner_join(tb.B, by = "col")

This would throw an error: Error: x and y don't share the same src. Set copy = TRUE to copy y into x's source (this may be time consuming).

However, this copying is unnecessary: BigQuery allows a query to reference tables across datasets (and even across projects). I'd like to avoid the copy in this scenario.

That is, the following query would work just fine (as the analogous cross-schema join would in, say, MySQL):

SELECT *
FROM `project.dataset1.tablea` AS tbl_a
INNER JOIN `project.dataset2.tableb` AS tbl_b
  ON tbl_a.col = tbl_b.col

mpancia commented 7 years ago

A note here, given that people seem to be stumbling across this via Google searches: you can get around it by choosing a single schema (dataset) to generate the src object, and then obtaining foreign tables by writing the SQL query that sources them. For example:

ds.1 <- src_bigquery("project", "dataset1")
tb.A <- tbl(ds.1, "tablea")
tb.B <- tbl(ds.1, sql("SELECT * FROM dataset2.tableb"))  # note ds.1, so both tbls share one src
tb.joined <- tb.A %>% inner_join(tb.B, by = "col")

The same problem exists in Postgres when attempting to join across schemata, and the same trick works; there, though, you are not required to specify a search_path, so simply fully qualifying the table names does the trick.
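
A minimal sketch of that Postgres version, with hypothetical connection details, schema, and column names (none of these are from this issue):

library(dplyr)
library(dbplyr)

# "mydb", the archive schema, and order_id are placeholders.
con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "mydb")

orders   <- tbl(con, "orders")                             # default search_path
archived <- tbl(con, sql("SELECT * FROM archive.orders"))  # foreign schema, fully qualified
joined   <- inner_join(orders, archived, by = "order_id")  # same src, so no copy = TRUE needed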

hadley commented 7 years ago

I think this will get naturally resolved once I update to dplyr 0.6.0

peishenwu commented 7 years ago

Hi, I am also running into this issue when performing two-table verbs (left_join, union, ...) with tables from two different datasets on BigQuery. However, when I tried dplyr 0.7, the problem still exists. What is the current status of this issue?

My setup is dplyr 0.7 + bigrquery 0.3.0.9000 + dbplyr.

hadley commented 7 years ago

@peishenwu please provide a minimal reproducible example using the reprex package.

mpancia commented 7 years ago

The trick that I mentioned above fails to work, and this seems to be partially due to tbl being unable to source BQ tables from arbitrary SQL.

That is, with the newest versions of dbplyr, bigrquery, and dplyr, you cannot do something like:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(bigrquery)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
set_service_token("~/Downloads/My Project-4f2d250d1023.json")
#> Warning in strptime(x, fmt, tz = "GMT"): unknown timezone 'default/America/
#> Los_Angeles'
con_us <- DBI::dbConnect(dbi_driver(),
                      project = "bigquery-public-data",
                      dataset = "utility_us",
                      billing = "profound-ranger-99202"
)

country_code_iso_us <- con_us %>% tbl("country_code_iso")
country_code_iso_eu <- con_us %>% tbl("SELECT * FROM utility_eu.country_code_iso")
#> Error: HTTP error [400]
#> (Google returns its full HTML "Error 400 (Bad Request)" page here;
#> the gist is: "Your client has issued a malformed or illegal request.
#> That's all we know." Boilerplate HTML/CSS omitted.)

Digging into the code a bit, it seems to me that this comes from the strategy used for getting table information: the call fetches metadata for a named table, so if the tbl being created is ad hoc, that API call is going to fail.

Maybe I'm being silly (I am not super familiar with this codebase), but an alternative way to get the table info, one that would work with ad-hoc tables, would be to run a dummy query job and inspect the resulting schema, as per the API reference and similar to what is done in the dbplyr package.

For example, making the call to the BQ query API:

{
  "query": "SELECT series_id, year FROM `bigquery-public-data.bls.c_cpi_u` WHERE 1=0",
  "useLegacySql": false
}

returns

{
 "kind": "bigquery#queryResponse",
 "schema": {
  "fields": [
   {
    "name": "series_id",
    "type": "STRING",
    "mode": "NULLABLE"
   },
   {
    "name": "year",
    "type": "INTEGER",
    "mode": "NULLABLE"
   }
  ]
 },
 "jobReference": {
  "projectId": "profound-ranger-99202",
  "jobId": "job_O_Y2MADZRJcEFRT-WTCFR_xUS4U"
 },
 "totalRows": "0",
 "totalBytesProcessed": "135584",
 "jobComplete": true,
 "cacheHit": false
}

One can then extract the table metadata from the schema field of the response.
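
A minimal sketch of the same idea from R, assuming an existing DBI connection such as con_us above (the helper name is hypothetical):

# Recover a table's column names by running a zero-row query, so no
# named-table metadata lookup is needed; BigQuery still reports the
# result schema even though WHERE 1 = 0 returns no rows.
fields_via_dummy_query <- function(con, qualified_table) {
  probe <- DBI::dbGetQuery(
    con,
    paste0("SELECT * FROM ", qualified_table, " WHERE 1 = 0")
  )
  names(probe)
}

# e.g. fields_via_dummy_query(con_us, "utility_eu.country_code_iso")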

peishenwu commented 7 years ago

Here I use the public datasets provided by BigQuery as an example.

library(bigrquery)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

mybilling <- readRDS("/Users/peishenwu/bqbilling.rds")
bigrquery::set_service_token("/Users/peishenwu/bqbilling.json")

con1 <- DBI::dbConnect(bigrquery::dbi_driver(),
                       dataset = "noaa_gsod",
                       project = "bigquery-public-data",
                       billing = mybilling,
                       use_legacy_sql = F)

con2 <- DBI::dbConnect(bigrquery::dbi_driver(),
                       dataset = "samples",
                       project = "bigquery-public-data",
                       billing = mybilling,
                       use_legacy_sql = F)

tbl1 <- tbl(con1, "gsod2008")
tbl2 <- tbl(con2, "gsod")

## show output
tbl1 %>% head
#> 623.0 megabytes processed
#> # Source:   lazy query [?? x 32]
#> # Database: BigQueryConnection
#>      stn  wban  year    mo    da  temp count_temp   dewp count_dewp    slp
#>    <chr> <chr> <chr> <chr> <chr> <dbl>      <int>  <dbl>      <int>  <dbl>
#> 1 983275 99999  2008    08    28  77.5          4   72.1          4 9999.9
#> 2 977800 99999  2008    08    13  67.9          4   59.5          4 9999.9
#> 3 977800 99999  2008    04    16  68.1          4   61.5          4 9999.9
#> 4 977900 99999  2008    01    12  83.2          4 9999.9          0 9999.9
#> 5 977600 99999  2008    03    18  83.3          4   77.3          4 9999.9
#> 6 977800 99999  2008    07    15  69.9          4   59.6          4 9999.9
#> # ... with 22 more variables: count_slp <int>, stp <dbl>, count_stp <int>,
#> #   visib <dbl>, count_visib <int>, wdsp <chr>, count_wdsp <chr>,
#> #   mxpsd <chr>, gust <dbl>, max <dbl>, flag_max <chr>, min <dbl>,
#> #   flag_min <chr>, prcp <dbl>, flag_prcp <chr>, sndp <dbl>, fog <chr>,
#> #   rain_drizzle <chr>, snow_ice_pellets <chr>, hail <chr>, thunder <chr>,
#> #   tornado_funnel_cloud <chr>
tbl2 %>% head
#> 16.1 gigabytes processed
#> # Source:   lazy query [?? x 31]
#> # Database: BigQueryConnection
#>   station_number wban_number  year month   day mean_temp
#>            <int>       <int> <int> <int> <int>     <dbl>
#> 1         915925       99999  2009     3     9      79.5
#> 2         944300       99999  1993     8     8      58.7
#> 3          61500       99999  1975     6    12      64.2
#> 4         247680       99999  1987    10    27     -12.3
#> 5          72450       99999  2009     1    13      37.1
#> 6         173520       99999  1998     6    10      77.8
#> # ... with 25 more variables: num_mean_temp_samples <int>,
#> #   mean_dew_point <dbl>, num_mean_dew_point_samples <int>,
#> #   mean_sealevel_pressure <dbl>,
#> #   num_mean_sealevel_pressure_samples <int>, mean_station_pressure <dbl>,
#> #   num_mean_station_pressure_samples <int>, mean_visibility <dbl>,
#> #   num_mean_visibility_samples <int>, mean_wind_speed <dbl>,
#> #   num_mean_wind_speed_samples <int>, max_sustained_wind_speed <dbl>,
#> #   max_gust_wind_speed <dbl>, max_temperature <dbl>,
#> #   max_temperature_explicit <lgl>, min_temperature <dbl>,
#> #   min_temperature_explicit <lgl>, total_precipitation <dbl>,
#> #   snow_depth <dbl>, fog <lgl>, rain <lgl>, snow <lgl>, hail <lgl>,
#> #   thunder <lgl>, tornado <lgl>

##
tbl1 %>% left_join(tbl2, by = c("year" = "year",
                                "mo" = "month",
                                "da" = "day"))
#> Error: Not found: Table bigquery-public-data:noaa_gsod.gsod

tbl2 %>% left_join(tbl1, by = c("year" = "year",
                                "month" = "mo",
                                "day" = "da"))
#> Error: Not found: Table bigquery-public-data:samples.gsod2008

The joins should work; however, the generated query seems to assume that both the LHS and RHS come from the same dataset as the LHS, which is not true in this case.
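
For reference, a workaround sketch under this reprex's setup (not part of the original output): since BigQuery itself accepts fully-qualified table names across datasets, the join can be issued through a single connection as raw SQL.

# Column choices here are illustrative. Note the casts (year/mo/da are
# strings in gsod2008 but integers in gsod), and note that this scans
# both tables and bills accordingly.
joined <- DBI::dbGetQuery(con1, "
  SELECT lhs.year, lhs.mo, lhs.da, lhs.temp, rhs.mean_temp
  FROM `bigquery-public-data.noaa_gsod.gsod2008` AS lhs
  LEFT JOIN `bigquery-public-data.samples.gsod` AS rhs
    ON CAST(lhs.year AS INT64) = rhs.year
   AND CAST(lhs.mo AS INT64) = rhs.month
   AND CAST(lhs.da AS INT64) = rhs.day
")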

ENsu commented 7 years ago

+1

tinoater commented 6 years ago

This is still an issue, and is causing a problem for me as we have views and tables in different datasets (so permissions can be assigned at dataset level). Are there any workarounds I could patch in in the meantime?

hadley commented 6 years ago

I'm closing this issue since it's actually two separate issues (linked above, with minimal reprexes).