mpancia closed this issue 6 years ago.
A note here, given that people seem to be stumbling across this during Google searches: you can work around this by choosing a single schema to generate the src
object and then sourcing foreign tables with an explicit SQL query. For example:
ds.1 <- src_bigquery("project", "dataset1")
tb.A <- tbl(ds.1, "tablea")
# Source the foreign table from the *same* src by fully qualifying it in SQL
tb.B <- tbl(ds.1, sql("SELECT * FROM dataset2.tableb"))
tb.joined <- tb.A %>% inner_join(tb.B, by = "col")
The same problem exists in Postgres when attempting to join across schemata, and the same trick works there, although Postgres does not require you to specify a search_path, so fully qualifying the table names does the trick.
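The Postgres variant of the trick can be sketched as follows; this is a minimal sketch assuming two schemas (`schema_a`, `schema_b`) in one database and a hypothetical connection, using `dbplyr::in_schema()` to produce the fully-qualified names:

```r
# Sketch: cross-schema join in Postgres from a single src, assuming the
# schemas and connection details below exist. in_schema() fully qualifies
# each table, so both tbls share one src and no copy is needed.
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "mydb")  # hypothetical DB
tb_a <- tbl(con, in_schema("schema_a", "tablea"))
tb_b <- tbl(con, in_schema("schema_b", "tableb"))
tb_joined <- tb_a %>% inner_join(tb_b, by = "col")
```

Because both tbls come from the same connection, dplyr never raises the "x and y don't share the same src" error.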
I think this will get naturally resolved once I update to dplyr 0.6.0
Hi, I am also running into this issue when using two-table verbs (left_join, union, ...) with tables from two different datasets on BigQuery. However, when I tried dplyr 0.7, the problem still exists. What is the current status of this issue?
My system is dplyr 0.7 + bigrquery 0.3.0.9000 + dbplyr
@peishenwu please provide a minimal reproducible example using the reprex package.
The trick that I mentioned above no longer works, and this seems partially due to tbl being incapable of sourcing BigQuery tables from arbitrary SQL. That is, with the newest versions of dbplyr, bigrquery, and dplyr, you cannot do something like:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(bigrquery)
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
set_service_token("~/Downloads/My Project-4f2d250d1023.json")
#> Warning in strptime(x, fmt, tz = "GMT"): unknown timezone 'default/America/
#> Los_Angeles'
con_us <- DBI::dbConnect(dbi_driver(),
project = "bigquery-public-data",
dataset = "utility_us",
billing = "profound-ranger-99202"
)
country_code_iso_us <- con_us %>% tbl("country_code_iso")
country_code_iso_eu <- con_us %>% tbl("SELECT * FROM utility_eu.country_code_iso")
#> Error: HTTP error [400] Error 400 (Bad Request)
#> Your client has issued a malformed or illegal request. That's all we know.
Digging into the code a bit, it seems to me that this comes from this strategy for getting table information: if the table being created is ad hoc, that API call will fail.
Maybe I'm being silly (I am not super familiar with this codebase), but for something that would work with ad-hoc tables, an alternative way to get the table info would be to run a dummy query job and inspect the resulting schema, as per the API reference, similar to what is done in the dbplyr package (cf. this).
For example, making the call to the BQ query API:
{
  "query": "SELECT series_id, year FROM `bigquery-public-data.bls.c_cpi_u` WHERE 1=0"
}
returns
{
"kind": "bigquery#queryResponse",
"schema": {
"fields": [
{
"name": "series_id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "year",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
},
"jobReference": {
"projectId": "profound-ranger-99202",
"jobId": "job_O_Y2MADZRJcEFRT-WTCFR_xUS4U"
},
"totalRows": "0",
"totalBytesProcessed": "135584",
"jobComplete": true,
"cacheHit": false
}
From which, one can extract the table metadata.
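As a concrete sketch of that extraction step (assuming `jsonlite` for parsing; the response here is pared down to the schema portion of the example above):

```r
# Sketch: pull column names and types out of a jobs.query response like
# the one shown above. resp_json holds the (trimmed) raw JSON string.
library(jsonlite)

resp_json <- '{"schema":{"fields":[
  {"name":"series_id","type":"STRING","mode":"NULLABLE"},
  {"name":"year","type":"INTEGER","mode":"NULLABLE"}]}}'

resp   <- fromJSON(resp_json, simplifyVector = FALSE)
fields <- vapply(resp$schema$fields, function(f) f$name, character(1))
types  <- vapply(resp$schema$fields, function(f) f$type, character(1))
setNames(types, fields)  # c(series_id = "STRING", year = "INTEGER")
```

Since the dummy query uses `WHERE 1=0`, no rows are materialized; only the schema comes back, which is exactly the metadata tbl needs.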
Here I use the public dataset provided by bigquery as an example.
library(bigrquery)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
mybilling <- readRDS("/Users/peishenwu/bqbilling.rds")
bigrquery::set_service_token("/Users/peishenwu/bqbilling.json")
con1 <- DBI::dbConnect(bigrquery::dbi_driver(),
dataset = "noaa_gsod",
project = "bigquery-public-data",
billing = mybilling,
use_legacy_sql = FALSE)
con2 <- DBI::dbConnect(bigrquery::dbi_driver(),
dataset = "samples",
project = "bigquery-public-data",
billing = mybilling,
use_legacy_sql = FALSE)
tbl1 <- tbl(con1, "gsod2008")
tbl2 <- tbl(con2, "gsod")
## show output
tbl1 %>% head
#> 623.0 megabytes processed
#> # Source: lazy query [?? x 32]
#> # Database: BigQueryConnection
#> stn wban year mo da temp count_temp dewp count_dewp slp
#> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl> <int> <dbl>
#> 1 983275 99999 2008 08 28 77.5 4 72.1 4 9999.9
#> 2 977800 99999 2008 08 13 67.9 4 59.5 4 9999.9
#> 3 977800 99999 2008 04 16 68.1 4 61.5 4 9999.9
#> 4 977900 99999 2008 01 12 83.2 4 9999.9 0 9999.9
#> 5 977600 99999 2008 03 18 83.3 4 77.3 4 9999.9
#> 6 977800 99999 2008 07 15 69.9 4 59.6 4 9999.9
#> # ... with 22 more variables: count_slp <int>, stp <dbl>, count_stp <int>,
#> # visib <dbl>, count_visib <int>, wdsp <chr>, count_wdsp <chr>,
#> # mxpsd <chr>, gust <dbl>, max <dbl>, flag_max <chr>, min <dbl>,
#> # flag_min <chr>, prcp <dbl>, flag_prcp <chr>, sndp <dbl>, fog <chr>,
#> # rain_drizzle <chr>, snow_ice_pellets <chr>, hail <chr>, thunder <chr>,
#> # tornado_funnel_cloud <chr>
tbl2 %>% head
#> 16.1 gigabytes processed
#> # Source: lazy query [?? x 31]
#> # Database: BigQueryConnection
#> station_number wban_number year month day mean_temp
#> <int> <int> <int> <int> <int> <dbl>
#> 1 915925 99999 2009 3 9 79.5
#> 2 944300 99999 1993 8 8 58.7
#> 3 61500 99999 1975 6 12 64.2
#> 4 247680 99999 1987 10 27 -12.3
#> 5 72450 99999 2009 1 13 37.1
#> 6 173520 99999 1998 6 10 77.8
#> # ... with 25 more variables: num_mean_temp_samples <int>,
#> # mean_dew_point <dbl>, num_mean_dew_point_samples <int>,
#> # mean_sealevel_pressure <dbl>,
#> # num_mean_sealevel_pressure_samples <int>, mean_station_pressure <dbl>,
#> # num_mean_station_pressure_samples <int>, mean_visibility <dbl>,
#> # num_mean_visibility_samples <int>, mean_wind_speed <dbl>,
#> # num_mean_wind_speed_samples <int>, max_sustained_wind_speed <dbl>,
#> # max_gust_wind_speed <dbl>, max_temperature <dbl>,
#> # max_temperature_explicit <lgl>, min_temperature <dbl>,
#> # min_temperature_explicit <lgl>, total_precipitation <dbl>,
#> # snow_depth <dbl>, fog <lgl>, rain <lgl>, snow <lgl>, hail <lgl>,
#> # thunder <lgl>, tornado <lgl>
##
tbl1 %>% left_join(tbl2, by = c("year" = "year",
"mo" = "month",
"da" = "day"))
#> Error: Not found: Table bigquery-public-data:noaa_gsod.gsod
tbl2 %>% left_join(tbl1, by = c("year" = "year",
"month" = "mo",
"day" = "da"))
#> Error: Not found: Table bigquery-public-data:samples.gsod2008
The joins should work; however, the generated query seems to assume that both the LHS and the RHS come from the LHS's dataset, which is not true in this case.
+1
This is still an issue, and is causing a problem for me as we have views and tables in different datasets (so permissions can be assigned at dataset level). Are there any workarounds I could patch in in the meantime?
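One possible stopgap, sketched here under the assumption that `bigrquery::query_exec()` is available in your version and with the table names adjusted to your project, is to write the cross-dataset join as a single standard-SQL query and run it directly, sidestepping the lazy tbl machinery entirely:

```r
# Sketch of a stopgap: run the cross-dataset join as one standard-SQL
# query via bigrquery, so dplyr never has to reconcile two srcs.
# `mybilling` (your billing project) and the join columns are assumptions
# based on the reprex above; gsod2008 stores year/mo/da as strings.
library(bigrquery)

res <- query_exec(
  "SELECT a.stn, a.temp, b.mean_temp
   FROM `bigquery-public-data.noaa_gsod.gsod2008` AS a
   JOIN `bigquery-public-data.samples.gsod` AS b
     ON CAST(a.year AS INT64) = b.year
    AND CAST(a.mo   AS INT64) = b.month
    AND CAST(a.da   AS INT64) = b.day
   LIMIT 10",
  project = mybilling,
  use_legacy_sql = FALSE
)
```

The result is an ordinary local data frame, so you lose laziness, but permissions stay at the dataset level and no table is copied.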
I'm closing this issue since it's actually two separate issues (linked above, with minimal reprexes).
The src_bigquery function in bigrquery takes a dataset argument, which specifies the dataset from which tables are accessed. While this is, of course, required in order to specify a table, there is no reason that joining or otherwise comparing tables across datasets in the same project should require copying one table into the other dataset.
I suggest that src_bigquery ought to allow the input of either a particular dataset or an entire project, in which case tbl could source from any table in any dataset in the project. Provided that the subsequent queries use fully-qualified table names (e.g. publicdata.natality as opposed to just natality), there should not be any problems. This is a current issue for me: I have a project which looks like:
If I want to access these using dplyr and join them, then I'd do something like this, which would throw an error:
Error: x and y don't share the same src. Set copy = TRUE to copy y into x's source (this may be time consuming).
However, this is not necessary, as queries may reference tables across datasets in the same project. I'd like to avoid the copying in this scenario.
That is (as in, say, MySQL), the following query would work just fine: