Enable partition support in dplyr

grantmcdermott commented 5 years ago

Many BigQuery tables are date-partitioned, This allows for cheap querying from the BQ console, since users can specify the _PARTITIONTIME pseudo column to filter for a date or a range of dates. Here's an example from the Global Fishing Watch project:

SELECT
  SUM(fishing_hours) AS total_fishing_hours,
  flag
FROM
  [global-fishing-watch:global_footprint_of_fisheries.fishing_effort]
WHERE
  _PARTITIONTIME >= "2016-01-01 00:00:00"
  AND _PARTITIONTIME < "2017-01-01 00:00:00"
GROUP BY
  flag
ORDER BY
  total_fishing_hours DESC

Is it possible to support this kind of partitioning in in bigrquery / dplyr? I've tried to hack the translation using !! but my tidyeval skills failed me.

Thanks for considering.

grantmcdermott commented 5 years ago

Ah, I think I've figured a workaround using !! build_sql().

Assuming that you have already connected to the global-fishing-watch.global_footprint_of_fisheries dataset, the following dplyr code replicates the pure SQL query above:

effort <- tbl(con, "fishing_effort")

effort %>%
  filter(
    !!build_sql("_PARTITIONTIME") >= "2016-01-01 00:00:00",
    !!build_sql("_PARTITIONTIME") < "2017-01-01 00:00:00"
    ) %>%
  group_by(flag) %>%
  summarise(total_fishing_hours = sum(fishing_hours)) %>%
  arrange(desc(total_fishing_hours)) %>%
  collect()

It would still be nice if there were a "native" dplyr way to do this, but I'm happy to close this if you are.

hadley commented 5 years ago

You didn't provide a reprex, but I'm almost certain that this should work:

effort %>%
  filter(
    `_PARTITIONTIME` >= "2016-01-01 00:00:00",
    `_PARTITIONTIME` < "2017-01-01 00:00:00"
    ) %>%
  group_by(flag) %>%
  summarise(total_fishing_hours = sum(fishing_hours)) %>%
  arrange(desc(total_fishing_hours)) %>%
  collect()

r-dbi / bigrquery

Enable partition support in dplyr #299