Closed e-kotov closed 3 months ago
Some evidence.
Here's some internal code:
b1 <- bench::mark(
  iterations = 5, check = FALSE,
  # 'trips' is pre-filtered with a custom SQL query using only the columns
  # (year, month, day) that we know are constructed from the hive-style partitioning
  hive_date = {
    dplyr::tbl(con, "trips") |>
      dplyr::distinct(full_date) |>
      dplyr::collect()
  },
  # this causes DuckDB to scan ALL csv.gz files in the folder, because it has
  # to match the desired dates against the full_date column inside every file
  full_date = {
    dplyr::tbl(con, "trips_view") |>
      dplyr::filter(full_date %in% dates) |>
      dplyr::distinct(full_date) |>
      dplyr::collect()
  }
)
bench:::plot.bench_mark(b1, type = "violin") + ggpubr::theme_pubclean(base_size = 24)
`trips` is a "view" of CSV files pre-filtered by dates. `trips_view` is the root folder with all the CSVs (and I only have about 20 files, not all 100+ files, so on the full set the scan would take much, much longer).
So a commit implementing this pre-filtering via hive-style file placement is coming up.
Done @e-kotov ?
Yes, solved with recent PRs.
By default CSV files are structured as follows:
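The example layout below is illustrative only (the actual file names in the package differ); the point is that all CSVs sit flat in one folder, with the date encoded only inside each file:

```
data/
  trips_a.csv.gz
  trips_b.csv.gz
  trips_c.csv.gz
```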
This way, both `{duckdb}` and `{arrow}` have to scan whole files for queries that involve a date filter, even though the data is already partitioned neatly into individual days. Therefore, it is better to download data into a hive-style structure like so:
This way, even though we already have a full ISO date field inside the CSVs, both `{duckdb}` and `{arrow}` will be able to filter much faster using the columns generated from the hive-style file structure. The year, month and day columns can be dropped if not needed, and they take practically no additional space anyway.

`{duckdb}` seems to support `hive_partitioning = true` for `read_csv`. `{arrow}` definitely supports hive-style layouts for connecting to CSV folders with `open_dataset()`.
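A minimal self-contained sketch of the `{duckdb}` side of this, assuming the `duckdb` package is installed; the tiny `trips` table and its columns are made up for illustration:

```r
library(duckdb)  # attaches DBI as well

# Build a tiny hive-style tree: data/year=2020/month=2/day=14/trips.csv
dir <- file.path(tempdir(), "data", "year=2020", "month=2", "day=14")
dir.create(dir, recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, full_date = "2020-02-14"),
          file.path(dir, "trips.csv"), row.names = FALSE)

con <- dbConnect(duckdb())
# hive_partitioning = true exposes year/month/day (parsed from the directory
# names) as columns, so the WHERE clause can prune directories instead of
# scanning every file for a matching full_date.
res <- dbGetQuery(con, sprintf(
  "SELECT * FROM read_csv('%s/data/**/*.csv', hive_partitioning = true)
   WHERE year = 2020 AND month = 2 AND day = 14",
  tempdir()
))
dbDisconnect(con, shutdown = TRUE)
```

On the `{arrow}` side, `arrow::open_dataset(file.path(tempdir(), "data"), format = "csv")` should detect the same hive-style partitions automatically.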