rOpenSpain / spanishoddata

Access national high-quality and open-access datasets on movement patterns derived from mobile telephone datasets
https://ropenspain.github.io/spanishoddata/
Other
24 stars 0 forks source link

Note: restructure downloaded CSVs into hive style for better performance #13

Closed e-kotov closed 3 months ago

e-kotov commented 3 months ago

By default CSV files are structured as follows:

estudios_basicos/por-distritos/viajes/ficheros-diarios/2022-01/20220101_Viajes_distritos.csv.gz
estudios_basicos/por-distritos/viajes/ficheros-diarios/2022-01/20220102_Viajes_distritos.csv.gz

This way, both {duckdb} and {arrow} will have to scan whole files for queries that involve a date filter, even though the data is already partitioned nicely into individual days.

Therefore, it is better to download data into a hive-style structure like so:

estudios_basicos/por-distritos/viajes/ficheros-diarios/year=2022/month=01/day=01/Viajes_distritos.csv.gz
estudios_basicos/por-distritos/viajes/ficheros-diarios/year=2022/month=01/day=02/Viajes_distritos.csv.gz

This way, even though we already have a full ISO date field inside the CSVs, both {duckdb} and {arrow} will be able to filter much faster using the columns generated from the hive-style file structure. The year, month and day columns can be dropped if not needed and they take literally no additional space anyway.

{duckdb} seems to support hive_partitioning = true for read_csv.

{arrow} definitely supports hive-style for connecting to CSV folders with open_dataset().

e-kotov commented 3 months ago

Some evidence.

Here's some internal code:

b1 <- bench::mark(iterations = 5, check = FALSE,
    hive_date = {dplyr::tbl(con, "trips") |>
      dplyr::distinct(full_date) |> 
      dplyr::collect()}, # this is prefiltered using custom SQL query using only the columns (year, month, day) that we know are constructed from the hive style partitioning

    full_date = {dplyr::tbl(con, "trips_view") |>
      dplyr::filter(full_date %in% dates) |>
      dplyr::distinct(full_date) |>
      dplyr::collect()} # this is causing DuckDB to scan ALL csv.gz files in the folder because it has to match the desired dates with full_date column
  )

  bench:::plot.bench_mark(b1, type = "violin") + ggpubr::theme_pubclean(base_size = 24)

image

So a commit to implement this pre-filtering with hive style files placement is coming up.

Robinlovelace commented 3 months ago

Done @e-kotov ?

e-kotov commented 3 months ago

Yes, solved with recent PRs.