ENG-3640: Add CSV taxi dataset and benchmarks from vroom

alistaire47 commented 2 years ago

Closes #3 by adding the CSV taxi dataset used by vroom to known_datasets. Notes:

Downloading the 7zip archive is slow; expect it to take 30-60 minutes. Uses method = "wget" in download.file(), which requires wget to be installed, but its automatic recovery makes it more robust than curl here.
The download function will not redownload the archive if it already exists, but will still perform postprocessing if the unpacked directory does not exist. The archive is the real source of truth here and will not change; the postprocessed directory that arrow can read can be blown away at will.
Adds {archive} to Suggests: to unpack the 7zip archive.
Uses a brute-force read lines -> gsub() -> write lines approach to remove excess spaces in headers. This is quite slow compared to vroom's use of sed, but is cross-platform. Better alternatives welcome.
Gzips the CSVs for the canonical version. I figured they're still minimally edited, just smaller, so it's a nice default format; if we want an uncompressed version we can always write a copy to temp. Can remove this if we want.
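
For readers following along, the caching and postprocessing notes above can be sketched roughly as below. This is a minimal illustration in base R, not the code in this PR: `ensure_dataset()` and `clean_and_gzip_csv()` are hypothetical names, the download and postprocess steps are passed in as functions, and it assumes the excess spaces sit after the commas in the header line only.

```r
# Hypothetical helpers for illustration; the names and arguments are
# assumptions, not arrowbench's actual functions.

# Download only when the archive is missing; postprocess only when the
# unpacked directory is missing. `download` and `postprocess` are
# caller-supplied functions.
ensure_dataset <- function(archive_path, data_dir, download, postprocess) {
  if (!file.exists(archive_path)) download(archive_path)
  if (!dir.exists(data_dir)) postprocess(archive_path, data_dir)
  invisible(data_dir)
}

# Brute-force cleanup: read all lines, strip excess spaces from the header
# (assumed to be the first line), and write a gzipped copy.
clean_and_gzip_csv <- function(in_path, out_path = paste0(in_path, ".gz")) {
  lines <- readLines(in_path)
  if (length(lines) > 0) {
    lines[1] <- gsub(",\\s+", ",", trimws(lines[1]))
  }
  con <- gzfile(out_path, open = "wt")
  on.exit(close(con))
  writeLines(lines, con)
  invisible(out_path)
}

# Usage on a tiny temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("medallion, hack_license, vendor_id", "A,B,CMT"), tmp)
gz_path <- clean_and_gzip_csv(tmp)
con <- gzfile(gz_path, open = "rt")
cleaned <- readLines(con)
close(con)
print(cleaned[1])
```

Reading every line into memory is what makes this approach slow on files of this size, but it avoids any dependency on sed and so stays cross-platform.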

alistaire47 commented 2 years ago

What this looks like for context:

> ds
FileSystemDataset with 12 csv files
medallion: string
hack_license: string
vendor_id: string
pickup_datetime: timestamp[s]
payment_type: string
fare_amount: double
surcharge: double
mta_tax: double
tip_amount: double
tolls_amount: double
total_amount: double
> dim(ds)
[1] 173179759        11
> ds$files
 [1] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_1.csv.gz" 
 [2] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_10.csv.gz"
 [3] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_11.csv.gz"
 [4] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_12.csv.gz"
 [5] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_2.csv.gz" 
 [6] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_3.csv.gz" 
 [7] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_4.csv.gz" 
 [8] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_5.csv.gz" 
 [9] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_6.csv.gz" 
[10] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_7.csv.gz" 
[11] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_8.csv.gz" 
[12] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_9.csv.gz" 
> ds %>% head() %>% collect()
# A tibble: 6 × 11
  medallion           hack_license vendor_id pickup_datetime     payment_type fare_amount surcharge mta_tax
  <chr>               <chr>        <chr>     <dttm>              <chr>              <dbl>     <dbl>   <dbl>
1 89D227B655E5C82AEC… BA96DE419E7… CMT       2013-01-01 15:11:48 CSH                  6.5       0       0.5
2 0BD7C8F5BA12B88E0B… 9FD8F69F080… CMT       2013-01-06 00:18:35 CSH                  6         0.5     0.5
3 0BD7C8F5BA12B88E0B… 9FD8F69F080… CMT       2013-01-05 18:49:41 CSH                  5.5       1       0.5
4 DFD2202EE08F7A8DC9… 51EE87E3205… CMT       2013-01-07 23:54:15 CSH                  5         0.5     0.5
5 DFD2202EE08F7A8DC9… 51EE87E3205… CMT       2013-01-07 23:25:03 CSH                  9.5       0.5     0.5
6 20D9ECB2CA0767CF7A… 598CCE5B9C1… CMT       2013-01-07 15:27:48 CSH                  9.5       0       0.5
# … with 3 more variables: tip_amount <dbl>, tolls_amount <dbl>, total_amount <dbl>

$ ls -lh trip_fare.7z
-rw-r--r--  1 alistaire  staff   1.6G Jun 20  2014 trip_fare.7z
$ ls -lh taxi_2013
total 14429216
-rw-r--r--  1 alistaire  staff   617M May 13 12:20 trip_fare_1.csv.gz
-rw-r--r--  1 alistaire  staff   562M May 13 12:21 trip_fare_10.csv.gz
-rw-r--r--  1 alistaire  staff   534M May 13 12:22 trip_fare_11.csv.gz
-rw-r--r--  1 alistaire  staff   575M May 13 12:23 trip_fare_12.csv.gz
-rw-r--r--  1 alistaire  staff   614M May 13 12:24 trip_fare_2.csv.gz
-rw-r--r--  1 alistaire  staff   630M May 13 12:25 trip_fare_3.csv.gz
-rw-r--r--  1 alistaire  staff   656M May 13 12:26 trip_fare_4.csv.gz
-rw-r--r--  1 alistaire  staff   669M May 13 12:27 trip_fare_5.csv.gz
-rw-r--r--  1 alistaire  staff   583M May 13 12:28 trip_fare_6.csv.gz
-rw-r--r--  1 alistaire  staff   558M May 13 12:29 trip_fare_7.csv.gz
-rw-r--r--  1 alistaire  staff   452M May 13 12:30 trip_fare_8.csv.gz
-rw-r--r--  1 alistaire  staff   493M May 13 12:31 trip_fare_9.csv.gz

Each month is its own CSV. The data itself is very suspect: many months stop before the last day of the month, and it contains more taxi medallion numbers than there are people in New York.
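
One way to check the truncated-month observation is to compare the latest pickup timestamp in a month's data with the last calendar day of that month. The sketch below uses base R and made-up timestamps; `last_day_of_month()` and `covers_full_month()` are hypothetical helpers, not part of arrowbench, and the sample values are not taken from the dataset.

```r
# Hypothetical helpers for illustration only.

# Last calendar day of the month containing `date`.
last_day_of_month <- function(date) {
  first <- as.Date(format(date, "%Y-%m-01"))
  seq(first, by = "month", length.out = 2)[2] - 1
}

# TRUE when the latest timestamp falls on the last day of its month.
covers_full_month <- function(pickup_datetime) {
  latest <- as.Date(max(pickup_datetime))
  latest == last_day_of_month(latest)
}

# Made-up timestamps, not real rows from the dataset
full <- as.POSIXct(c("2013-01-01 15:11:48", "2013-01-31 23:59:00"), tz = "UTC")
partial <- as.POSIXct(c("2013-02-01 00:00:10", "2013-02-25 08:30:00"), tz = "UTC")
print(covers_full_month(full))
print(covers_full_month(partial))
```

On the real data, `pickup_datetime` would be the collected column for one month's file.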

alistaire47 commented 2 years ago

@jonkeane

> This looks good. We can totally punt this to another issue, but do the current CSV benchmarks that this will be added to match the vroom ones? If not that's totally ok, we can do a follow on for that.

Well...not really? This creates a dataset, which can be handled nicely by arrow::open_dataset(). {vroom}'s benchmarks are orchestrated in this makefile, which runs a benchmarking script on each file of the dataset (if I understand what's happening correctly). What it's doing is pretty similar to our file reading and writing benchmarks, but the two run differently enough that they are hard to compare.

Happy to make a story if you like, but not quite sure what we want to change. Let read/write benchmarks iterate over files in datasets? Add more dataset benchmarks that add comparative versions that iterate over files?
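
To make the "iterate over files" option concrete, a per-file timing loop could look roughly like this. It is a sketch under assumptions: `time_per_file()` is a hypothetical helper, not an arrowbench function, and the default reader is base R's read.csv so the example runs without extra packages; a reader such as arrow::read_csv_arrow could be passed instead.

```r
# Hypothetical helper: time a reader function on each file separately,
# roughly the per-file shape of vroom's benchmarks as described above.
time_per_file <- function(files, reader = utils::read.csv) {
  elapsed <- vapply(files, function(f) {
    system.time(reader(f))[["elapsed"]]
  }, numeric(1))
  data.frame(file = basename(files), seconds = unname(elapsed))
}

# Demo on three small temporary CSVs standing in for the monthly files
demo_dir <- tempfile()
dir.create(demo_dir)
for (i in 1:3) {
  write.csv(
    data.frame(x = seq_len(10) * i),
    file.path(demo_dir, sprintf("part_%d.csv", i)),
    row.names = FALSE
  )
}
files <- list.files(demo_dir, full.names = TRUE)
timings <- time_per_file(files)
print(timings)
```

A dataset-level benchmark would instead time a single call over the whole directory, which is why the two kinds of numbers are not directly comparable.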

jonkeane commented 2 years ago

Ah right, right. For a second I thought this was being added to known_sources, but it's actually known_datasets, so it's not folded in automatically to the reading bits.

So we will need to add a benchmark that actually exercises this dataset (both as a straight read like vroom does and possibly as a query against it).

alistaire47 commented 2 years ago

Added https://github.com/ursacomputing/arrowbench/issues/96 (and accompanying ENG-4815) to use this dataset in benchmarks.

voltrondata-labs / arrowbench

ENG-3640: Add CSV taxi dataset and benchmarks from vroom #93