voltrondata-labs / arrowbench

R package for benchmarking

ENG-3640: Add CSV taxi dataset and benchmarks from vroom #93

Closed alistaire47 closed 2 years ago

alistaire47 commented 2 years ago

Closes #3 by adding the CSV taxi dataset used by vroom to known_datasets. Notes:

alistaire47 commented 2 years ago

What this looks like for context:

> ds
FileSystemDataset with 12 csv files
medallion: string
hack_license: string
vendor_id: string
pickup_datetime: timestamp[s]
payment_type: string
fare_amount: double
surcharge: double
mta_tax: double
tip_amount: double
tolls_amount: double
total_amount: double
> dim(ds)
[1] 173179759        11
> ds$files
 [1] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_1.csv.gz" 
 [2] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_10.csv.gz"
 [3] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_11.csv.gz"
 [4] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_12.csv.gz"
 [5] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_2.csv.gz" 
 [6] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_3.csv.gz" 
 [7] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_4.csv.gz" 
 [8] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_5.csv.gz" 
 [9] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_6.csv.gz" 
[10] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_7.csv.gz" 
[11] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_8.csv.gz" 
[12] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_9.csv.gz" 
> ds %>% head() %>% collect()
# A tibble: 6 × 11
  medallion           hack_license vendor_id pickup_datetime     payment_type fare_amount surcharge mta_tax
  <chr>               <chr>        <chr>     <dttm>              <chr>              <dbl>     <dbl>   <dbl>
1 89D227B655E5C82AEC… BA96DE419E7… CMT       2013-01-01 15:11:48 CSH                  6.5       0       0.5
2 0BD7C8F5BA12B88E0B… 9FD8F69F080… CMT       2013-01-06 00:18:35 CSH                  6         0.5     0.5
3 0BD7C8F5BA12B88E0B… 9FD8F69F080… CMT       2013-01-05 18:49:41 CSH                  5.5       1       0.5
4 DFD2202EE08F7A8DC9… 51EE87E3205… CMT       2013-01-07 23:54:15 CSH                  5         0.5     0.5
5 DFD2202EE08F7A8DC9… 51EE87E3205… CMT       2013-01-07 23:25:03 CSH                  9.5       0.5     0.5
6 20D9ECB2CA0767CF7A… 598CCE5B9C1… CMT       2013-01-07 15:27:48 CSH                  9.5       0       0.5
# … with 3 more variables: tip_amount <dbl>, tolls_amount <dbl>, total_amount <dbl>
~/data/benchmarks/data $ ls -lh trip_fare.7z
-rw-r--r--  1 alistaire  staff   1.6G Jun 20  2014 trip_fare.7z
~/data/benchmarks/data $ ls -lh taxi_2013
total 14429216
-rw-r--r--  1 alistaire  staff   617M May 13 12:20 trip_fare_1.csv.gz
-rw-r--r--  1 alistaire  staff   562M May 13 12:21 trip_fare_10.csv.gz
-rw-r--r--  1 alistaire  staff   534M May 13 12:22 trip_fare_11.csv.gz
-rw-r--r--  1 alistaire  staff   575M May 13 12:23 trip_fare_12.csv.gz
-rw-r--r--  1 alistaire  staff   614M May 13 12:24 trip_fare_2.csv.gz
-rw-r--r--  1 alistaire  staff   630M May 13 12:25 trip_fare_3.csv.gz
-rw-r--r--  1 alistaire  staff   656M May 13 12:26 trip_fare_4.csv.gz
-rw-r--r--  1 alistaire  staff   669M May 13 12:27 trip_fare_5.csv.gz
-rw-r--r--  1 alistaire  staff   583M May 13 12:28 trip_fare_6.csv.gz
-rw-r--r--  1 alistaire  staff   558M May 13 12:29 trip_fare_7.csv.gz
-rw-r--r--  1 alistaire  staff   452M May 13 12:30 trip_fare_8.csv.gz
-rw-r--r--  1 alistaire  staff   493M May 13 12:31 trip_fare_9.csv.gz

Each month is one gzipped CSV file. The data itself is very suspect: many months stop before the last day of the month, and it contains more taxi medallion numbers than there are people in New York.

alistaire47 commented 2 years ago

@jonkeane

> This looks good. We can totally punt this to another issue, but do the current CSV benchmarks that this will be added to match the vroom ones? If not that's totally ok, we can do a follow on for that.

Well... not really? This creates a dataset that arrow::open_dataset() handles nicely. {vroom}'s benchmarks are orchestrated in this makefile, which runs a benchmarking script on each file of the dataset (if I understand what's happening correctly). That is pretty similar to our file reading and writing benchmarks, but the two run differently enough that the results are hard to compare.

Happy to make a story if you like, but I'm not quite sure what we want to change. Let read/write benchmarks iterate over files in datasets? Add more dataset benchmarks with comparative versions that iterate over files?
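To make the difference concrete, here is a minimal sketch of the two access patterns. It is not arrowbench or vroom code: `read_per_file()` and `read_as_dataset()` are hypothetical helper names, the directory is assumed to be wherever the monthly CSV files live, and timing with `system.time()` is only a stand-in for a real benchmark harness.

```r
library(arrow)

# Hypothetical helper: read each CSV on its own and time it, roughly the
# per-file shape of the vroom benchmarks as described above.
read_per_file <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv(\\.gz)?$", full.names = TRUE)
  vapply(files, function(f) {
    system.time(read_csv_arrow(f))[["elapsed"]]
  }, numeric(1))
}

# Hypothetical helper: treat the whole directory as one dataset and time
# opening it plus counting its rows, which has to go through every file.
read_as_dataset <- function(dir) {
  system.time({
    ds <- open_dataset(dir, format = "csv")
    n <- nrow(ds)
  })[["elapsed"]]
}
```

The first helper returns one elapsed time per file and the second returns a single number for the whole directory, which is part of why the two kinds of results do not line up directly.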

jonkeane commented 2 years ago

Ah right right, for a second I thought this was being added to known_sources, but it's actually known_datasets, so it's not folded in automatically to the reading bits.

So we will need to add a benchmark that actually exercises this dataset (both as a straight read like vroom does and possibly as a query against it).
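As a rough illustration of those two cases, a sketch along these lines could time a straight read and a simple query. This does not use arrowbench's benchmark definitions: `time_dataset()` is a hypothetical helper, the `"CSH"` filter and the mean of `fare_amount` are arbitrary choices based on the columns shown earlier in this thread, and it assumes the arrow and dplyr packages are installed.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: time (1) reading the whole CSV dataset into memory
# and (2) a filtered aggregate evaluated against the dataset.
time_dataset <- function(dir) {
  ds <- open_dataset(dir, format = "csv")
  c(
    # Straight read: materialise every row as a data frame.
    read = system.time(collect(ds))[["elapsed"]],
    # Query: only the aggregated result is brought into R.
    query = system.time(
      ds %>%
        filter(payment_type == "CSH") %>%
        summarise(mean_fare = mean(fare_amount)) %>%
        collect()
    )[["elapsed"]]
  )
}
```

On the full dataset (about 173 million rows) the straight read needs enough memory to hold everything at once, so trying this on a single month first is probably wise.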

alistaire47 commented 2 years ago

Added https://github.com/ursacomputing/arrowbench/issues/96 (and accompanying ENG-4815) to use this dataset in benchmarks.