Closed alistaire47 closed 2 years ago
What this looks like for context:
> ds
FileSystemDataset with 12 csv files
medallion: string
hack_license: string
vendor_id: string
pickup_datetime: timestamp[s]
payment_type: string
fare_amount: double
surcharge: double
mta_tax: double
tip_amount: double
tolls_amount: double
total_amount: double
> dim(ds)
[1] 173179759 11
> ds$files
[1] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_1.csv.gz"
[2] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_10.csv.gz"
[3] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_11.csv.gz"
[4] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_12.csv.gz"
[5] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_2.csv.gz"
[6] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_3.csv.gz"
[7] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_4.csv.gz"
[8] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_5.csv.gz"
[9] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_6.csv.gz"
[10] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_7.csv.gz"
[11] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_8.csv.gz"
[12] "/Users/alistaire/data/benchmarks/data/taxi_2013/trip_fare_9.csv.gz"
> ds %>% head() %>% collect()
# A tibble: 6 × 11
medallion hack_license vendor_id pickup_datetime payment_type fare_amount surcharge mta_tax
<chr> <chr> <chr> <dttm> <chr> <dbl> <dbl> <dbl>
1 89D227B655E5C82AEC… BA96DE419E7… CMT 2013-01-01 15:11:48 CSH 6.5 0 0.5
2 0BD7C8F5BA12B88E0B… 9FD8F69F080… CMT 2013-01-06 00:18:35 CSH 6 0.5 0.5
3 0BD7C8F5BA12B88E0B… 9FD8F69F080… CMT 2013-01-05 18:49:41 CSH 5.5 1 0.5
4 DFD2202EE08F7A8DC9… 51EE87E3205… CMT 2013-01-07 23:54:15 CSH 5 0.5 0.5
5 DFD2202EE08F7A8DC9… 51EE87E3205… CMT 2013-01-07 23:25:03 CSH 9.5 0.5 0.5
6 20D9ECB2CA0767CF7A… 598CCE5B9C1… CMT 2013-01-07 15:27:48 CSH 9.5 0 0.5
# … with 3 more variables: tip_amount <dbl>, tolls_amount <dbl>, total_amount <dbl>
(base) alistaire@snork ~/data/benchmarks/data ls -lh trip_fare.7z
-rw-r--r-- 1 alistaire staff 1.6G Jun 20 2014 trip_fare.7z
(base) alistaire@snork ~/data/benchmarks/data ls -lh taxi_2013
total 14429216
-rw-r--r-- 1 alistaire staff 617M May 13 12:20 trip_fare_1.csv.gz
-rw-r--r-- 1 alistaire staff 562M May 13 12:21 trip_fare_10.csv.gz
-rw-r--r-- 1 alistaire staff 534M May 13 12:22 trip_fare_11.csv.gz
-rw-r--r-- 1 alistaire staff 575M May 13 12:23 trip_fare_12.csv.gz
-rw-r--r-- 1 alistaire staff 614M May 13 12:24 trip_fare_2.csv.gz
-rw-r--r-- 1 alistaire staff 630M May 13 12:25 trip_fare_3.csv.gz
-rw-r--r-- 1 alistaire staff 656M May 13 12:26 trip_fare_4.csv.gz
-rw-r--r-- 1 alistaire staff 669M May 13 12:27 trip_fare_5.csv.gz
-rw-r--r-- 1 alistaire staff 583M May 13 12:28 trip_fare_6.csv.gz
-rw-r--r-- 1 alistaire staff 558M May 13 12:29 trip_fare_7.csv.gz
-rw-r--r-- 1 alistaire staff 452M May 13 12:30 trip_fare_8.csv.gz
-rw-r--r-- 1 alistaire staff 493M May 13 12:31 trip_fare_9.csv.gz
Every month is a CSV. The actual data is very suspect; many months stop before the last day of the month, and it contains more taxi medallion numbers than there are people in New York.
@jonkeane
This looks good. We can totally punt this to another issue, but do the current CSV benchmarks that this will be added to match the vroom ones? If not that's totally ok, we can do a follow on for that.
Well...not really? This creates a dataset, as can be handled nicely by arrow::open_dataset(). {vroom}'s benchmarks are orchestrated in this makefile, which runs a benchmarking script on each file of the dataset (if I understand what's happening correctly). What it's doing is pretty similar to our file reading and writing benchmarks, but things are running differently enough that it's hard to compare.
Happy to make a story if you like, but not quite sure what we want to change. Let read/write benchmarks iterate over files in datasets? Add more dataset benchmarks that add comparative versions that iterate over files?
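For concreteness, here is a rough sketch of the two styles being contrasted: opening the whole directory as one dataset versus reading each file on its own, which is roughly what the vroom benchmarks do. The temp directory and the tiny CSVs are hypothetical stand-ins for the taxi data, not anything in arrowbench, and the sketch assumes the arrow package is installed.

```r
library(arrow)

# Hypothetical stand-in for the taxi data: a temp directory with two small CSVs.
dir <- tempfile("taxi_demo_")
dir.create(dir)
for (i in 1:2) {
  df <- data.frame(vendor_id = c("CMT", "VTS"), fare_amount = c(6.5, 9.5) * i)
  write.csv(df, file.path(dir, paste0("trip_fare_", i, ".csv")), row.names = FALSE)
}

# Dataset style: one object spanning every file in the directory.
ds <- open_dataset(dir, format = "csv")
n_dataset <- nrow(ds)

# Per-file style: read each file separately and add up the rows.
files <- list.files(dir, full.names = TRUE)
n_per_file <- sum(vapply(files, function(f) nrow(read_csv_arrow(f)), integer(1)))
```

Both approaches see the same rows, but the work is organized differently, which is why timings from one are hard to line up against the other.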
Ah right right, for a second I thought this was being added to known_sources but it's actually known_datasets so now folded in automatically to the reading bits.
So we will need to add a benchmark that actually exercises this dataset (both as a straight read like vroom does and possibly as a query against it).
Added https://github.com/ursacomputing/arrowbench/issues/96 (and accompanying ENG-4815) to go use this dataset in benchmarks
Closes #3 by adding the CSV taxi dataset used by vroom to known_datasets. Notes:
- Uses method = "wget" in download.file(), which does require wget to be installed, but its autorecovery makes it more robust than curl here.
- Adds a package to Suggests: to unpack the 7zip archive.
- Uses a gsub() -> write lines approach to remove excess spaces in headers. This is quite slow compared to vroom's use of sed, but is cross-platform. Better alternatives welcome.
- temp. Can remove this if we want.
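A minimal base-R sketch of the gsub() -> write lines idea from the notes above, assuming the excess spaces sit around the commas of the header row. The function name clean_header() and the sample file are hypothetical and are not the code in this PR.

```r
# Hypothetical helper illustrating the approach; not the PR's actual code.
clean_header <- function(path) {
  # Reads the whole file into memory, which is part of why this is
  # slower than a streaming tool like sed.
  lines <- readLines(path)
  # Strip whitespace around commas and at the ends of the header line only.
  lines[1] <- gsub("\\s*,\\s*", ",", trimws(lines[1]))
  writeLines(lines, path)
  invisible(path)
}

# Small demonstration on a throwaway file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("medallion, hack_license, vendor_id", "A1,B2,CMT"), tmp)
clean_header(tmp)
readLines(tmp, n = 1)
```

For the gzipped files, the same reads and writes could go through gzfile() connections instead of plain paths, at the cost of recompressing each file.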