ropensci / bikedata

:bike: Extract data from public hire bicycle systems
https://docs.ropensci.org/bikedata

check string/int in 1st column parsing #78

Closed: tbdv closed this issue 6 years ago

tbdv commented 6 years ago

Rough note to self, if nothing else.

On bikedata from CRAN I get:

Error in rcpp_import_to_trip_table(bikedb, flists$flist_csv, ci, header_file_name(),  : 
  basic_string::_M_construct null not valid

after running:

data_dir <- "/my_example_dir"
dl_bikedata (city = 'sf', data_dir = data_dir)
bikedb <- file.path (data_dir, 'sfdb')
store_bikedata (data_dir = data_dir, city = 'sf', bikedb = bikedb)

Inspecting the headers, it looks like this might be because some April and May CSVs start in the first column with a string where previously they were ints.

OTOH, the same import works fine at this commit

March 2018 Data

"duration_sec" "start_time" "end_time" "start_station_id" "start_station_name" "start_station_latitude" "start_station_longitude" "end_station_id" "end_station_name" "end_station_latitude" "end_station_longitude" "bike_id" "user_type" "member_birth_year" "member_gender" "bike_share_for_all_trip"
71766 "2018-03-31 16:58:33.1490" "2018-04-01 12:54:39.2630" 4 "Cyril Magnin St at Ellis St" 37.78588062694133 -122.4089150084319 6 "The Embarcadero at Sansome St" 37.80477 -122.403234 341 "Customer" 1964 "Female" "No"

April 2018 Data

"duration_sec" "start_time" "end_time" "start_station_id" "start_station_name" "start_station_latitude" "start_station_longitude" "end_station_id" "end_station_name" "end_station_latitude" "end_station_longitude" "bike_id" "user_type" "member_birth_year" "member_gender" "bike_share_for_all_trip"
72393 "2018-04-30 22:49:32.6180" "2018-05-01 18:56:06.3010" 4 "Cyril Magnin St at Ellis St" 37.78588062694133 -122.4089150084319 4 "Cyril Magnin St at Ellis St" 37.78588062694133 -122.4089150084319 3940 "Customer","","" "No"

May 2018 Data

"duration_sec" "start_time" "end_time" "start_station_id" "start_station_name" "start_station_latitude" "start_station_longitude" "end_station_id" "end_station_name" "end_station_latitude" "end_station_longitude" "bike_id" "user_type" "member_birth_year" "member_gender" "bike_share_for_all_trip"
56791 "2018-05-31 21:41:51.4750" "2018-06-01 13:28:22.7220" 44 "Civic Center/UN Plaza BART Station (Market St at McAllister St)" 37.7810737 -122.4117382 78 "Folsom St at 9th St" 37.7737172 -122.4116467 1230 "Customer","","" "No"

tbdv commented 6 years ago

The devil is in the "user_type" column starting in April, perhaps?

mpadge commented 6 years ago

Thanks for that, and yeah, you're absolutely right. The problem arises because the following column ("member_birth_year") is quoted when it's empty (so just ""), but unquoted when not (so 1974, not "1974"). Fix on its way ...
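
For anyone wanting to check other files for the same quirk, here is a rough R sketch. It is not code from bikedata. `quote_mask` is a hypothetical helper, the two rows are made-up and shortened to four columns (bike_id, user_type, member_birth_year, member_gender), and it assumes comma-delimited fields with no commas inside quotes.

```r
# Hypothetical helper (not part of bikedata): report which comma-separated
# fields of a single line are wrapped in double quotes.
# Assumes no commas occur inside quoted fields.
quote_mask <- function (line) {
    fields <- strsplit (line, ",", fixed = TRUE) [[1]]
    grepl ('^".*"$', fields)
}

# Made-up, shortened rows: bike_id, user_type, member_birth_year, member_gender
with_year <- '341,"Customer",1964,"Female"'
no_year <- '3940,"Customer","",""'

# Positions at which the quoting differs between the two rows
differing <- which (quote_mask (with_year) != quote_mask (no_year))
print (differing)
```

The third field (the birth year in this shortened layout) is the one that switches between unquoted and quoted, which is the inconsistency described above.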

mpadge commented 6 years ago

That commit simply forces the quotation structure of each file to be re-defined for every single line. This results in less efficient reading for SF. My timings show an increase in one sample from 3.8 to 4.5 seconds, so a bit under 20% increase in reading time. But we're still only talking a handful of seconds for all of SF, so that shouldn't be considered relevant.
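
To illustrate the idea only (the actual reading is done in the package's compiled code, which this does not reproduce), here is a minimal R sketch of deciding the quoting separately for every field of every line. `split_line` is a hypothetical name, the rows are made-up and shortened to four columns, and it again assumes no commas inside quoted fields.

```r
# Hypothetical sketch (not the bikedata implementation): strip quotes field by
# field on every line, instead of reusing a quoting layout inferred once from
# an earlier line.
split_line <- function (line) {
    fields <- strsplit (line, ",", fixed = TRUE) [[1]]
    # remove one leading and one trailing double quote where present
    gsub ('^"|"$', "", fields)
}

# Made-up, shortened rows: bike_id, user_type, member_birth_year, member_gender
rows <- c ('341,"Customer",1964,"Female"', '3940,"Customer","",""')
parsed <- do.call (rbind, lapply (rows, split_line))

# An empty birth year becomes NA rather than breaking the read
birth_year <- as.integer (parsed [, 3])
```

The extra pattern matching on every line is the kind of per-line work that costs some reading time, in exchange for not needing a special case for each data quirk.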

(It would of course be possible to write a custom function to avoid this, but that's precisely what the old (pre v0.2) version did and it was very difficult to keep track of all the custom routines for each city. The whole point of the latest version is to avoid the need for tailored routines for each data quirk.)

Thanks @tbdv for finding this bug!