ropensci / bikedata

:bike: Extract data from public hire bicycle systems
https://docs.ropensci.org/bikedata

specify data structures via json header #63

Closed tbuckl closed 6 years ago

tbuckl commented 6 years ago

it would be easier for users to contribute text specifications for headers and column types than edit code.

when i am trying to add data for the bay area, i find that the meat of string processing on the data takes place in read-city-files.cpp

as an R user who knows nothing about C++, i've been relying on the data structure (CSV headers and types) in order to understand the C++.

this method has helped me understand what the different delimiter lines do in each function, for example for boston post-2018.

below, i've printed out the headers of each file in a sample downloaded with dl_bikedata for one city and put them into a json file:

https://gist.github.com/tibbl35/81b618ba37e806fb1d93e44f32b25652

this was a manual process that could probably be improved easily, potentially through standard R, or just through manual editing in github.

i guess the main goal here is a separation of concerns: (1) describing the data types (headers, etc) and (2) writing the script that makes that data consistent.
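A minimal sketch of that separation, written in Python for brevity (the package itself does its parsing in C++). The spec layout, the field names, and the helpers `load_spec` and `parse_row` are all hypothetical; they are not taken from the gist above or from bikedata.

```python
import json

# Hypothetical header specification (concern 1: describing the data).
# The keys "delimiter", "columns", "name" and "type" are assumptions.
SPEC_JSON = """
{
  "examplecity": {
    "delimiter": ",",
    "columns": [
      {"name": "duration", "type": "integer"},
      {"name": "start_time", "type": "string"},
      {"name": "start_station_id", "type": "integer"}
    ]
  }
}
"""

# Map the type names used in the spec to Python conversion functions.
CASTS = {"integer": int, "double": float, "string": str}


def load_spec(text, city):
    """Return the header specification for one city."""
    return json.loads(text)[city]


def parse_row(line, spec):
    """Generic reader (concern 2): split one line and cast each field per the spec."""
    values = line.rstrip("\r\n").split(spec["delimiter"])
    if len(values) != len(spec["columns"]):
        raise ValueError("line does not match the number of columns in the spec")
    return {
        col["name"]: CASTS[col["type"]](value)
        for col, value in zip(spec["columns"], values)
    }
```

With a layout like this, adding a city would mean adding an entry to the json file while the reader stays unchanged. Note that a plain split does not handle quoted fields.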

@mpadge let me know if there's an advantage to this.

i can see how it might not really be necessary for the scope of data that are available. it could be that just 1 or 2 people editing some C++ works for the scope of ~50-100 bike data providers? not something i know a lot about.

tbuckl commented 6 years ago

@mpadge i noticed that you thumbs-upped this while i was making a bunch of edits on it! :)

mpadge commented 6 years ago

Using a json header is a really good idea, and I will definitely think about how that might be incorporated because it would make things a lot easier. The single big obstacle to just doing that straight off is the different ways that data fields are quoted. The messiest example is the current read_one_line_boston function, which uses three different ways to quote/not quote the fields. But an additional binary quoted flag could also just be incorporated within the json structure, and the data parsed accordingly. Some tricks will be needed because there is no pattern for some cities and the data first need to be inspected to determine the format, but hopefully I'll think of a clever way around that.
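To illustrate the quoted-flag idea, here is a rough sketch in Python (not the C++ used in the package). The `quoted` key and the `parse_line` helper are hypothetical names, and the sketch assumes a single-character delimiter and no escaped quotes inside quoted fields.

```python
def parse_line(line, columns, delimiter=","):
    """Split one line using a per-column 'quoted' flag (hypothetical spec key)."""
    line = line.rstrip("\r\n")
    pos = 0
    row = {}
    for i, col in enumerate(columns):
        if col["quoted"]:
            # A quoted field must start with a quote and runs to the next quote,
            # so delimiters inside the quotes are kept as data.
            if pos >= len(line) or line[pos] != '"':
                raise ValueError("expected opening quote for " + col["name"])
            end = line.find('"', pos + 1)
            if end == -1:
                raise ValueError("missing closing quote for " + col["name"])
            value = line[pos + 1:end]
            pos = end + 1
        else:
            # An unquoted field runs to the next delimiter or the end of the line.
            end = line.find(delimiter, pos)
            if end == -1:
                end = len(line)
            value = line[pos:end]
            pos = end
        row[col["name"]] = value
        # Consume the delimiter between fields (none after the last one).
        if i < len(columns) - 1:
            if line[pos:pos + 1] != delimiter:
                raise ValueError("expected delimiter after " + col["name"])
            pos += 1
    if pos != len(line):
        raise ValueError("unexpected trailing characters")
    return row
```

For example, with columns flagged unquoted, quoted, unquoted, the line `620,"Main St, North",B123` yields three fields, and the comma inside the station name is preserved.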

There is definitely a need to clean up the entire src/read-city-files.cpp structure, and you are right that a user-controllable .json file would make it much easier to incorporate other cities by just specifying the structure in that file and ignoring the C++. I'll rename the issue accordingly, and we can get working on it.

mpadge commented 6 years ago

This is currently being developed in a new data-headers branch. It works okay, but still requires manual mapping of files to structures, currently done in src/sqlite3db-add-data/get_file_headers.

Next task: Replace this manual specification with some clever C++ parsing, by just stepping through the sequences of commas (,) and double quotes (").
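One way such stepping might look, sketched in Python rather than C++ and not taken from the branch: walk a sample line and record, field by field, whether it is wrapped in double quotes. The function name `detect_quoting` is hypothetical, and the sketch assumes a single-character delimiter and no escaped quotes.

```python
def detect_quoting(line, delimiter=","):
    """Return one boolean per field: True if the field is double-quoted."""
    line = line.rstrip("\r\n")
    n = len(line)
    flags = []
    pos = 0
    while True:
        if pos < n and line[pos] == '"':
            # Quoted field: skip ahead to the closing quote.
            end = line.find('"', pos + 1)
            if end == -1:
                raise ValueError("unterminated quoted field")
            flags.append(True)
            pos = end + 1
        else:
            # Unquoted field: skip ahead to the next delimiter or end of line.
            end = line.find(delimiter, pos)
            if end == -1:
                end = n
            flags.append(False)
            pos = end
        if pos >= n:
            break
        if line[pos] != delimiter:
            raise ValueError("expected delimiter at position " + str(pos))
        pos += 1  # step over the delimiter
    return flags
```

Running this on the first line or two of a file would give a quoting pattern that could be compared against the known structures, in place of a manual mapping.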

mpadge commented 6 years ago

Now done via commits leading up to this one. It should now all work, so closing. Can re-open if necessary when addressing #61, #62