Clarify data pipeline

> @IJBG - a thought I had while tinkering ... we're calling a lot of datasets `mbbs` and say things like `Any mbbs dataset, either the whole survey area or one county`. But what do we mean by `mbbs dataset`? What is its shape (e.g. required columns)? It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.

I think defining that would be good. Right now, 'mbbs dataset' refers to a post-processing dataset that has gone through inst/import_data.R. The key columns are mbbs_county, route_num, route_ID, common_name, and count. But plenty of functions require other columns as well (e.g. process_species_comments needs the species_comments column).
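One way to pin down that shape is a small column check that functions could call before doing any work. This is only a sketch: `mbbs_key_columns` and `check_mbbs_columns` are hypothetical names that do not exist in the package, and the toy data values are made up. The column names themselves are the ones listed above.

```r
# Hypothetical helper, not part of the mbbs package.
# Key columns named in this thread for an 'mbbs dataset'.
mbbs_key_columns <- c("mbbs_county", "route_num", "route_ID", "common_name", "count")

# Stop with an informative error if `df` lacks any required column.
# `extra` lets a function declare additional needs, e.g. "species_comments".
check_mbbs_columns <- function(df, extra = character(0)) {
  required <- c(mbbs_key_columns, extra)
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(
      "Not a valid mbbs dataset; missing columns: ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(df)
}

# Toy data for illustration only (made-up values).
toy <- data.frame(
  mbbs_county = "orange",
  route_num   = 1L,
  route_ID    = 101L,
  common_name = "Carolina Wren",
  count       = 3L
)

check_mbbs_columns(toy)                                # passes silently
# check_mbbs_columns(toy, extra = "species_comments")  # would raise an error
```

A function like process_species_comments could then pass `extra = "species_comments"` so its additional requirement is stated in one place rather than discovered at failure time.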

With the goal of having two clear end-user datasets, one at the route level and one at the stop level, there's now also a 2nd level of processing that happens after inst/import_data.

So we've got:

- Pre-processing data: previous website data, eBird, historical .xls files, and transcribed data (previous website data by stop). All of this is held in inst/extdata.
- 1st processing: data goes through inst/import_data, and the resulting files are held in data/. These contain all the information we've got, both route-level and stop-level. (mbbs_routes and mbbs_survey_events don't fall under the definition used above, 'Any mbbs dataset, either the whole survey area or one county'.) Would it be better to define these functions as explicitly requiring mbbs, mbbs_orange, mbbs_durham, or mbbs_chatham?
- 2nd processing: datasets where all data is either A. route-level summaries or B. stop-level. Columns like 'count_raw' are removed.
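To make the 2nd processing stage concrete, here is a rough sketch of how the two end-user shapes could be derived from a 1st-processing table. It rests on several assumptions: `year` and `stop_num` are guessed column names, `to_stop_level` and `to_route_level` are hypothetical helpers, the data values are made up, and "route-level summary" is taken to mean summing counts across stops, which may not match what the package actually does.

```r
# Hypothetical stop-level rows after 1st processing (made-up values).
# `year` and `stop_num` are assumed column names; `count_raw` is mentioned above.
stage1 <- data.frame(
  mbbs_county = "orange",
  route_num   = c(1L, 1L, 1L),
  route_ID    = 101L,
  year        = 2023L,
  stop_num    = c(1L, 2L, 2L),
  common_name = c("Carolina Wren", "Carolina Wren", "Northern Cardinal"),
  count       = c(2L, 1L, 4L),
  count_raw   = c("2", "1", "4")
)

# B. stop-level dataset: drop processing-only columns such as count_raw.
to_stop_level <- function(df) {
  df[, setdiff(names(df), "count_raw"), drop = FALSE]
}

# A. route-level summary: sum counts over stops within each
# county, route, year, and species (an assumed definition).
to_route_level <- function(df) {
  aggregate(
    count ~ mbbs_county + route_num + route_ID + year + common_name,
    data = df,
    FUN = sum
  )
}

stop_level  <- to_stop_level(stage1)
route_level <- to_route_level(stage1)
```

Keeping the two outputs as separate functions over the same input mirrors the idea that both end-user datasets come from one shared 1st-processing table.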

Originally posted by @IJBG in https://github.com/nc-minibbs/mbbs/issues/54#issuecomment-2120823709

Clarify data pipeline #60