Closed bsaul closed 4 months ago
@IJBG - I just started to tinker on refactoring process_species_comments
based on my comments in #49. I thought we could tinker on it together -- or at least you should review (when the PR is ready) to see what changes were made.
@IJBG - can you paste a single species comment that is well-formed that we could use as a test case?
Cool! Yes, I would happy to tinker on it or review when it's ready. Here's some test species comments for the multiple test cases: Stops: 1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20 st 1=1, st 2=2, st 3=3, st 4=4, st 20=20 st 20 1,2,3,4,,,,,,,,,,,,,,,,20 1=1,2=2,3=3,4=4,20=20
psc_test_cases.csv ^csv with the testcases
In the code right now everything gets a label that accounts for what part of the for loop it was affected by, the sc_note column. This is a mess of a little function but it returns if any rows were not caught. You do have to run it before the end of the whole process_species_comments function as at the end sc_note gets removed.
psc_report_leftovers <- function(df) { return(sum(is.na(df$sc_note) == TRUE)) }
@IJBG - a thought I had while tinkering ... we're calling a lot of datasets mbbs
and say things like Any mbbs dataset, either the whole survey area or one county
. But what do we mean by mbbs dataset
? What is it's shape (e.g. required columns etc)? It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.
It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.
We don't need to do this as part of this PR, FWIW, but defining the stages of our processing step could help us be more organized in our code.
@IJBG - a thought I had while tinkering ... we're calling a lot of datasets
mbbs
and say things likeAny mbbs dataset, either the whole survey area or one county
. But what do we mean bymbbs dataset
? What is it's shape (e.g. required columns etc)? It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.
I think defining that would be good. Right now, 'mbbs dataset' refers to a post-processing dataset that's gone through inst/import_data.R. The key columns are mbbs_county, route_num, route_ID, common_name, and count. But there are plenty of functions that require other columns as well (eg. process_species_comments needs the species_comments column).
With the goal of having two clear end-user datasets at the route and stop level, there's now also the 2nd level of processing happening after inst/import_data.
So we've got:
pre-processing data: prev. website data, ebird, historical .xls, transcribed data(prev. website data by stop). This is all held in inst/extdata
1st processing: goes through inst/import_data, files held in data/
- contains all information we've got, both route-level and stop-level data. mbbs_routes
and mbbs_survey_events
don't fall into the definition used above of 'Any mbbs dataset, either the whole survey area or one county'). Would it be better to define these functions as requiring explicitly mbbs
mbbs_orange
mbbs_durham
or mbbs_chatham
?
2nd processing (datasets that where all data is either A. route level summaries or B. stop-level. columns like 'count_raw' are removed)
I've been tinkering with this but struggling with the concept of how to use case_whens to change columns s1:s20 based on positions 1:20 of split_list. Is using case_whens and mutate_across the right way to go about circumventing the main for-loop?
Right now I'm thinking to use something like... purrr:map(stopsmbbs, process_species_comments) - handle every row individually use a case_when to route, somehow, a s1:s20 mutate(across()) to individual functions based on the str_starts
What kind of route were you thinking of?
What kind of route were you thinking of?
Hm. I'm not sure yet. I hadn't gotten to that point.
I've been tinkering with this but struggling with the concept of how to use case_whens to change columns s1:s20 based on positions 1:20 of split_list. Is using case_whens and mutate_across the right way to go about circumventing the main for-loop?
Right now I'm thinking to use something like... purrr:map(stopsmbbs, process_species_comments) - handle every row individually use a case_when to route, somehow, a s1:s20 mutate(across()) to individual functions based on the str_starts
What kind of route were you thinking of?
Oh, I think I was thinking in terms of list
processing... start with a function that takes a single comment and creates a list
of the information for the 20 stops (this list should have a length of 20). And then run that function on each comment (i.e. each row). And then reshape the result from a list of lists to a data.frame
.
Currently, the logic for processing a single comment is (I think) mushed together with processing across rows. Since (again -- I think), we don't need information from one comment in order to process another comment, we can factor the code more clearly.
@IJBG - a thought I had while tinkering ... we're calling a lot of datasets
mbbs
and say things likeAny mbbs dataset, either the whole survey area or one county
. But what do we mean bymbbs dataset
? What is it's shape (e.g. required columns etc)? It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.I think defining that would be good.
You've got a good start. I'm going to open a separate issue to discuss further (See #60).
Okay, I've got a function convert_species_comments_to_list
that takes species_comment and count and returns a list. It works when used as follows:
tests <- read.csv("tests/testthat/process_species_comments_test_cases.csv", header = TRUE) %>% prepare_to_process() convert_species_comments_to_list(tests$species_comments[1], tests$count[1])
But has issues with piping eg:
tests[1,] %>% convert_species_comments_to_list(species_comments, count) Error in convert_species_comments_to_list(., species_comments, count) : unused argument (count)
(Unclear to me right now why piping isn't working, could be an error with my R session)
Normally I would work these errors out before pushing the commits, but I think that's the coding I've got in me today, so it's an issue for tomorrow :)
Re: needing to remove the pre-dawn owling checklists, I think I need to just track down the submission ID for the three checklists and set up a manual remove, rather than trying to filter them out based on having just three stops.
Okay, I've got a function
convert_species_comments_to_list
that takes species_comment and count and returns a list. It works when used as follows:
Let's start with something like this -- just the logic of handling a single comment string:
process_comment <- \(x){
assertthat::assert_that(
rlang::is_scalar_character(x)
)
x |>
stringr::str_split_1(",") |>
str_replace_all(
c( # Convert empty strings 0
"^$" = "0",
# Remove the Stops = / Stops:
"Stop(s)?( )?(=)?(:)?( )?" = "",
# If it starts with st, drop
"( )?st(op)?( )?" = ""
)
) |>
# additional step go here
(\(out) {
assertthat::assert_that(
length(out) == 20
# additional checks can go here
)
as.integer(out)
})()
}
process_comment(c(",,,,", ",,,"))
process_comment(",,,,,,,,")
process_comment(",,,,,,,,,,,,,,,,,,,")
process_comment("Stops: 1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20")
process_comment("st 1=1, st 2=2, st 3=3, st 4=4, st 20=20")
The process_comment
function might need to factored slightly differently, but once we have the process_comment
function (that handles all the different cases), processing all the comments will be reasonably straightforward.
Let's start with something like this -- just the logic of handling a single comment string:
As a starting point, in 018df17, I added a draft process_comment
function, plus some tests in test_prepare_stop_level_data.R
.
Thoughts on how to process the "stop 15" style species comments that depend on having count as well as species_comments
What about an if statement where if the string only has one number or two numbers (eg, count the characters that are 0:9), doesn't have an equal sign, and doesn't have a comma, then it gets routed to a different process_comment that includes count?
eg. str_count(species_comments, "[0-9]") < 2 str_detect(species_comments, ",|=") = FALSE
direct_to_sp_com_function now takes a single row of the dataframe and then returns that dataframe with s1:s20 filled in. However, I can't finangle it into working with purrr::map. @bsaul Do you have any idea what I'm doing wrong? Do I need to go back to having the direct_to_sp_com_function take, explicitly, species_comments and count and just return the list?
#can't get map to work...
purrr::map_dfr(
.x = stopsmbbs,
.f = direct_to_sp_com_function
)
#but this works great!
direct_to_sp_com_function(stopsmbbs[1,])
I'm really tempted to write a for loop to solve this problem, like:
#run first row
stopsfilled <- direct_to_species_function(stopsmbbs[1,]
for(i in 2:nrow(stopsmbbs) {
temp <- direct_to_species_function(stopsmbbs[i,])
stopsfilled <- rbind(stopsfilled, temp) }
Although I sense that's not the ideal solution here.
If direct_to_sp_com needs to go back to having explicitly species_comments and count and returning only the list...
direct_to_sp_com_function <- function(species_comments, count) {
#route species comments to the relevent function to process it.
if(str_starts(species_comments, ",|[0-9]+,")) {
stop_counts_list <-
sp_com_comma_seperated(species_comments, count)
} else if (str_starts(species_comments, "[0-9]+=[0-9]+|st(op)?( )?[0-9]+=[0-9]+")) {
stop_counts_list <-
sp_com_stop_equals_count(species_comments, count)
} else if (str_starts(species_comments, "st(op)?")) {
stop_counts_list <-
sp_com_only_stop(species_comments, count)
}
stop_counts_list #return
}
purrr::map_dfr( .x = stopsmbbs, .f = direct_to_sp_com_function )
I think you want to map
on the vector on the data.frame, as in:
stopmbbs |>
mutate(
newcol = purrr::map2(.x = comment, .y = count, ~ direct_to_sp_com_function(.x , .y))
(or some variation of this).
newcol
will be a list column which we can then unpack, but see if something like the above works.
Woo! Back from TA'ing the study abroad, and I think wrapped up the refactoring. I tried both before I left and for an hour today to use purrr:: but wasn't successful, and decided to write a very small for loop to solve the problem instead.
Moved where the errors (sum s1:s20 doesn't match original count) are caught (before: in the sp_com functions, now: in process_species_comments) so that there's a print() notification about how many errors, and they get stored in a .csv for follow-up, but the loop can run without tracking all the errors down beforehand. Right now there's 34 and I should be able to solve at least some of them.
I think with that, process_species_comments is done pending your review @bsaul ? If you think it looks good I'll finish up the documentation for the functions.
Looking at this now...
The test suite is failing for me: process_comments
not found. Has this function been renamed?
For tomorrow: -[ ] Address some of the errors spreadsheet, or after review open issue to resolve when working on the 2024 new-data update
To resolve comments (thank you Bradley!!):
To neaten script
Fixed the testing and finished up addressing all the comments from your review Bradley. I moved the functions to a new script 'process_species_comments.R' as this will keep 'prepare_stop_level_data' clean for building functions related to stop-level vs route-level data there. Also, puts species_comments in line with the historical_xls (also gets its own R script) and with the paper datasets that are being transcribed (will get their own processing R script as well). This pull request is no longer a WIP 🎉@bsaul
Fixed merge conflicts
@bsaul I think it's good to go and you reviewed it last week, but just wanted to check that you approve before I squash and merge the branch
@bsaul I think it's good to go and you reviewed it last week, but just wanted to check that you approve before I squash and merge the branch
Thanks. Just saw this. Looks good on a quick skim.
Closes #51