nc-minibbs / mbbs

A repository for the Mini-Bird Breeding Survey data
https://minibbs.us
Other
2 stars 0 forks source link

WIP: refactor process_species_comments function #54

Closed bsaul closed 4 months ago

bsaul commented 6 months ago

Closes #51

bsaul commented 6 months ago

@IJBG - I just started to tinker on refactoring process_species_comments based on my comments in #49. I thought we could tinker on it together -- or at least you should review (when the PR is ready) to see what changes were made.

bsaul commented 6 months ago

@IJBG - can you paste a single species comment that is well-formed that we could use as a test case?

IJBG commented 6 months ago

Cool! Yes, I would happy to tinker on it or review when it's ready. Here's some test species comments for the multiple test cases: Stops: 1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20 st 1=1, st 2=2, st 3=3, st 4=4, st 20=20 st 20 1,2,3,4,,,,,,,,,,,,,,,,20 1=1,2=2,3=3,4=4,20=20

IJBG commented 6 months ago

psc_test_cases.csv ^csv with the testcases

In the code right now everything gets a label that accounts for what part of the for loop it was affected by, the sc_note column. This is a mess of a little function but it returns if any rows were not caught. You do have to run it before the end of the whole process_species_comments function as at the end sc_note gets removed.

' Function to return if any rows of stopsmbbs have not been accounted for. Used for testing

psc_report_leftovers <- function(df) { return(sum(is.na(df$sc_note) == TRUE)) }

bsaul commented 6 months ago

@IJBG - a thought I had while tinkering ... we're calling a lot of datasets mbbs and say things like Any mbbs dataset, either the whole survey area or one county. But what do we mean by mbbs dataset? What is it's shape (e.g. required columns etc)? It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.

bsaul commented 6 months ago

It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.

We don't need to do this as part of this PR, FWIW, but defining the stages of our processing step could help us be more organized in our code.

IJBG commented 6 months ago

@IJBG - a thought I had while tinkering ... we're calling a lot of datasets mbbs and say things like Any mbbs dataset, either the whole survey area or one county. But what do we mean by mbbs dataset? What is it's shape (e.g. required columns etc)? It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.

I think defining that would be good. Right now, 'mbbs dataset' refers to a post-processing dataset that's gone through inst/import_data.R. The key columns are mbbs_county, route_num, route_ID, common_name, and count. But there are plenty of functions that require other columns as well (eg. process_species_comments needs the species_comments column).

With the goal of having two clear end-user datasets at the route and stop level, there's now also the 2nd level of processing happening after inst/import_data.

So we've got:

IJBG commented 6 months ago

I've been tinkering with this but struggling with the concept of how to use case_whens to change columns s1:s20 based on positions 1:20 of split_list. Is using case_whens and mutate_across the right way to go about circumventing the main for-loop?

Right now I'm thinking to use something like... purrr:map(stopsmbbs, process_species_comments) - handle every row individually use a case_when to route, somehow, a s1:s20 mutate(across()) to individual functions based on the str_starts

What kind of route were you thinking of?

bsaul commented 6 months ago

What kind of route were you thinking of?

Hm. I'm not sure yet. I hadn't gotten to that point.

bsaul commented 6 months ago

I've been tinkering with this but struggling with the concept of how to use case_whens to change columns s1:s20 based on positions 1:20 of split_list. Is using case_whens and mutate_across the right way to go about circumventing the main for-loop?

Right now I'm thinking to use something like... purrr:map(stopsmbbs, process_species_comments) - handle every row individually use a case_when to route, somehow, a s1:s20 mutate(across()) to individual functions based on the str_starts

What kind of route were you thinking of?

Oh, I think I was thinking in terms of list processing... start with a function that takes a single comment and creates a list of the information for the 20 stops (this list should have a length of 20). And then run that function on each comment (i.e. each row). And then reshape the result from a list of lists to a data.frame.

bsaul commented 6 months ago

Currently, the logic for processing a single comment is (I think) mushed together with processing across rows. Since (again -- I think), we don't need information from one comment in order to process another comment, we can factor the code more clearly.

bsaul commented 6 months ago

@IJBG - a thought I had while tinkering ... we're calling a lot of datasets mbbs and say things like Any mbbs dataset, either the whole survey area or one county. But what do we mean by mbbs dataset? What is it's shape (e.g. required columns etc)? It might be worth defining the various stages of processing and the shape of the data going into and out of each stage.

I think defining that would be good.

You've got a good start. I'm going to open a separate issue to discuss further (See #60).

IJBG commented 6 months ago

Okay, I've got a function convert_species_comments_to_list that takes species_comment and count and returns a list. It works when used as follows:

tests <- read.csv("tests/testthat/process_species_comments_test_cases.csv", header = TRUE) %>% prepare_to_process() convert_species_comments_to_list(tests$species_comments[1], tests$count[1])

But has issues with piping eg: tests[1,] %>% convert_species_comments_to_list(species_comments, count) Error in convert_species_comments_to_list(., species_comments, count) : unused argument (count) (Unclear to me right now why piping isn't working, could be an error with my R session)

Normally I would work these errors out before pushing the commits, but I think that's the coding I've got in me today, so it's an issue for tomorrow :)

Re: needing to remove the pre-dawn owling checklists, I think I need to just track down the submission ID for the three checklists and set up a manual remove, rather than trying to filter them out based on having just three stops.

bsaul commented 6 months ago

Okay, I've got a function convert_species_comments_to_list that takes species_comment and count and returns a list. It works when used as follows:

Let's start with something like this -- just the logic of handling a single comment string:

process_comment <- \(x){

  assertthat::assert_that(
    rlang::is_scalar_character(x)
  )

  x |>
    stringr::str_split_1(",") |>
    str_replace_all(
      c( # Convert empty strings 0
        "^$" = "0",
        # Remove the Stops = / Stops:
        "Stop(s)?( )?(=)?(:)?( )?" = "",
        # If it starts with st, drop
        "( )?st(op)?( )?" = ""
        )
    ) |>
   # additional step go here
    (\(out) {
      assertthat::assert_that(
        length(out) == 20
        # additional checks can go here
      )
      as.integer(out)
    })()
}

process_comment(c(",,,,", ",,,"))
process_comment(",,,,,,,,")
process_comment(",,,,,,,,,,,,,,,,,,,")
process_comment("Stops: 1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20")
process_comment("st 1=1, st 2=2, st 3=3, st 4=4, st 20=20")

The process_comment function might need to factored slightly differently, but once we have the process_comment function (that handles all the different cases), processing all the comments will be reasonably straightforward.

bsaul commented 6 months ago

Let's start with something like this -- just the logic of handling a single comment string:

As a starting point, in 018df17, I added a draft process_comment function, plus some tests in test_prepare_stop_level_data.R.

IJBG commented 5 months ago

Thoughts on how to process the "stop 15" style species comments that depend on having count as well as species_comments

What about an if statement where if the string only has one number or two numbers (eg, count the characters that are 0:9), doesn't have an equal sign, and doesn't have a comma, then it gets routed to a different process_comment that includes count?

eg. str_count(species_comments, "[0-9]") < 2 str_detect(species_comments, ",|=") = FALSE

IJBG commented 5 months ago

direct_to_sp_com_function now takes a single row of the dataframe and then returns that dataframe with s1:s20 filled in. However, I can't finangle it into working with purrr::map. @bsaul Do you have any idea what I'm doing wrong? Do I need to go back to having the direct_to_sp_com_function take, explicitly, species_comments and count and just return the list?

#can't get map to work...
  purrr::map_dfr(
    .x = stopsmbbs,
    .f = direct_to_sp_com_function
  )
  #but this works great!
  direct_to_sp_com_function(stopsmbbs[1,])

I'm really tempted to write a for loop to solve this problem, like:

#run first row
stopsfilled <- direct_to_species_function(stopsmbbs[1,]

for(i in 2:nrow(stopsmbbs) { 
temp <- direct_to_species_function(stopsmbbs[i,])
stopsfilled <- rbind(stopsfilled, temp) }

Although I sense that's not the ideal solution here.

If direct_to_sp_com needs to go back to having explicitly species_comments and count and returning only the list...

direct_to_sp_com_function <- function(species_comments, count) {

  #route species comments to the relevent function to process it.
  if(str_starts(species_comments, ",|[0-9]+,")) {

    stop_counts_list <-
    sp_com_comma_seperated(species_comments, count)
  } else if (str_starts(species_comments, "[0-9]+=[0-9]+|st(op)?( )?[0-9]+=[0-9]+")) {

    stop_counts_list <- 
    sp_com_stop_equals_count(species_comments, count)

  } else if (str_starts(species_comments, "st(op)?")) {

    stop_counts_list <- 
    sp_com_only_stop(species_comments, count)

  }

  stop_counts_list #return
}
bsaul commented 5 months ago

purrr::map_dfr( .x = stopsmbbs, .f = direct_to_sp_com_function )

I think you want to map on the vector on the data.frame, as in:

stopmbbs |>
  mutate(
     newcol = purrr::map2(.x = comment, .y = count, ~ direct_to_sp_com_function(.x , .y)) 

(or some variation of this).

newcol will be a list column which we can then unpack, but see if something like the above works.

IJBG commented 4 months ago

Woo! Back from TA'ing the study abroad, and I think wrapped up the refactoring. I tried both before I left and for an hour today to use purrr:: but wasn't successful, and decided to write a very small for loop to solve the problem instead.

Moved where the errors (sum s1:s20 doesn't match original count) are caught (before: in the sp_com functions, now: in process_species_comments) so that there's a print() notification about how many errors, and they get stored in a .csv for follow-up, but the loop can run without tracking all the errors down beforehand. Right now there's 34 and I should be able to solve at least some of them.

I think with that, process_species_comments is done pending your review @bsaul ? If you think it looks good I'll finish up the documentation for the functions.

bsaul commented 4 months ago

Looking at this now...

bsaul commented 4 months ago

The test suite is failing for me: process_comments not found. Has this function been renamed?

IJBG commented 4 months ago

For tomorrow: -[ ] Address some of the errors spreadsheet, or after review open issue to resolve when working on the 2024 new-data update

To resolve comments (thank you Bradley!!):

To neaten script

IJBG commented 4 months ago

Fixed the testing and finished up addressing all the comments from your review Bradley. I moved the functions to a new script 'process_species_comments.R' as this will keep 'prepare_stop_level_data' clean for building functions related to stop-level vs route-level data there. Also, puts species_comments in line with the historical_xls (also gets its own R script) and with the paper datasets that are being transcribed (will get their own processing R script as well). This pull request is no longer a WIP 🎉@bsaul

IJBG commented 4 months ago

Fixed merge conflicts

IJBG commented 4 months ago

@bsaul I think it's good to go and you reviewed it last week, but just wanted to check that you approve before I squash and merge the branch

bsaul commented 4 months ago

@bsaul I think it's good to go and you reviewed it last week, but just wanted to check that you approve before I squash and merge the branch

Thanks. Just saw this. Looks good on a quick skim.