gulfofmaine / sdm_workflow

A repository to help streamline the species distribution model development and prediction workflow.
MIT License

Trawl data processing #15

Closed aallyn closed 2 years ago

aallyn commented 3 years ago

Moved from "discussions" within Gulf of Maine/Teams. @LGCarlson Based on Kathy's email and the DFO data now being available, I figured I'd start a discussion thread here rather than use email/Slack, since it sounds like things are going to start moving forward on the trawl data front and others on the team may have input on how we pre-process these data as well.

I'm not sure where to start -- maybe with what we have right now? I went in and grabbed some of your code from the Pew Leading/Trailing Edges project that processes the trawl data into a tidier data set ready for analysis of distributions -- which in my mind means each row is a unique tow - species - total biomass/abundance combination. I put those pieces into a function, nefsc_trawl_prep_func.R, which is on GitHub here.

The other way of thinking about this is: what do we want? Based on my very limited understanding of how an idealized pipeline would work, I think we are going to want (need?) a single "processing" function for each of the different data sets. The key requirement is that each processing function outputs a "tidy" data file with the exact same columns (names, value scales) as the others, so the outputs are easily combined. The second twist is that we may also want a function that tidies things up but doesn't sum biomass/abundance across the different size bins. You and @adamkemberling may have some insight here based on what you've done for the Pew project and what Adam has done with the size-spectrum work as part of the WARMEM project.
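A minimal sketch of that idea, assuming hypothetical column names (this is not an agreed schema, just an illustration of "same columns out of every survey's prep function"):

```r
library(dplyr)
library(tibble)

# Columns every survey-specific prep function would return (illustrative only)
standard_cols <- c("survey", "tow_id", "species", "biomass_kg", "abundance")

# Hypothetical NEFSC prep: collapse size-bin rows to one row per tow-species,
# then emit exactly the shared column set so outputs from different surveys
# can be row-bound directly.
nefsc_trawl_prep <- function(raw) {
  raw |>
    group_by(tow_id, species) |>
    summarise(biomass_kg = sum(biomass_kg),
              abundance  = sum(abundance),
              .groups = "drop") |>
    mutate(survey = "NEFSC") |>
    select(all_of(standard_cols))
}

# Toy input: two size-bin rows for the same tow/species
raw <- tibble(tow_id = c("T1", "T1"), species = c("cod", "cod"),
              biomass_kg = c(1.2, 0.8), abundance = c(3, 2))
tidy <- nefsc_trawl_prep(raw)
```

A DFO version would differ internally but return the same `standard_cols`, so `bind_rows()` just works; dropping the final `summarise()` step would give the "don't sum across size bins" variant.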

adamkemberling commented 3 years ago

To clarify what steps are going on in the survdat cleanup I've been doing for the size-spectrum stuff, I have started to break the "cleanup function" into discrete steps. It feels strange to undo a single function that does every step into separate functions that each do one step, but it makes it much easier to look at the pieces on their own and test them than it was with the single behemoth function.

Here is where the build code for size spectrum stuff is living: https://github.com/adamkemberling/nefsc_trawl/blob/master/R/01_nefsc_ss_build.R
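The discrete-steps pattern can be sketched like this (function and column names are made up for illustration; the real steps live in the linked build script):

```r
library(dplyr)
library(tibble)

# Each step is a small, individually testable function...
drop_bad_tows <- function(d) filter(d, !is.na(biomass_g))
convert_units <- function(d) mutate(d, biomass_kg = biomass_g / 1000)  # g -> kg
label_survey  <- function(d, survey) mutate(d, survey = survey)

# ...and the "behemoth" becomes a thin pipeline over those steps
survdat_prep <- function(raw, survey = "NEFSC") {
  raw |>
    drop_bad_tows() |>
    convert_units() |>
    label_survey(survey)
}

raw <- tibble(tow_id = c("T1", "T2"), biomass_g = c(1500, NA))
out <- survdat_prep(raw)
```

Each step can then get its own unit test, and a bug in one step can be fixed without touching the others.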

aallyn commented 3 years ago

Awesome! Do people have a preference for how to go forward? We've now got a couple of scripts and functions related to this prep step -- one in nefsc_ss_build, one in nefsc_trawl_prep_func.R -- and it also sounds like @LGCarlson has been working on this. I'm happy to go with @adamkemberling's survdat_prep function in the https://github.com/adamkemberling/nefsc_trawl/blob/master/R/01_nefsc_ss_build.R code and use that as the standard; people are welcome to suggest edits by submitting pull requests to @adamkemberling. I'd also suggest that when @LGCarlson begins work with the DFO data, we reuse as much of the same code as possible and only make changes where necessary to account for differences among the surveys.

One question I have is how (and where) we should store these different functions -- as standalone files, or within a single R script that contains multiple functions? My preference would be the former. Thoughts?

adamkemberling commented 3 years ago

If a group of functions is used for similar purposes, or together as part of a consistent workflow, it could be helpful to have them in the same place -- easier to keep track of than a separate file for every function. That's my thinking, at least.

Where that function or those functions should live -- whether they deserve their own repo or can exist in these task-specific ones -- is another question I haven't really thought through.

aallyn commented 3 years ago

Following on from the 2/11/21 meeting, there are a few new to-dos within this workflow step. Trying out this whole list thing to see how it works for tracking them...

adamkemberling commented 3 years ago

Follow-up on my January 11th comment:

Breaking up the very large function worked well until it didn't work at all, at which point I had to save the entire file under a new name and go back to the original. My takeaway: build in discrete steps if you have the foresight to start that way, or if it's obvious where the discrete breaks should go and there isn't any downstream dependency. If something is working as intended, it may not be worth the tinkering to refactor it.

adamkemberling commented 3 years ago

Update on most recent NOAA data:

Looking specifically at the species list @aallyn sent, the transition between data from the RV Albatross and the RV Henry Bigelow looks a lot more like one consistent record. What I mean is that, in most cases, there is no longer a large jump in biomass/abundance at the transition point -- a sign that the conversion factors are doing what they are intended to do.

Additionally, following the conversation on Slack, I have changed my cleanup code so it no longer drops columns. In the future, if we ask Sean/NOAA for more columns, the cleanup code won't have to change for those columns to come through the cleanup steps.
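A sketch of what "not dropping columns" can look like in practice (column names here are hypothetical, not the actual survdat fields): clean the columns the code knows about with `rename()`/`mutate()` and skip a trailing `select()`, so any columns added later pass through untouched.

```r
library(dplyr)
library(tibble)

clean_survdat <- function(raw) {
  raw |>
    # rename only the columns we know about; any_of() ignores missing ones
    rename(any_of(c(tow_id = "STATION", species = "COMNAME"))) |>
    # light cleanup on character columns
    mutate(across(where(is.character), trimws))
  # no trailing select(): columns we didn't ask for still come through
}

raw <- tibble(STATION = " 101 ", COMNAME = "ATLANTIC COD",
              NEW_NOAA_COL = 42)  # a column the cleanup never heard of
out <- clean_survdat(raw)
```

With this shape, a new column from Sean/NOAA shows up in the output automatically, and `any_of()` keeps the rename from erroring if an expected column is absent.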