added proposed data structure

gavinfay commented 3 years ago

Added proposed data structure. @sgaichas do you think this makes sense? For the catches, if you did want to keep 'aggregate over fleet' as an option we could denote total catch with fleet index value of 0 (or something similar).
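
For illustration, the "fleet index 0 means total catch" idea could look something like the sketch below. The row layout `(year, fleet, species, catch)` and the helper name `add_total_catch_rows` are assumptions made up for this example, not the actual file format.

```python
from collections import defaultdict

def add_total_catch_rows(rows):
    """Append one fleet-0 row per (year, species) with catch summed over fleets.

    rows: list of (year, fleet, species, catch) tuples with fleet >= 1.
    Hypothetical layout, for illustration only.
    """
    totals = defaultdict(float)
    for year, fleet, species, catch in rows:
        totals[(year, species)] += catch
    # Fleet index 0 marks "aggregated over all fleets".
    total_rows = [(year, 0, species, total)
                  for (year, species), total in sorted(totals.items())]
    return list(rows) + total_rows

rows = [
    (1, 1, 1, 10.0),  # year 1, fleet 1, species 1
    (1, 2, 1, 5.0),   # year 1, fleet 2, species 1
    (1, 1, 2, 3.0),   # year 1, fleet 1, species 2
]
with_totals = add_total_catch_rows(rows)
```

A reader of such a file would then treat rows with fleet 0 as totals and skip them when fitting fleet-specific catch, so the same catch is not counted twice.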

sgaichas commented 3 years ago

Yes, I think it makes sense. To be extra sure, here is how I am interpreting it: for length comps, InpN will be the total number of lengths measured and the value in each lb is the proportion in that bin? And the same idea for prey proportions, but with InpN as the total number of stomachs for that predator size?

gavinfay commented 3 years ago

I was thinking of the input N's as more generic, i.e. the multinomial input sample size. So this could be # stomachs or # lengths, but it could also be # trips or # tows, because the effective sample size is probably less than the number of fish measured.

And yes, 'lb' would be the proportion by length bin.
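
As a rough sketch of that row structure: the proportions come from the measured counts, while InpN is supplied separately as whatever input sample size is chosen. The function name `length_comp_row` and the numbers are invented for the example.

```python
def length_comp_row(counts_by_bin, inp_n):
    """Return (InpN, [proportion per length bin]) for one survey/year/species.

    inp_n is the multinomial input sample size. It is passed in rather than
    computed, because it need not equal the number of fish measured.
    """
    total = sum(counts_by_bin)
    if total <= 0:
        raise ValueError("no lengths measured")
    props = [c / total for c in counts_by_bin]
    return inp_n, props

# 200 fish measured across 4 length bins, but taken from only 12 tows,
# so 12 is used here as the input sample size (an illustrative choice).
inp_n, lb = length_comp_row([20, 80, 60, 40], inp_n=12)
```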

sgaichas commented 3 years ago

Sounds good. I'll create a metadata file as well describing what is in each column so we know what the inputN is in each case.

sgaichas commented 3 years ago

Example input files for catch, survey index, catch at length and survey length comp have been added to https://github.com/thefaylab/hydradata/tree/master/data-raw in commits https://github.com/thefaylab/hydradata/commit/0e7c37fa54427c86a33771e3f4bb05ba73f363ac and https://github.com/thefaylab/hydradata/commit/7ab16328b15a1c99cef62d1eeb27db6ea6b5ed48

filenames:
observation_catch_NOBA_allfisheries.csv
observation_biomass_NOBA_allsurvs.csv
observation_lengths_NOBA_allfisheries.csv
observation_lengths_NOBA_allsurvs.csv

These are based on NOBA Atlantis outputs and are for illustration only; they contain multiple surveys but also duplicate surveys and fisheries (same thing sampled twice). The code that generates the files is all in here but I will organize it better once I'm all the way through.

At present, InpN in the length files is the total number of lengths measured (in turn based on the effective sample size entered into atlantisom configuration files).

If the structures look ok I can make files with the real data as well, otherwise we can keep adjusting these. Please let me know.

sgaichas commented 3 years ago

Questions on the proposed survey-prey-proportions structure:

Shall I add a "Survey" column ahead of year as in the other files?

Should values be proportion by weight?--looks like it but just double checking

Do we want a column of "all other prey" so they sum to 1?

sgaichas commented 3 years ago

@gavinfay the survey-prey-proportions file is almost ready--just two more (related) questions:

if there are no observations for a particular sizebin, is it ok to leave that sizebin out rather than have a row of NA? (currently I leave it out)

if a predator does not eat any of our modeled species, should it not appear in this file? or do you want entries for each predator with NA for our modeled species diet proportion and a 1 under "all other prey"?

thanks!

sgaichas commented 3 years ago

Added an example diet proportion by predator/lengthbin in thefaylab hydradata repository commit https://github.com/thefaylab/hydradata/commit/3a82dc22d712109af3d58163839b5cb85d078190

in the data-raw folder, filename is observation_diets_NOBA_allsurvs.csv

it has the above properties: no rows for predators not eating any of our modeled species, and no rows for missing sizebins of predators that do eat our modeled species. there is a column for allotherprey at the end, which can be a very large proportion of diet!
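
A minimal sketch of those row rules, under stated assumptions: proportions are by weight (asked above but not confirmed in this thread), a size bin with only non-modeled prey gets zeros plus an allotherprey of 1, and the species names, input layout and function name `diet_rows` are all hypothetical.

```python
def diet_rows(prey_weight, modeled_species):
    """Build diet-proportion rows keyed by (predator, sizebin).

    prey_weight: {(predator, sizebin): {prey_name: weight}}
    modeled_species: prey names that get their own column.
    Each row is [proportion per modeled species..., allotherprey] and sums to 1.
    """
    # Predators that eat at least one modeled species in any size bin.
    eaters = {pred for (pred, _), w in prey_weight.items()
              if any(w.get(sp, 0.0) > 0 for sp in modeled_species)}
    out = {}
    for (pred, sizebin), weights in prey_weight.items():
        total = sum(weights.values())
        if pred not in eaters:
            continue  # predator eats none of the modeled species: no rows
        if total <= 0:
            continue  # no observations for this size bin: leave the row out
        modeled = [weights.get(sp, 0.0) / total for sp in modeled_species]
        out[(pred, sizebin)] = modeled + [1.0 - sum(modeled)]
    return out

prey_weight = {
    ("cod", 1): {"herring": 2.0, "krill": 6.0},
    ("cod", 2): {},                  # no stomachs sampled in this size bin
    ("haddock", 1): {"krill": 4.0},  # eats no modeled species
}
rows = diet_rows(prey_weight, modeled_species=["herring", "capelin"])
```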

I have not yet coded the surveys or species as numbers for consistency with previous files, but I can do that in one pass for all of them.

please let me know if you see anything strange. thanks!

gavinfay commented 3 years ago

These look great. Thank you! I have a very short dummy data set in the hydra_sim.dat test file in this repo, but will map this to a version that uses these NOBA data. Also thinking it is making more and more sense to have these data in a separate file from the default input data file that controls parameters etc. (e.g. separating the data from the model specs).

sgaichas commented 3 years ago

hydra_sim_NOBA.dat has been added with https://github.com/thefaylab/hydra_sim/commit/95583db5aeca2663f7375b6009420d91327019a0. I think we should probably separate the datafiles... diet data alone is a huge scroll through this file. Please let me know what you think. Documentation of building this is here; I used hydradata to write it so https://github.com/thefaylab/hydradata is now up to date with those changes.

there are a lot of dummy parameters in here that I don't think we need in hydra_est, but maturity may not be one of them... so I can estimate those from Atlantis if we need them (I think they are only needed in the SSB calculation?)

still to do is the .pin file; need to dimension things correctly and then figure out what good start pars are

sgaichas commented 3 years ago

Should we add a column for "month" or other sub-annual timestep to catch and survey data files?

Simulated and real data are sub-annual. I was summing fishery data to the year, and each survey happens once per year but at a different time. I could instead leave fishery data seasonal (simulated at 5 output timesteps per year).

Let me know what you think.

gavinfay commented 3 years ago

I don't think a month column is necessary. I was thinking that each survey has a 'timing', which is what determines when (sub-annually) the predicted value gets calculated. For the catches, the objective function sums over the year to calculate the predicted value. If you think that fitting to data on a more granular level would help then we can do this, but I am not sure other model dynamics are representative of changes at these timescales?
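
To make the two cases concrete, here is a small sketch of the idea, not hydra_sim's actual code. It assumes a fixed number of timesteps per year and a survey `timing` expressed as a fraction of the year; both function names are invented.

```python
def annual_catch(catch_by_step, steps_per_year):
    """Sum sub-annual catches into one total per year."""
    if len(catch_by_step) % steps_per_year != 0:
        raise ValueError("series length is not a whole number of years")
    return [sum(catch_by_step[i:i + steps_per_year])
            for i in range(0, len(catch_by_step), steps_per_year)]

def survey_value(biomass_by_step, steps_per_year, timing):
    """Take the biomass in the timestep that contains `timing` (0-1) each year."""
    step = min(int(timing * steps_per_year), steps_per_year - 1)
    return [biomass_by_step[i + step]
            for i in range(0, len(biomass_by_step), steps_per_year)]

# Two years at 5 timesteps per year (made-up numbers).
catch = [1, 2, 3, 4, 5, 2, 2, 2, 2, 2]
biomass = [10, 11, 12, 13, 14, 20, 21, 22, 23, 24]
yearly_catch = annual_catch(catch, steps_per_year=5)
mid_year_survey = survey_value(biomass, steps_per_year=5, timing=0.5)
```

With this layout the data files need only one row per year for each fleet or survey, and the sub-annual detail stays in the survey timing setting.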

sgaichas commented 3 years ago

sounds good--files remain as they were, but are now updated with the corrected NOBA run and an initial .pin file. the time series data have been moved to hydra_sim_NOBA-ts.dat, and the time series for fitting have been truncated to 80 years (no longer 144)

gavinfay commented 3 years ago

@sgaichas It looks like the new hydra_sim_NOBA.dat doesn't include some of the other changes I'd made to the data file (or at least the pointer to the time series file). Unfortunately, a recent MacOS update seems to have messed with my ADMB installation so I am currently rebuilding ADMB from source so that I can do some sleuthing with compilation. More soon!

gavinfay commented 3 years ago

Is clang named thusly because you clang your head against the desk repeatedly?

gavinfay commented 3 years ago

@sgaichas it looks like there are some variables near the top of hydra_sim_NOBA.dat that are still of length 144 instead of length 80 (Nyrs). These are:

# init_matrix recruitment_cov(1,Nrecruitment_cov,1,Nyrs)
# init_matrix maturity_cov(1,Nmaturity_cov,1,Nyrs)
# init_matrix growth_cov(1,Ngrowth_cov,1,Nyrs)
# init_3darray obs_effort(1,Nareas,1,Nfleets,1,Nyrs)

Would change directly but suspect this needs to be a modification to the script that is generating the input file from the Atlantis output?

Suggest that the time series data file name is read in early on, say after the dimensioning variables?

So, perhaps after

# init_number wtconv

add

## Time series data file
hydra_sim_NOBA-ts.dat

sgaichas commented 3 years ago

@gavinfay thanks; I found the dimensioning mistake in the script and corrected it.

The dat file function now adds the time series data file name as suggested, after wtconv
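
For reference, the kind of dimension fix discussed above can be sketched as below. This is not the hydradata code; the helper name is made up, and keeping the first 80 years is an assumption, since the thread does not say which years are retained.

```python
def truncate_to_nyrs(matrix, nyrs):
    """Keep the first `nyrs` values of each row; fail if a row is too short."""
    out = []
    for i, row in enumerate(matrix):
        if len(row) < nyrs:
            raise ValueError(f"row {i} has {len(row)} values, need {nyrs}")
        out.append(list(row[:nyrs]))
    return out

# One covariate row still dimensioned to 144 years (placeholder values).
recruitment_cov = [[1.0] * 144]
fixed = truncate_to_nyrs(recruitment_cov, nyrs=80)
```

Running a check like this on every year-dimensioned input before writing the .dat file would catch a 144-versus-80 mismatch early.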

sgaichas commented 3 years ago

Outstanding issues being addressed today:

there are still 3 vectors in the .dat file that aren't being filled by the hydradata package, so I'm fixing that today
removing species names from the initial length composition in the .pin file

@gavinfay please add any other input file problems I've missed

sgaichas commented 3 years ago

two issues in above comment should be fixed in https://github.com/thefaylab/hydra_sim/commit/0bc52d51439519ecf33b73b922d3b5c4d6234915


added proposed data structure #2