jzwart closed this 4 years ago
@aappling-usgs @jsadler2 @jread-usgs @limnoliver could you take a look at this train/test scheme so far?
This looks thoughtful to me. I wonder, though, if it is worth also including the small Python translator code snippet that goes from your pared-down `seg_id_nat` scheme (when using SNTemp as test/train) to the full matrix that is equivalent to the larger (1 GB) data file. Suggesting this because that translation is critical to get right for reproducibility, and having confidence that it is carried out as intended would be :+1:
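The linked translator snippet isn't shown in the thread, but a minimal sketch of what such an expansion could look like is below. The `exp1` column name and the example values are hypothetical, not the actual scheme file's columns; the idea is just to pivot the long-format `seg_id_nat`/`date` scheme into the full segment-by-date matrix.

```python
import pandas as pd

# Hypothetical pared-down scheme: one row per seg_id_nat/date pair,
# with an experiment column marking 'train' or 'test'.
scheme = pd.DataFrame({
    "seg_id_nat": [1573, 1573, 1574, 1574],
    "date": pd.to_datetime(["1985-01-01", "2005-01-01",
                            "1985-01-01", "2005-01-01"]),
    "exp1": ["train", "test", "train", "test"],
})

# Expand to the full date-by-segment matrix equivalent to the large file.
full = scheme.pivot(index="date", columns="seg_id_nat", values="exp1")
print(full)
```

The pivoted matrix can then be aligned cell-for-cell against the ~1 GB file to verify the translation.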
The data.frames have `train` or `test` in each cell, including dates before the year 2000, but your text indicates that it's the last 12 years that will be used for testing. Should those `test` cells actually be `ignore` or similar?
I don't think you should be committing files from `.remake/objects`.
@jread-usgs, yeah, these would be shared through Drive. This repo is connected to my personal Google account and I'm running out of space, which is why I was suggesting only `seg_id_nat` and `date` :) I could move it elsewhere or give a code snippet for linking them together. However, we aren't calibrating with any of that data, since that is treating SNTemp as truth and PGDL is emulating it (not sure if that makes a difference, because only PGDL will be using that data to calibrate).
@aappling-usgs, good point. I'll change the first-24-years cells from `test` to `ignore`.
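If the scheme lives in a long-format data frame, that relabeling might look something like the sketch below. The `exp1` column name and the cutoff date are assumptions for illustration; the actual boundary of the 12-year test period isn't stated in the thread.

```python
import pandas as pd

# Hypothetical scheme data frame with one experiment column; cells in
# the first 24 years mistakenly labeled 'test' should become 'ignore'.
df = pd.DataFrame({
    "date": pd.to_datetime(["1990-06-01", "1995-06-01", "2005-06-01"]),
    "exp1": ["train", "test", "test"],
})

cutoff = pd.Timestamp("2004-10-01")  # assumed start of the 12-year test period
pre_test = (df["date"] < cutoff) & (df["exp1"] == "test")
df.loc[pre_test, "exp1"] = "ignore"
print(df)
```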
Update and final scheme:

The first 24 years are used for training; a given segment-date will either be used for training, indicated as `train` in the experiment columns, or ignored, indicated as `ignore`. All the data from the last 12 years will be used for testing, indicated as `test` in the experiment columns.
The training and test schemes for the synthetic data should be merged with the synthetic data output by joining on `seg_id_nat` and `date`. I have some example Python code here.
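The linked example code isn't reproduced in the thread; a minimal pandas sketch of that join is below. The `seg_tave_water` column name and its values are made up for the example, and the real SNTemp output columns may differ.

```python
import pandas as pd

# Hypothetical synthetic SNTemp output and a train/test scheme, joined
# on seg_id_nat and date (the scheme files carry only those two keys
# plus the experiment columns, to keep them small).
sntemp = pd.DataFrame({
    "seg_id_nat": [1573, 1573],
    "date": pd.to_datetime(["1990-06-01", "2005-06-01"]),
    "seg_tave_water": [14.2, 16.8],  # invented values
})
scheme = pd.DataFrame({
    "seg_id_nat": [1573, 1573],
    "date": pd.to_datetime(["1990-06-01", "2005-06-01"]),
    "exp1": ["train", "test"],
})

merged = sntemp.merge(scheme, on=["seg_id_nat", "date"], how="left")
print(merged)
```

A left join keeps every synthetic output row even if a segment-date is missing from the scheme.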
The real-observations train/test scheme again uses the first 24 years of data for training and the last 12 for testing. However, to make training more of a challenge for PGDL and to reduce the chance of autocorrelation in the observation data, we capped the number of observations any segment can use for training at 100 and kept only those segments that have at least 100 observations (this ends up being 65 segments for the full DRB and 8 segments for the subsetted DRB). Each of these segments can have a random set of observations ranging from 2 to 100, and this serves as a data-sparsity test for the PGDL and calibrated SNTemp. So the maximum number of observations the subsetted network will be trained on is 800 (100 for each of the 8 segments) and the minimum is 16 (2 for each of the segments).
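One way that per-segment filtering and subsampling could be implemented is sketched below. The segment IDs, observation counts, and the `n_obs = 50` sparsity level are all invented for illustration; in the real scheme `n_obs` would range from 2 to 100.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical observation table: one row per segment-date observation.
obs = pd.DataFrame({
    "seg_id_nat": rng.integers(1, 5, size=600),  # 4 fake segments
    "temp_c": rng.normal(15, 3, size=600),
})

# Keep only segments with at least 100 observations, then draw a random
# subset of n_obs observations per eligible segment.
counts = obs.groupby("seg_id_nat").size()
eligible = counts[counts >= 100].index
n_obs = 50  # sparsity level for this experiment; would vary 2..100

train = (
    obs[obs["seg_id_nat"].isin(eligible)]
    .groupby("seg_id_nat")
    .sample(n=n_obs, random_state=7)
)
print(train.groupby("seg_id_nat").size())
```

Repeating this with different `n_obs` values produces the sparsity gradient described above (800 observations down to 16 for the 8-segment subset).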
Here's an example of the training observations used under the different schemes:
I'm merging this, but we can iterate on the train/test scheme later, after Xiaowei works with these schemes.
Building training and test datasets for the DRB stream network for both synthetic observations (SNTemp-generated) and real observations.
Addressing the following questions:
Since the synthetic data is quite large (~1 GB file for the full dataset), I only create the train/test scheme with `seg_id_nat` and `date` in the data frame; the schemes will have to be joined with the synthetic data output, which Xiaowei already has. Each experiment using the sampling scheme randomizes which dates/segments are used for training in the first 24 years of the dataset (the last 12 years are withheld for testing only). The data frame would look like this:

The real-observation data set sampling scheme just uses a randomized sample of XX% of the data from the first 24 years of the dataset for training, and the last 12 years are used for testing only. See the notes for discussion of other sampling routines (I can update this PR with new routines). The data frame for real observations would look like this:
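As a sketch, one way such a scheme column could be generated is shown below. The split date, the sampling fraction, and the `exp_50` column name are assumptions for illustration, since the actual XX% values aren't given here.

```python
import pandas as pd

# Hypothetical real-observation record spanning the 36-year period.
obs = pd.DataFrame({
    "seg_id_nat": 1573,
    "date": pd.date_range("1980-10-01", periods=36 * 365, freq="D"),
})

split = pd.Timestamp("2004-10-01")  # assumed first-24-yr / last-12-yr boundary
train_frac = 0.5                    # assumed XX% (50 == 50%) sampling fraction

# Last 12 years are test-only; a random fraction of the first 24 years
# becomes training, and the rest of the early period is ignored.
obs["exp_50"] = "ignore"
obs.loc[obs["date"] >= split, "exp_50"] = "test"
early = obs[obs["date"] < split]
train_idx = early.sample(frac=train_frac, random_state=1).index
obs.loc[train_idx, "exp_50"] = "train"
print(obs["exp_50"].value_counts())
```

Each experiment column (e.g. different fractions) can be built the same way with a different seed and fraction.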
For the full network, the number of observations per day increases with time:

where the colors are the percent of observations given for training (i.e., 50 == 50% of the observations within the first 24 years used for training).
There are some issues with the current randomized scheme, as most observations are in only a few highly monitored segments (90% of the training observations occur in only 7% of the segments). For example, autocorrelation in the data would undermine the data-sparsity challenge if observations were randomly withheld.