jzwart / delaware-water-temp

Data assimilation estimates of Delaware River Basin water temperature for both streams and lakes.

Train test data #4

Closed jzwart closed 4 years ago

jzwart commented 4 years ago

Building training and test datasets for the DRB stream network for both synthetic observations (SNTemp-generated) and real observations.

Addressing the following questions

  1. Which sampling routine is best for PGDL accuracy (using SNTemp uncalibrated output as truth)? generated here
  2. How does PGDL compare to calibrated SNTemp? How does performance change with varying data amounts for training? generated here See more notes on building training / test datasets here

Since the synthetic data is quite large (~1 GB for the full dataset), I only create the train/test schemes with seg_id_nat and date in the data frame; the schemes will have to be joined with the synthetic data output, which Xiaowei already has. The data frame would look like this: [image] where each experiment using the sampling scheme randomizes which dates/segments are used for training in the first 24 years of the dataset (the last 12 years are withheld for testing only).

The real-observation dataset sampling scheme just uses a randomized sample of XX% of the data from the first 24 years for training; the last 12 years are used for testing only. See the notes for discussion of other sampling routines (I can update this PR with new routines). The data frame for real observations would look like this: [image] For the full network, the number of observations per day increases with time: [image] where the colors are the percent of observations given to training (i.e. 50 == 50% of the observations within the first 24 years used for training).

There are some issues with the current randomized scheme, as most observations come from only a few highly monitored segments (90% of the training observations occur in just 7% of the segments). Also, because the data are autocorrelated in time, randomly withholding observations does not create a true data-sparsity challenge: the withheld points remain largely predictable from their neighbors.
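A quick way to quantify that concentration is to ask what fraction of segments holds a given share of the training observations. A minimal pandas sketch (assuming a training data frame with a seg_id_nat column, as described above):

```python
import pandas as pd

def obs_concentration(train: pd.DataFrame, top_frac: float = 0.9) -> float:
    """Fraction of segments that together hold `top_frac` of the
    training observations (smaller = more concentrated)."""
    counts = train.groupby("seg_id_nat").size().sort_values(ascending=False)
    cum = counts.cumsum() / counts.sum()
    # number of top segments needed to reach the top_frac share
    n_needed = (cum < top_frac).sum() + 1
    return n_needed / len(counts)
```

With the numbers above, obs_concentration(train, 0.9) would come out near 0.07.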

jzwart commented 4 years ago

@aappling-usgs @jsadler2 @jread-usgs @limnoliver could you take a look at this train/test scheme so far?

jordansread commented 4 years ago

This looks thoughtful to me. I wonder, though, if it is worth also including the small Python translator code snippet that goes from your pared-down seg_id_nat scheme (when using SNTemp as test/train) to the full matrix equivalent to the larger (1 GB) data file. I'm suggesting this because that translation is critical to get right for reproducibility, and having confidence that it is carried out as intended would be :+1:
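The translator being suggested could look something like the sketch below: a left join of the pared-down scheme onto the full SNTemp output on seg_id_nat and date. Column names other than seg_id_nat and date are assumptions for illustration; the real SNTemp output has many more columns.

```python
import pandas as pd

def expand_scheme(sntemp: pd.DataFrame, scheme: pd.DataFrame) -> pd.DataFrame:
    """Attach train/test/ignore labels from the pared-down scheme
    (seg_id_nat, date, experiment) to every row of the full SNTemp output."""
    sntemp = sntemp.copy()
    scheme = scheme.copy()
    for df in (sntemp, scheme):
        df["date"] = pd.to_datetime(df["date"])
    # left join keeps every SNTemp segment-date; validate guards against
    # accidental duplicate keys in the scheme silently inflating the data
    return sntemp.merge(scheme, on=["seg_id_nat", "date"],
                        how="left", validate="many_to_one")
```

Segment-dates absent from the scheme come back with a missing experiment label, which makes any mismatch between the two files easy to spot.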

aappling-usgs commented 4 years ago

The data.frames have train or test in each cell including dates before the year 2000, but your text indicates that it's the last 12 years that will be used for testing. Should those test cells actually be ignore or similar?

aappling-usgs commented 4 years ago

I don't think you should be committing files from .remake/objects.

jzwart commented 4 years ago

@jread-usgs , yeah, these would be shared through Drive. This repo is connected to my personal Google account and I'm running out of space, which is why I was suggesting only seg_id_nat and date :) I could move it elsewhere or provide a code snippet for linking them together. However, we aren't calibrating with any of that data, since that treats SNTemp as truth and PGDL is emulating it (not sure if that makes a difference, because only PGDL will be using that data to calibrate).

@aappling-usgs , good point. I'll change the first-24-year cells from test to ignore.

jzwart commented 4 years ago

Update and final scheme: the first 24 years are used for training, and a given segment-date will either be used for training (indicated as train in the experiment columns) or ignored (indicated as ignore). All of the data from the last 12 years will be used for testing (indicated as test in the experiment columns). [image]

The training and test schemes for the synthetic data should be merged with the synthetic data output by joining by seg_id_nat and date - I have some example Python code here

The real-observation training/test scheme again uses the first 24 years of data for training and the last 12 for testing. However, to make training more of a challenge for PGDL and to reduce the chance of autocorrelation in the observation data, we capped the number of observations any segment can use for training at 100 and kept only segments that have at least 100 observations (this ends up being 65 segments for the full DRB and 8 segments for the subsetted DRB). Each of these segments can have a random set of between 2 and 100 observations, which serves as a data-sparsity test for PGDL and calibrated SNTemp. So the maximum number of observations the subsetted network will be trained on is 800 (100 for each of the 8 segments) and the minimum is 16 (2 for each segment).
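The capped sampling described above could be sketched as follows (function and column names are assumptions for illustration, not the repo's actual code):

```python
import pandas as pd

def cap_training_obs(train_obs: pd.DataFrame, n_obs: int,
                     min_obs: int = 100, seed: int = 0) -> pd.DataFrame:
    """Keep only segments with >= min_obs training observations, then
    randomly draw n_obs observations (2..100 in the experiments) from
    each kept segment."""
    counts = train_obs.groupby("seg_id_nat").size()
    keep = counts[counts >= min_obs].index
    subset = train_obs[train_obs["seg_id_nat"].isin(keep)]
    # equal draw per segment, so total training size = n_obs * n_segments
    return subset.groupby("seg_id_nat").sample(n=n_obs, random_state=seed)
```

Sweeping n_obs from 2 to 100 then reproduces the 16-to-800-observation range quoted above for the 8-segment subsetted network.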

Here's an example of the training observations used under the different schemes: [image]

jzwart commented 4 years ago

I'm merging this, but we can iterate on the train/test scheme later after Xiaowei works with these schemes.