tnc-br / ddf-isoscapes

Data Ingestion from USP. Define Training, Validation and Test Sets #115

Closed benwulfe closed 10 months ago

benwulfe commented 1 year ago

We need to align on the training, validation and test sets used across candidate models for consistency and proper validation.

In addition, since our validation technique requires multiple element analysis at the same lat/lon, we will want to prioritize locations that contain multiple elements to belong to the test set (barring any other statistical reasons for exclusion).

erickzul commented 1 year ago

Quick question: do we want to consider all of the training data we've accumulated, or only the UC Davis data?

benwulfe commented 1 year ago

All of the data -- with corrections applied to calibrate the data from UC Davis against the previous analysis done at USP. My impression was that Gabriella was doing this (the Gabriella at USP). You will want to verify with Greta. If we need to do it ourselves, I believe the following math aligns the data: 0.9775 * USP = UCD

That is, we multiply USP numbers by 0.9775 to calibrate them to the UC Davis scale.
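A minimal sketch of applying that correction with pandas. The column names (`lab`, `d18O_cel`) are assumptions for illustration; substitute whatever the actual CSV schema uses:

```python
import pandas as pd

USP_TO_UCD_FACTOR = 0.9775  # from above: 0.9775 * USP = UCD


def calibrate_usp_to_ucd(df: pd.DataFrame, value_col: str = "d18O_cel") -> pd.DataFrame:
    """Return a copy with USP-measured rows rescaled to UC Davis units.

    Assumes a `lab` column marking where each sample was analyzed;
    rows already on the UC Davis scale are left untouched.
    """
    out = df.copy()
    usp_rows = out["lab"] == "USP"
    out.loc[usp_rows, value_col] = out.loc[usp_rows, value_col] * USP_TO_UCD_FACTOR
    return out
```

This keeps the correction idempotent per row as long as the `lab` label is updated (or the data flagged) after calibration, which matters given the uncertainty below about whether the 6/23 data already has the correction applied.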

benwulfe commented 1 year ago

We can run without this in the near term. If we can have a split of the data sometime next week, that would be good. I think all we need to do is take the CSV provided and split it into three sets (training, validation, test), with the only caveats being:

a) the test set should try to include locations that have all elements
b) in the xgboost code, the test set was geographically distinct to prevent overfitting. I believe Nicholas cordoned it off via the following lat/lon boundaries:

```python
train = df[df["lon"] < -55]
test = df[(df["lon"] >= -55) & (df["lat"] > -2.85)]
validation = df[(df["lon"] >= -55) & (df["lat"] <= -2.85)]
```

See https://github.com/tnc-br/ddf-isoscapes/blob/main/xgboost/partition.py for more info.
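The geographic split above can be checked end to end on a tiny synthetic frame; the real `lat`/`lon` columns come from the provided CSV, and the values here are made up purely for illustration:

```python
import pandas as pd

# Four hypothetical sample locations in the region of interest.
df = pd.DataFrame({
    "lat": [-3.0, -1.0, -5.0, -2.0],
    "lon": [-60.0, -50.0, -50.0, -56.0],
})

# The three-way geographic partition from xgboost/partition.py:
# everything west of lon -55 trains; the eastern block is split
# north/south at lat -2.85 into test and validation.
train = df[df["lon"] < -55]
test = df[(df["lon"] >= -55) & (df["lat"] > -2.85)]
validation = df[(df["lon"] >= -55) & (df["lat"] <= -2.85)]

# The three conditions are mutually exclusive and exhaustive,
# so every row lands in exactly one partition.
assert len(train) + len(test) + len(validation) == len(df)
```

Because the boundaries partition the plane, no row can fall into two sets or be dropped, which is the property worth asserting before training candidate models against the split.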

benwulfe commented 1 year ago

Note that per Greta's comments on the latest data (the 6/23 data), the 0.9775 correction has already been applied. We may want to confirm.

erickzul commented 1 year ago

To establish a correct train/validation/test split, I wanted to understand the data a bit better. Doc

Some observations so far:

I plan to discuss this with Nic before moving forward.

Characteristics to validate in the resulting evaluation set (although the same may apply initially to the training set as well):

W.r.t. outliers:

benwulfe commented 1 year ago

re: "Instead, we should choose to look at histograms and find shapes that may be noise"

Can you explain which histogram you would use? The linked screenshot shows a histogram across latitudes, which I don't see as indicative of an outlier.
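For concreteness, one plausible reading of "look at histograms and find shapes that may be noise" is to histogram the measured values per element rather than per latitude, and flag sparsely populated bins far from the mode. This is an assumed interpretation, not necessarily what erickzul meant; the synthetic values below stand in for a real element column:

```python
import numpy as np

np.random.seed(42)

# Synthetic stand-in for one element's measurements: a well-behaved
# cluster plus two suspiciously large values.
values = np.concatenate([
    np.random.normal(27.0, 1.0, 200),  # bulk of samples
    np.array([40.0, 41.5]),            # possible noise / outliers
])

counts, edges = np.histogram(values, bins=20)

# Bins far from the most-populated bin that still contain a few
# samples are candidates for manual review before the split.
mode_bin = int(counts.argmax())
suspect_bins = [i for i, c in enumerate(counts) if c > 0 and abs(i - mode_bin) > 5]
```

A per-element view like this would distinguish measurement noise from genuine geographic spread, which a latitude histogram alone cannot do.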