tnc-br / ddf-isoscapes

Data Ingestion from USP. Define Training, Validation and Test Sets #115

Closed benwulfe closed 10 months ago

benwulfe commented 1 year ago

We need to align on the training, validation and test sets used across candidate models for consistency and proper validation.

In addition, since our validation technique requires multiple element analysis at the same lat/lon, we will want to prioritize locations that contain multiple elements to belong to the test set (barring any other statistical reasons for exclusion).

erickzul commented 1 year ago

Quick question: do we want to consider all of the training data we've accumulated, or only the UC Davis data?

benwulfe commented 1 year ago

All of the data -- with corrections applied to calibrate the data from UC Davis against the previous analysis done at USP. My impression was that Gabriella was doing this (the Gabriella at USP). You will want to verify with Greta. If we need to do it ourselves, I believe the following math aligns the data: 0.9775 * USP = UCD

That is, we multiply USP numbers by 0.9775 to calibrate them to the UC Davis scale.
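A minimal sketch of applying that correction with pandas. The column names (`lab`, `d18O_cel`) are assumptions for illustration; substitute whatever the actual CSV schema uses:

```python
import pandas as pd

USP_TO_UCD_FACTOR = 0.9775  # from above: 0.9775 * USP = UCD


def calibrate_usp_to_ucd(df: pd.DataFrame, value_col: str = "d18O_cel") -> pd.DataFrame:
    """Return a copy with USP-measured rows rescaled to UC Davis units.

    Assumes a `lab` column marking where each sample was analyzed;
    rows already on the UC Davis scale are left untouched.
    """
    out = df.copy()
    usp_rows = out["lab"] == "USP"
    out.loc[usp_rows, value_col] = out.loc[usp_rows, value_col] * USP_TO_UCD_FACTOR
    return out
```

This keeps the correction idempotent per row as long as the `lab` label is updated (or the data flagged) after calibration, which matters given the uncertainty below about whether the 6/23 data already has the correction applied.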

benwulfe commented 1 year ago

We can run without this in the near term. If we can have a split of the data sometime next week, that would be good. I think all we need to do is take the CSV provided and split it into three sets (training, validation, test), with the only caveats being:

a) the test set should try to include locations that have all elements
b) in the xgboost code, the test set was geographically distinct to prevent overfitting. I believe Nicholas cordoned it off via the following lat/lon boundaries:

```python
train = df[df["lon"] < -55]
test = df[(df["lon"] >= -55) & (df["lat"] > -2.85)]
validation = df[(df["lon"] >= -55) & (df["lat"] <= -2.85)]
```

See https://github.com/tnc-br/ddf-isoscapes/blob/main/xgboost/partition.py for more info.
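The geographic split above can be checked end to end on a tiny synthetic frame; the real `lat`/`lon` columns come from the provided CSV, and the values here are made up purely for illustration:

```python
import pandas as pd

# Four hypothetical sample locations in the region of interest.
df = pd.DataFrame({
    "lat": [-3.0, -1.0, -5.0, -2.0],
    "lon": [-60.0, -50.0, -50.0, -56.0],
})

# The three-way geographic partition from xgboost/partition.py:
# everything west of lon -55 trains; the eastern block is split
# north/south at lat -2.85 into test and validation.
train = df[df["lon"] < -55]
test = df[(df["lon"] >= -55) & (df["lat"] > -2.85)]
validation = df[(df["lon"] >= -55) & (df["lat"] <= -2.85)]

# The three conditions are mutually exclusive and exhaustive,
# so every row lands in exactly one partition.
assert len(train) + len(test) + len(validation) == len(df)
```

Because the boundaries partition the plane, no row can fall into two sets or be dropped, which is the property worth asserting before training candidate models against the split.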

benwulfe commented 1 year ago

Note that per Greta's comments on the latest data (the 6/23 data), the 0.9775 correction has already been applied. We may want to confirm.

erickzul commented 1 year ago

To establish a correct train/validation/test split, I wanted to understand the data a bit better. Doc

Some observations so far:

I plan to discuss this with Nic before moving forward.

Characteristics to validate in the resulting evaluation set (although the same may apply initially to the training set as well):

W.r.t. outliers:

benwulfe commented 1 year ago

re: "Instead, we should choose to look at histograms and find shapes that may be noise"

Can you explain which histogram you would use? The linked screenshot shows a histogram across latitudes, which I don't see as indicative of an outlier.
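For concreteness, one plausible reading of "look at histograms and find shapes that may be noise" is to histogram the measured values per element rather than per latitude, and flag sparsely populated bins far from the mode. This is an assumed interpretation, not necessarily what erickzul meant; the synthetic values below stand in for a real element column:

```python
import numpy as np

np.random.seed(42)

# Synthetic stand-in for one element's measurements: a well-behaved
# cluster plus two suspiciously large values.
values = np.concatenate([
    np.random.normal(27.0, 1.0, 200),  # bulk of samples
    np.array([40.0, 41.5]),            # possible noise / outliers
])

counts, edges = np.histogram(values, bins=20)

# Bins far from the most-populated bin that still contain a few
# samples are candidates for manual review before the split.
mode_bin = int(counts.argmax())
suspect_bins = [i for i, c in enumerate(counts) if c > 0 and abs(i - mode_bin) > 5]
```

A per-element view like this would distinguish measurement noise from genuine geographic spread, which a latitude histogram alone cannot do.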