In the introduction section, discuss the three types of data leakage. It might be helpful to show an example: an image, an exact duplicate of it, a close duplicate, etc.
In introduction, talk about how duplicate samples might end up in the data.
Reasons specific to the data and how it was collected. Example: when training an email classifier on data from all students in an academic department, they will have received a lot of common emails (announcements, etc.).
Frankenstein data sets (data sets assembled from other, possibly overlapping data sets). Example: COVIDx.
Using an LLM to generate training data: the output is likely to contain duplicates. Prompt students to use an LLM of their choice to generate training data for a specific problem using a specific prompt, and ask them to look for duplicates in the output.
Data augmentation or oversampling before split.
Example
Suggestion: use a simple linear regression example.
Generate some random data.
generate a finite data set (100 samples, make this a tunable parameter also) using some random distribution
generate the target variable using some "known" coefficients
sample from the finite data set with replacement (can be more than 100 samples) (first tunable parameter!)
compute the number of duplicates
add random noise to the sampled data set (second tunable parameter!)
divide into training and test; compute the overlap between training and test (how many samples in the test set have a near-duplicate in training)
train a linear regression, evaluate on test set
then, generate a new "clean" test set using the same random distribution, generating the target variable using the same known coefficients, and adding the same level of random noise. evaluate the already fitted model on this "clean" test set.
compare the performance on the "bad" test set and the "clean" test set.
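A minimal sketch of the steps above (the coefficients, sample sizes, and noise level are placeholder values to be tuned in the notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)  # fixed seed so every run gives the same numbers

N = 100          # size of the finite data set (tunable)
N_SAMPLED = 300  # samples drawn with replacement (first tunable parameter)
NOISE = 0.1      # noise added to the sampled data (second tunable parameter)

# finite data set with "known" coefficients (values chosen arbitrarily here)
X = rng.normal(size=(N, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y = X @ true_coef + rng.normal(scale=NOISE, size=N)

# sample with replacement -> duplicates appear
idx = rng.integers(0, N, size=N_SAMPLED)
n_duplicates = N_SAMPLED - len(np.unique(idx))
print("duplicated draws:", n_duplicates)

# add noise to the sampled features, then split
X_s = X[idx] + rng.normal(scale=NOISE, size=(N_SAMPLED, 3))
y_s = y[idx]
X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
mse_leaky = mean_squared_error(y_te, model.predict(X_te))

# fresh "clean" test set from the same distribution and the same coefficients
X_clean = rng.normal(size=(N, 3))
y_clean = X_clean @ true_coef + rng.normal(scale=NOISE, size=N)
mse_clean = mean_squared_error(y_clean, model.predict(X_clean))

print(f"MSE on leaky test set: {mse_leaky:.4f}")
print(f"MSE on clean test set: {mse_clean:.4f}")
```

Drawing 300 samples from a pool of 100 guarantees duplicates, so the leaky split puts near-copies of training rows into the test set, which the comparison at the end is meant to expose.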
With the oversampling example: let's do it in a way as similar as possible to the first example. But I like that you show the correct way to do oversampling.
CIFAR-100 example
Give specific numbers re: duplicates in train, validation, and test. Show how to measure the extent of duplication.
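One way to measure the extent of duplication is to hash each image and count how many test hashes also appear in the training set. This sketch uses small random arrays as stand-ins; with the real CIFAR-100 arrays loaded, only the first two assignments would change:

```python
import hashlib

import numpy as np

rng = np.random.default_rng(0)

# stand-ins for the train/test image arrays; 50 exact duplicates injected into test
train = rng.integers(0, 256, size=(1000, 8, 8, 3), dtype=np.uint8)
test = np.concatenate([train[:50],
                       rng.integers(0, 256, size=(150, 8, 8, 3), dtype=np.uint8)])

def digest(img):
    # hash the raw bytes of the image: identical arrays get identical digests
    return hashlib.sha256(img.tobytes()).hexdigest()

train_hashes = {digest(img) for img in train}
overlap = sum(digest(img) in train_hashes for img in test)
print(f"{overlap} of {len(test)} test images are exact duplicates of a training image")
```

Note that hashing only catches exact duplicates; near duplicates need a distance-based check.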
Instead of this reference, the paper "Leakage and the reproducibility crisis in machine learning-based science" is probably more appropriate.
Put a "References" section at the end, give a full citation, and then use numerical references. (For all of the references!)
"Duplicates" image: give full citation to paper in the references section at the end, then in caption, write "From [X]".
You can't re-use verbatim text without making it clear that it's not your own and attributing the original source.
Talk about "exact duplicate" and "near duplicate". Near duplicates include, more broadly, images of the same sample, even if it's not the same source image.
Use simple language, only use technical detail when it's actually required.
Still need to "Prompt students to use an LLM of their choice to generate training data for a specific problem using a specific prompt, ask them to look for duplicates in the output."
In the Data augmentation part, give an example. Sample a few images, augment entire dataset, split into train and test, and show the near-duplicates in train and test.
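A minimal sketch of that augmentation example (toy arrays standing in for sampled images; the flip is one possible augmentation):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# toy "images" standing in for a small sampled data set
n = 20
images = rng.normal(size=(n, 8, 8))

# WRONG order: augment the entire data set first (horizontal flips) ...
augmented = np.concatenate([images, images[:, :, ::-1]])
origin = np.concatenate([np.arange(n), np.arange(n)])  # which original each row came from

# ... then split: near-duplicates of the same original can land on both sides
tr_idx, te_idx = train_test_split(np.arange(len(augmented)),
                                  test_size=0.25, random_state=0)
leaked = len(set(origin[tr_idx]) & set(origin[te_idx]))
print(f"{leaked} originals have a near-duplicate in both train and test")
```

The correct order is the reverse: split the original images into train and test first, then augment only the training portion.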
Inline code is preferred over function calls. (generate_data is OK to keep as a function.)
In the example, don't call it oversampling. Instead, describe it as: "Imagine that we have a dataset that accidentally has duplicates..."
(plans to update the oversampling example to match the first one.)
Name the two examples: "Example with accidental overlap between training and test set" and "Example with incorrect oversampling"
To the extent possible, in the "detecting duplicates" section - have code examples.
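For instance, a code example for the "detecting duplicates" section could cover both kinds of duplicates: exact duplicates via row deduplication, and near duplicates via a nearest-neighbor distance threshold (the 0.5 threshold here is an arbitrary placeholder and is problem-dependent):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# 100 base samples, plus 5 exact copies and 10 slightly perturbed copies
base = rng.normal(size=(100, 16))
data = np.concatenate([
    base,
    base[:5],                                           # exact duplicates
    base[:10] + rng.normal(scale=0.01, size=(10, 16)),  # near duplicates
])

# exact duplicates: compare row counts before and after deduplication
n_exact = len(data) - len(np.unique(data, axis=0))

# near duplicates: distance to the closest *other* point below a threshold
nn = NearestNeighbors(n_neighbors=2).fit(data)
dist, _ = nn.kneighbors(data)           # dist[:, 0] is each point's distance to itself
n_near = int((dist[:, 1] < 0.5).sum())  # threshold is problem-dependent

print(f"exact duplicate rows: {n_exact}")
print(f"points with a near-duplicate: {n_near}")
```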
The Frankenstein data sets seem to be an example of "reasons specific to the data and how it was collected". You shouldn't remove it; instead, use it as an example of that scenario.
In the LLM example, replace Mistral with a model that doesn't require a login to access (try to keep everything within the notebook unless necessary).
Set the seed for everything to make sure the numbers are the same for every run
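A cell like this near the top of the notebook would cover the common sources of randomness (a sketch; extend it to whichever other libraries the notebook actually uses):

```python
import random

import numpy as np

SEED = 42
random.seed(SEED)                  # Python's built-in RNG
np.random.seed(SEED)               # NumPy legacy global RNG (used implicitly by some libraries)
rng = np.random.default_rng(SEED)  # preferred: pass this generator around explicitly

# scikit-learn: additionally pass random_state=SEED to every splitter and model,
# e.g. train_test_split(X, y, random_state=SEED)
```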
For the toy example, stick to one model for examples if possible (either LR or RF) to decrease the pre-requisites
For the added exercise, add a cell with some missing code, where students write the correct implementation without data leakage.