mahmoodlab / CLAM

Open source tools for computational pathology - Nature BME
http://clam.mahmoodlab.org
GNU General Public License v3.0
1.13k stars, 365 forks

The splits created using Create Split Seq may mix up the Training, Validation, and Test sets, leading to data leakage. #269

Closed by eternalld 2 months ago

eternalld commented 2 months ago

In the official tumor vs. normal example currently provided in the repository, if Slide_13 is used as test data in Splits0 and then as training data in Splits1, this results in data leakage. Having the same slide appear in the test set of one split and the training set of another can skew the evaluation, giving over-optimistic performance estimates and making the reported results inaccurate.

To ensure accurate evaluation results, it is critical that no slide is shared between the training, validation, and test sets, even across different splits. If a slide is used for testing in one split, it should not appear in the training or validation set of any other split.
