kalininalab / DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.
https://datasail.readthedocs.io
MIT License

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case. #19

Closed EasternCaveMan closed 8 months ago

EasternCaveMan commented 8 months ago

Dear Roman, I split my data with the I1f technique, which resulted in a test set of size 0.1:

datasail --f-type P --f-data All_sequences.fasta --f-sim cdhit --output split_SCIP_I1f --techniques I1f --splits 0.8 0.2 --names train test --solver SCIP --f-args "-M 10000" --to-sec 1000

I got this error while plotting the ROC AUC curve:

  File "/scratch/SCRATCH_SAS/vahid/ESP/notebooks_and_code/TGBM_EM1bts_ECFP.py", line 121, in <module>
    roc_auc = roc_auc_score(np.array(test_y), bst.predict(dtest))
  File "/home/vat23/miniconda3/envs/ESP/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 572, in roc_auc_score
    return _average_binary_score(
  File "/home/vat23/miniconda3/envs/ESP/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/vat23/miniconda3/envs/ESP/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 339, in _binary_roc_auc_score
    raise ValueError(
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

When I set --epsilon 0.0, the test set size is 0.2 and I don't get this error while plotting the ROC AUC curve.
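The error above is raised whenever the labels passed as y_true contain a single class. A minimal sketch of a guard that could run before the scoring call is shown here; the helper names `class_counts` and `safe_score` are hypothetical and are not part of scikit-learn or DataSAIL, and the metric is passed in as a callable so the snippet has no dependencies.

```python
from collections import Counter


def class_counts(y_true):
    """Count how often each label occurs in y_true."""
    return Counter(y_true)


def safe_score(y_true, y_score, scorer):
    """Call scorer(y_true, y_score) only if y_true holds at least two classes.

    Returns None instead of raising when the metric is undefined.
    `scorer` is any callable with the (y_true, y_score) signature,
    for example sklearn.metrics.roc_auc_score.
    """
    if len(class_counts(y_true)) < 2:
        return None
    return scorer(y_true, y_score)


# Hypothetical usage with the names from the traceback above:
#   from sklearn.metrics import roc_auc_score
#   print(class_counts(test_y))
#   roc_auc = safe_score(test_y, bst.predict(dtest), roc_auc_score)
```

Printing the class counts of the test labels first shows directly whether the split, rather than the model, is the cause of the error.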

Old-Shatterhand commented 8 months ago

Hey @EasternCaveMan, DataSAIL treats the split sizes as a hard constraint: each split must stay within the epsilon error margin of its requested size. Within the solution space defined by these hard constraints, DataSAIL optimizes the splits based on the weights of the data points and the similarities between them. Depending on the dataset, the test set may therefore end up containing only data points with one label.

For now, you can use the --runs option to create multiple splits and check for one in which samples from both classes are present in the test set. From version 1.0.0 on, you can specify stratification to balance classes in each split. If you install DataSAIL from source (branch dev_1.0), you can already use a beta version of it (it's not fully tested or documented yet).

Best, Roman
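The check suggested above (create several splits with --runs, then pick one whose test set contains both classes) could be sketched as follows. This is an illustration under assumptions: each run is represented as a plain dict mapping a sample ID to its split name, and the labels as a dict mapping a sample ID to its class. How these dicts are loaded from the DataSAIL output files is not shown, and the function names are hypothetical.

```python
def covers_all_classes(assignment, labels, split_names=("train", "test")):
    """Return True if every split in split_names contains every class.

    assignment: dict sample ID -> split name (assumed format of one run)
    labels:     dict sample ID -> class label
    """
    all_classes = set(labels.values())
    seen = {name: set() for name in split_names}
    for sample_id, split in assignment.items():
        # Samples assigned to any other split name are ignored.
        if split in seen:
            seen[split].add(labels[sample_id])
    return all(seen[name] == all_classes for name in split_names)


def first_usable_run(runs, labels, split_names=("train", "test")):
    """Return the index of the first run whose splits cover all classes, else None."""
    for index, assignment in enumerate(runs):
        if covers_all_classes(assignment, labels, split_names):
            return index
    return None


# Toy data: four samples, two classes, two runs.
labels = {"a": 1, "b": 0, "c": 1, "d": 0}
runs = [
    {"a": "train", "b": "train", "c": "train", "d": "test"},  # test holds only class 0
    {"a": "train", "b": "train", "c": "test", "d": "test"},   # both splits hold both classes
]
print(first_usable_run(runs, labels))
```

If no run qualifies, the function returns None; in that case more runs or the stratification option mentioned above would be the next thing to try.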