Instead of using absolute accuracy metrics (0%, 100%) as the optimal baselines for tasks, it would be helpful to have dataset+task-specific baselines that help calibrate results to the difficulty of the particular task.
The truly optimal baseline would be to find the nearest training point (under the specified metric) to each test point and use that distance as an upper bound on epsilon. This is computationally infeasible for larger datasets, but an approximate nearest neighbor approach could serve well. I see this being of particular interest for certified defenses.
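To make the proposal concrete, here is a minimal sketch of the exact (brute-force) version under the L2 metric, using only numpy and toy synthetic data; the shapes, the `nearest_train_distances` helper, and the use of random Gaussian data are all illustrative assumptions, and a real dataset would swap the brute-force computation for an approximate nearest neighbor index.

```python
import numpy as np

def nearest_train_distances(train, test):
    """For each test point, the L2 distance to its nearest training
    point. Brute force, O(|test| * |train|) distance evaluations --
    exactly the part an approximate NN index would replace at scale."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (
        (test ** 2).sum(axis=1, keepdims=True)
        - 2.0 * test @ train.T
        + (train ** 2).sum(axis=1)
    )
    # Clamp tiny negative values from floating-point cancellation.
    return np.sqrt(np.maximum(d2, 0.0).min(axis=1))

# Toy example (hypothetical data): one epsilon upper bound per test point.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 32))
test = rng.normal(size=(50, 32))
eps_upper = nearest_train_distances(train, test)
print(eps_upper.shape)  # one bound per test point: (50,)
```

These per-test-point distances would then calibrate a certified-defense result: a certificate claiming robustness beyond a test point's nearest-training-point distance should be viewed with suspicion, since another training point lies within that radius.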