I made the mistake of going down the rabbit hole of looking into data splitting algorithms. There is a lot out there but no clear guidelines, i.e. there would be a need for a benchmark at some point ...
Duplex algorithm
The DUPLEX algorithm, developed by R. W. Kennard, is recommended for dividing the data into the estimation set and prediction set when there is no obvious variable such as time to use as a basis to split the data.
https://www.jstor.org/stable/1267881#metadata_info_tab_contents
The algorithm is basically Kennard-Stone (greedy MaxMin) alternating between two sets.
It seems to have some users in cheminformatics, and there are also R implementations.
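To make the "Kennard-Stone alternating between two sets" idea concrete, here is a minimal sketch of how I understand it. The function name `duplex_split` and the `n_test` parameter are my own, not from the paper or any library. It uses plain Euclidean distances on the raw features; as far as I can tell the original description transforms the data first (standardize / orthonormalize), which this sketch skips.

```python
import numpy as np


def duplex_split(X, n_test=None):
    """DUPLEX-style split (illustrative sketch): returns (train_idx, test_idx)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    if n < 4:
        raise ValueError("need at least 4 samples")
    # Hypothetical option: cap the test set size; default is an even split.
    n_test = n // 2 if n_test is None else n_test
    if not 2 <= n_test <= n - 2:
        raise ValueError("n_test must be between 2 and n - 2")

    # Full pairwise Euclidean distance matrix (O(n^2) memory, small data only).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, -np.inf)  # a point is never paired with itself

    remaining = list(range(n))
    sets = ([], [])  # sets[0] = train, sets[1] = test

    def pop_farthest_pair():
        # The two remaining points that are farthest apart.
        sub = D[np.ix_(remaining, remaining)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        pair = [remaining[i], remaining[j]]
        for p in pair:
            remaining.remove(p)
        return pair

    # Seed each set with a farthest pair: train first, then test.
    sets[0].extend(pop_farthest_pair())
    sets[1].extend(pop_farthest_pair())

    turn = 0
    while remaining:
        if turn == 1 and len(sets[1]) >= n_test:
            turn = 0  # test set is full, everything else goes to train
        chosen = sets[turn]
        # Kennard-Stone step: pick the remaining point whose nearest
        # already-chosen neighbour in this set is as far away as possible.
        min_dist = D[np.ix_(remaining, chosen)].min(axis=1)
        chosen.append(remaining.pop(int(np.argmax(min_dist))))
        turn = 1 - turn

    return np.array(sets[0]), np.array(sets[1])
```

Calling `train_idx, test_idx = duplex_split(X)` on an `(n_samples, n_features)` array gives two disjoint index arrays that each try to cover the feature space. Ties are broken by whichever point comes first, so the result is deterministic but depends on row order.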
SPlit: An Optimal Method for Data Splitting
Compares against DUPLEX, among others.
https://arxiv.org/pdf/2012.10945.pdf