kjappelbaum / mofdscribe

An ecosystem for digital reticular chemistry
https://mofdscribe.readthedocs.io/en/latest/
MIT License

other splitting strategies (implement `DUPLEX` algorithm?) #241

Open kjappelbaum opened 2 years ago

kjappelbaum commented 2 years ago

I made the mistake of going down the rabbit hole of looking into data splitting algorithms. There is a lot out there but no clear guidelines, i.e., there would be a need for a benchmark at some point ...

Duplex algorithm

The DUPLEX algorithm, developed by R. W. Kennard, is recommended for dividing the data into the estimation set and prediction set when there is no obvious variable such as time to use as a basis to split the data.

https://www.jstor.org/stable/1267881#metadata_info_tab_contents

The algorithm is basically Kennard-Stone (greedy MaxMin) alternating between two sets.

Seems to have some users in cheminformatics and also R implementations.
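
To make the "Kennard-Stone alternating between two sets" description concrete, here is a minimal sketch of how I understand DUPLEX. It assumes Euclidean distances on a feature matrix; the names `duplex_split` and `n_second` are hypothetical and not part of mofdscribe or any existing package. Treat it as an illustration, not a reference implementation.

```python
import numpy as np


def duplex_split(X, n_second=None):
    """Hypothetical helper: DUPLEX-style split of row indices into two sets.

    Assumes Euclidean distance. `n_second` caps the size of the second set;
    once it is full, all remaining points go to the first set.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    if n < 4:
        raise ValueError("need at least 4 points to seed both sets")
    if n_second is None:
        n_second = n // 2
    if not 2 <= n_second <= n - 2:
        raise ValueError("n_second must be between 2 and n - 2")

    # Full pairwise distance matrix (O(n^2) memory, fine for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))

    remaining = list(range(n))
    sets = ([], [])

    # Seed each set with the two mutually farthest points still unassigned.
    for s in sets:
        sub = dist[np.ix_(remaining, remaining)]  # copy, safe to modify
        np.fill_diagonal(sub, -np.inf)  # never pair a point with itself
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        pair = [remaining[i], remaining[j]]
        s.extend(pair)
        remaining = [r for r in remaining if r not in pair]

    # Alternate greedy MaxMin (Kennard-Stone) steps between the two sets.
    turn = 0
    while remaining:
        if turn == 1 and len(sets[1]) >= n_second:
            turn = 0  # second set is full, keep filling the first
        s = sets[turn]
        # Distance from each candidate to its nearest neighbour in the set.
        min_d = dist[np.ix_(remaining, s)].min(axis=1)
        s.append(remaining.pop(int(np.argmax(min_d))))
        turn = 1 - turn

    return np.array(sets[0]), np.array(sets[1])
```

Usage would be something like `first_idx, second_idx = duplex_split(features, n_second=len(features) // 5)`, where `features` is whatever descriptor matrix is used for the distance computation. Because both sets are built with the same MaxMin rule, they should cover feature space similarly, which is the main difference from plain Kennard-Stone.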

SPlit: An Optimal Method for Data Splitting

Compares, among others, with DUPLEX

https://arxiv.org/pdf/2012.10945.pdf

kjappelbaum commented 2 years ago

should read

kjappelbaum commented 2 years ago

Here's some more info, also discussing DUPLEX and the original CADEX. However, no strong conclusions.

kjappelbaum commented 2 years ago

also realizing that the prospectr vignette is really good

kjappelbaum commented 2 years ago

i think it should be relatively easy to implement, based on our CADEX implementation - however, i do not know if there is any use for it