isayev / ASE_ANI

ANI-1 neural net potential with python interface (ASE)
MIT License

Clarification on cross val splits? #13

Closed proteneer closed 6 years ago

proteneer commented 6 years ago

How are you guys doing the train/valid/test splits? Are you splitting on chemotypes (i.e. all conformers of a given molecule are lumped together in one split) or splitting on conformations (all conformers are scattered randomly across the splits)? In other words, can train/valid/test share chemotypes?

Jussmith01 commented 6 years ago

Splitting on conformers. In the ANI-1 paper we put all 22M conformers in a bucket. Then, from the bucket, we pick 80% for training, 10% for validation, and 10% for test. This means there is some probability that some of the 57k configurations (from which the conformers were generated) have no conformers in the training set.
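
For readers following along, the bucket procedure described above could be sketched roughly as follows. This is only an illustration under assumptions, not the actual ANI-1 pipeline: the function name `split_conformers`, the fixed seed, and the toy size of 1000 conformers are all hypothetical.

```python
import random

def split_conformers(n_conformers, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle all conformer indices in one bucket, then cut 80/10/10.

    Hypothetical helper: conformers are treated as interchangeable,
    with no regard for which molecule they came from.
    """
    indices = list(range(n_conformers))
    random.Random(seed).shuffle(indices)

    n_train = round(fractions[0] * n_conformers)
    n_valid = round(fractions[1] * n_conformers)

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test

# Toy example: 1000 conformers instead of 22M.
train, valid, test = split_conformers(1000)
print(len(train), len(valid), len(test))
```

Because the shuffle ignores molecule identity, conformers of one molecule will generally land in more than one split.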

proteneer commented 6 years ago

Just so I understand - you're allowing a single chemotype (e.g. methane, CH4) with 1000 conformers to be split such that 800 CH4 conformers are in train, 100 are in valid, and 100 are in test? This seems a little strange, since I imagine a better test of generalizability is to split on the chemotype: given a collection of methane, ethane, propane, butane, etc., split it such that all methane and ethane conformers are in train, all propane conformers are in valid, and all butane conformers are in test.

proteneer commented 6 years ago

Just to be clear again, we're doing validation such that training/valid/test are guaranteed to partition the 57k configurations disjointly, with absolutely no overlap.
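
A configuration-level split of this kind might look something like the sketch below. It is a minimal illustration under assumptions: `split_by_configuration`, the `config_ids` list (one configuration label per conformer), and the toy data are hypothetical names, not code from either group.

```python
import random

def split_by_configuration(config_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole configurations to a split, then gather conformer indices.

    config_ids[i] is the (hypothetical) configuration label of conformer i.
    Every conformer of a configuration ends up in the same split.
    """
    unique = sorted(set(config_ids))
    random.Random(seed).shuffle(unique)

    n_train = round(fractions[0] * len(unique))
    n_valid = round(fractions[1] * len(unique))

    # Map each configuration to exactly one split.
    assignment = {}
    for pos, cid in enumerate(unique):
        if pos < n_train:
            assignment[cid] = "train"
        elif pos < n_train + n_valid:
            assignment[cid] = "valid"
        else:
            assignment[cid] = "test"

    # Collect conformer indices per split.
    splits = {"train": [], "valid": [], "test": []}
    for idx, cid in enumerate(config_ids):
        splits[assignment[cid]].append(idx)
    return splits

# Toy example: 10 configurations with 5 conformers each.
config_ids = [i // 5 for i in range(50)]
splits = split_by_configuration(config_ids)
groups = {name: {config_ids[i] for i in idxs} for name, idxs in splits.items()}
print({name: len(idxs) for name, idxs in splits.items()})
```

Here `groups` records which configurations appear in each split; by construction those three sets do not overlap.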