lab-cosmo / metatrain

Training and evaluating machine learning models for atomistic systems.
https://lab-cosmo.github.io/metatrain/
BSD 3-Clause "New" or "Revised" License

Dataset split fails #290

Closed frostedoyster closed 1 month ago

frostedoyster commented 1 month ago

I've already seen this error quite a few times. It happens with moderately large to very large datasets, and changing the train/valid/test split fractions by a tiny amount fixes it.

Traceback (most recent call last):
  File "/base/lib/python3.12/site-packages/metatrain/__main__.py", line 100, in main
    train_model(**args.__dict__)
  File "/base/lib/python3.12/site-packages/metatrain/cli/train.py", line 285, in train_model
    train_dataset_new, val_dataset = _train_test_random_split(
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/base/lib/python3.12/site-packages/metatrain/utils/data/dataset.py", line 476, in _train_test_random_split
    raise ValueError(
ValueError: Sum of input lengths does not equal the length of the input dataset!
PicoCentauri commented 1 month ago

Hmm, this might be a rounding error. Do you maybe have an example input file that you can share?
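For illustration, here is a minimal standalone sketch of the rounding hypothesis (this is not metatrain's actual split code): if each split length is computed independently by truncating `fraction * n`, the resulting integers need not sum back to the dataset size. Using the sizes from the reproducer below (1501 structures, test 0.85, validation 0.01):

```python
import math

n = 1501           # dataset size from the reproducer
test_frac = 0.85   # values from the options file
valid_frac = 0.01
train_frac = 1.0 - test_frac - valid_frac  # ~0.14, with floating-point error

# Truncating each split length independently can lose samples:
# the three integers do not have to add up to n.
lengths = [math.floor(frac * n) for frac in (train_frac, test_frac, valid_frac)]
print(lengths, sum(lengths))  # → [210, 1275, 15] 1500, one sample short of 1501
```

If something like this is happening, deriving one of the lengths as `n - sum(others)` instead would guarantee the splits always sum exactly to `n`, which is what `torch.utils.data.random_split` requires.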

frostedoyster commented 1 month ago

Here is an example: you first have to expand the small ethanol dataset like this:

import ase.io

# Read all 100 frames of the small ethanol dataset,
# then tile them to get 15 * 100 + 1 = 1501 structures
structures = ase.io.read("ethanol_reduced_100.xyz", ":")
more_structures = structures * 15 + [structures[0]]
ase.io.write("ethanol_1501.xyz", more_structures)

and then run training with this options file:

seed: 42

architecture:
  name: experimental.soap_bpnn
  training:
    batch_size: 2
    num_epochs: 1

training_set:
  systems:
    read_from: ethanol_1501.xyz
    length_unit: angstrom
  targets:
    energy:
      key: energy
      unit: eV

test_set: 0.85
validation_set: 0.01
PicoCentauri commented 1 month ago

Thanks, I will look into this.