sirmarcel / cmlkit

tools for machine learning in condensed matter physics and quantum chemistry
MIT License
34 stars 6 forks source link

Edge case in Dataset.compute_dataset_info #7

Open sxie22 opened 4 years ago

sxie22 commented 4 years ago

This is a minor concern, especially if Dataset is being phased out. When creating a Dataset with only single-atom geometries, dists is a list of empty arrays and min_distance and max_distance fail. A real-world example of this might be energy-volume curves with FCC primitive cells.

https://github.com/sirmarcel/cmlkit/blob/539021520de316e94c1490f957f923ab41bb18c3/cmlkit/dataset/dataset.py#L548-L552

Toy example:

from ase import Atoms
from cmlkit.dataset import Dataset

geometries = [Atoms('Au', cell=[3, 3, 3], pbc=True),
              Atoms('Au', cell=[4, 4, 4], pbc=True),
              Atoms('Au', cell=[5, 5, 5], pbc=True),]
Dataset.from_Atoms(geometries)

ValueError                                Traceback (most recent call last)
<ipython-input-64-25473c66c8a8> in <module>
      5               Atoms('Au', cell=[3, 3, 3], pbc=True),
      6               Atoms('Au', cell=[4, 4, 4], pbc=True),]
----> 7 Dataset.from_Atoms(geometries)

~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in from_Atoms(cls, atoms, p, name, desc, splits)
    288             name=name,
    289             desc=desc,
--> 290             splits=splits,
    291         )
    292 

~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in __init__(self, z, r, b, p, name, desc, splits, _info, _hash, _geom_hash)
    169             self.info = _info
    170         else:
--> 171             self.info = self.get_info()
    172 
    173         # compute auxiliary info that we need to convert properties

~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in get_info(self)
    218     def get_info(self):
    219         """Compute information on dataset."""
--> 220         return compute_dataset_info(self)
    221 
    222     def get_hash(self):

~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in compute_dataset_info(dataset)
    549         qmmlpack.lower_triangular_part(qmmlpack.distance_euclidean(rr), -1) for rr in r
    550     ]
--> 551     i["min_distance"] = min([min(d) for d in dists if len(d) > 0])
    552     i["max_distance"] = max([max(d) for d in dists if len(d) > 0])
    553 

ValueError: min() arg is an empty sequence
sirmarcel commented 4 years ago

Thanks for reporting! This is indeed an edge case, but I'll try to fix it when I overhaul Dataset.