jacksund / simmate

The Simulated Materials Ecosystem (Simmate) is a toolbox and framework for computational materials research.
https://simmate.org
BSD 3-Clause "New" or "Revised" License

"CrystalNN_fast" for evolutionary algorithm #387

Open scott-materials opened 1 year ago

scott-materials commented 1 year ago

Describe the desired feature

CrystalNN is our go-to for fingerprinting in the evolutionary algorithm and reverse Monte Carlo. The current code for this validator generates a vector of length 244. This is because the

cnnf.from_preset("ops", ... )

uses all 61 structural descriptors from CrystalNN, and we're calculating four different statistics for each:

stat_options: list[str] = ["mean", "std_dev", "minimum", "maximum"]

I want to have an alternate version of this function, "CrystalNN_fast", that will accelerate database queries and be compatible with the default versions of Postgres Cube and Digital Ocean. I believe we can use a version of the CrystalNN fingerprint that uses a vector of length 49. Here's how.

1) Only calculate the mean.

Because we will only query structures that have identical stoichiometries, the search space is already much smaller before any fingerprints are compared. The probability of duplicates should still be extremely small even without the std_dev, min, and max.

2) Only calculate linearly independent vectors.

By default, the preset ops generates duplicate information. For example, for coordination number 8, ops calculates these:

8: ['body-centered cubic', 'hexagonal bipyramidal'],

but it also calculates a value called wt_8 that is simply the sum of the first two vectors. Since wt_8 has no additional descriptive information, we should exclude it. This unnecessary information occurs for every coordination number from 1 to 12. Therefore, by excluding these 12 values, we can drop the 61 vectors down to 49. Hooray!
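As a quick sanity check on the arithmetic above, here is a minimal sketch that recounts the vector lengths. The per-coordination-number motif counts are taken from the motif lists quoted in this issue. The 100-dimension cap for the Postgres cube extension is an assumption about the stock build and should be confirmed against the target database.

```python
# Number of order-parameter motifs per coordination number (CN 1-12),
# counted from the motif lists quoted in this issue.
motifs_per_cn = {
    1: 1, 2: 5, 3: 3, 4: 5, 5: 3, 6: 3,
    7: 2, 8: 2, 9: 3, 10: 3, 11: 3, 12: 4,
}

n_motifs = sum(motifs_per_cn.values())  # motif descriptors for CN 1-12
n_wt_low = len(motifs_per_cn)           # one "wt" entry per CN 1-12 (to be dropped)
n_wt_high = 24 - 12                     # one "wt" entry per CN 13-24 (kept)

full_length = n_motifs + n_wt_low + n_wt_high  # descriptors in the "ops" preset
fast_length = n_motifs + n_wt_high             # descriptors after dropping wt for CN 1-12

n_stats_full = 4  # mean, std_dev, minimum, maximum
n_stats_fast = 1  # mean only

# Assumption: the default cube build rejects vectors with more than 100 dimensions.
CUBE_MAX_DIM = 100

print(full_length, full_length * n_stats_full)  # 61 244
print(fast_length, fast_length * n_stats_fast)  # 49 49
print(fast_length * n_stats_fast <= CUBE_MAX_DIM)  # True
```

Under that assumed cap, dropping either the extra statistics or the 12 wt entries alone would already be enough (61 or 196 versus 244), but only the combination gets down to 49.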

Here's how we do it:

geometries = {
1: ['sgl_bd'],
2: ['L-shaped', 'water-like', 'bent 120 degrees', 'bent 150 degrees', 'linear'],
3: ['trigonal planar', 'trigonal non-coplanar', 'T-shaped'],
4: ['square co-planar', 'tetrahedral', 'rectangular see-saw-like', 'see-saw-like', 'trigonal pyramidal'],
5: ['pentagonal planar', 'square pyramidal', 'trigonal bipyramidal'],
6: ['octahedral', 'pentagonal pyramidal', 'hexagonal planar'],
7: ['hexagonal pyramidal', 'pentagonal bipyramidal'],
8: ['body-centered cubic', 'hexagonal bipyramidal'],
9: ['q2', 'q4', 'q6'],
10: ['q2', 'q4', 'q6'],
11: ['q2', 'q4', 'q6'],
12: ['cuboctahedral', 'q2', 'q4', 'q6'],
13: ['wt'],
14: ['wt'],
15: ['wt'],
16: ['wt'],
17: ['wt'],
18: ['wt'],
19: ['wt'],
20: ['wt'],
21: ['wt'],
22: ['wt'],
23: ['wt'],
24: ['wt'],
}

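# CNNF and SSF are assumed to be the matminer aliases already used by the current validator
from matminer.featurizers.site import CrystalNNFingerprint as CNNF
from matminer.featurizers.structure import SiteStatsFingerprint as SSF

# Build the site featurizer from the trimmed geometry list instead of the "ops" preset
cnnf_method = CNNF(geometries, distance_cutoffs=None, x_diff_weight=0)
# Keep only the per-structure mean of each site descriptor
stat_options = ["mean"]
stats_method = SSF(cnnf_method, stats=stat_options)

# some_new_structure is a placeholder for a pymatgen Structure; the result should have 49 entries
stats = stats_method.featurize(some_new_structure)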
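To illustrate how the 49-length vector might be used downstream, here is a rough sketch of the two steps discussed above: restricting the comparison to entries with the same stoichiometry, and formatting a vector for the Postgres cube extension. The helper names `to_cube_literal` and `closest_match` are hypothetical and not part of Simmate, the formulas are plain strings for simplicity, and the 100-dimension cap is an assumption about the default cube build.

```python
import math

# Assumption: default dimension cap of the Postgres cube extension.
CUBE_MAX_DIM = 100


def to_cube_literal(vector):
    """Format a fingerprint as cube text input, e.g. '(0.5, 1.0)'."""
    if len(vector) > CUBE_MAX_DIM:
        raise ValueError(
            f"vector has {len(vector)} dimensions, more than {CUBE_MAX_DIM}"
        )
    return "(" + ", ".join(repr(float(v)) for v in vector) + ")"


def closest_match(new_formula, new_vector, known):
    """Return (distance, vector) for the nearest known fingerprint that
    shares the same formula, or None if no entry has that formula.

    `known` is an iterable of (formula, vector) pairs.
    """
    best = None
    for formula, vector in known:
        if formula != new_formula:
            continue  # different stoichiometry: never compared
        distance = math.dist(new_vector, vector)  # Euclidean distance
        if best is None or distance < best[0]:
            best = (distance, vector)
    return best


known = [("NaCl", [3.0, 4.0]), ("KCl", [0.0, 0.0])]
print(closest_match("NaCl", [0.0, 0.0], known))  # (5.0, [3.0, 4.0])
print(to_cube_literal([0.5, 1]))                 # (0.5, 1.0)
```

In a real database the stoichiometry filter and the distance ranking would presumably both run in SQL; the pure-Python loop here only shows the intended logic.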

@jacksund what do you think? FYI I gave Gabe one million of these 49-length vectors for testing.

jacksund commented 1 year ago

Simple enough! We can update the default featurizer in this file: https://github.com/jacksund/simmate/blob/main/src/simmate/toolkit/validators/fingerprint/pcrystalnn.py

You can see that it is the default class for the evo search here: https://github.com/jacksund/simmate/blob/main/src/simmate/apps/evolution/workflows/fixed_composition.py#L53-L57