kjappelbaum opened this issue 2 years ago
Great points. Pre-screening and granularity for the matching would both make a lot of sense. The latter especially would offer more interpretability to the metrics. The 4 metric types (validity, coverage, novelty, and uniqueness) are also easily transferred to the different categories (e.g. val, cov, nov, uniq for composition). For composition, Element Mover's Distance via chem_wasserstein with some threshold would probably work well. For the Euclidean embedding, Kabsch RMSD with a threshold also seems like it would work. Any thoughts on how bonding matching would be implemented?
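For concreteness, a minimal sketch of what a thresholded compositional match matrix might look like (the distance callable and the 0.1 cutoff are placeholders rather than validated choices; Element Mover's Distance via chem_wasserstein or ElMD would be one way to supply `comp_distance`):

```python
# Sketch of thresholded compositional matching -- not a final implementation.
# `comp_distance` is any pairwise composition distance (e.g., Element Mover's
# Distance); the 0.1 cutoff is an arbitrary illustrative threshold.
import numpy as np

def composition_match_matrix(comps_a, comps_b, comp_distance, threshold=0.1):
    """Boolean matrix: True where two compositions count as 'matching' under the threshold."""
    dists = np.array([[comp_distance(a, b) for b in comps_b] for a in comps_a])
    return dists <= threshold
```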
I think for structure validity CDVAE just looked at whether the shortest distance between any two atoms is larger than 0.5 Angstroms.
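For reference, a sketch of that validity check with pymatgen (Structure.distance_matrix accounts for periodic images; the 0.5 Å cutoff follows the CDVAE convention mentioned above):

```python
import numpy as np
from pymatgen.core import Structure

def is_structurally_valid(structure: Structure, cutoff: float = 0.5) -> bool:
    """CDVAE-style validity check: no two atoms closer than `cutoff` Angstrom."""
    dists = structure.distance_matrix  # pairwise distances, nearest periodic image
    np.fill_diagonal(dists, np.inf)    # ignore self-distances
    return bool(dists.min() > cutoff)
```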
EDIT: after playing around a bit with the current implementation, I agree that it's prohibitively expensive to use StructureMatcher. Computing a ~$5000 \times 1000$ match matrix for a single metric might take around 100 minutes. Another alternative would be a crystal distance metric (e.g., CrystalNN), which is probably closer to the bonding matching you mentioned. Computing the three categories you mentioned that way would probably be a lot faster than using StructureMatcher. Likewise, pre-screening would probably also help a great deal with speedup.
EDIT: It took ~4 hours to run all the metrics across the 5 mp-time-split folds for 100 generated structures, so probably ~40 hours for 1000 structures.
Any thoughts on how bonding matching would be implemented?
I use https://github.com/kjappelbaum/structuregraph-helpers/blob/main/src/structuregraph_helpers/hash.py to remove duplicates. This implementation will give you an upper bound on the matches. One could then find the "truly" matching ones by doing the "real" isomorphism check.
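Not your implementation, but a rough sketch of that kind of hash-based pre-screen with pymatgen + networkx (the CrystalNN bonding heuristic, the element-symbol node labels, and ordered structures are all assumptions here):

```python
import networkx as nx
from pymatgen.analysis.graphs import StructureGraph
from pymatgen.analysis.local_env import CrystalNN

def wl_hash(structure):
    """Weisfeiler-Lehman hash of the undirected bonded structure graph (ordered structures assumed)."""
    sg = StructureGraph.with_local_env_strategy(structure, CrystalNN())
    g = nx.Graph(sg.graph.to_undirected())  # collapse edge direction/multiplicity
    for i, site in enumerate(structure):
        g.nodes[i]["element"] = site.species_string
    return nx.weisfeiler_lehman_graph_hash(g, node_attr="element")

def prescreen_duplicates(structures):
    """Group candidate duplicates by hash; refine each group with a full isomorphism or StructureMatcher check."""
    groups = {}
    for s in structures:
        groups.setdefault(wl_hash(s), []).append(s)
    return groups
```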
On a somewhat related note, I could also try to reuse match matrices calculated for earlier folds in later ones. This would reduce the computational burden by maybe a factor of 2-3. It would make the code more convoluted, though, and would probably require a lot of logic checks.
To pull in some additional perspectives, I created this matsci post.
Faster StructureMatching would be a dream. It's unclear whether a better strategy is to optimize the pymatgen code (I suspect it could probably be optimized significantly), or find an alternative method (e.g., structural fingerprinting, for a more heuristic-based approach). This has certainly been the limiting factor in some of my own projects.
Last time I had a chance to look at this myself I created this PR: https://github.com/materialsproject/pymatgen/pull/1654 but was not sufficiently convinced I could improve it -- or even sure, if I'm honest, where the largest bottlenecks are without doing additional profiling.
I suspect someone confident in writing Cython and/or using numba could write a much faster version, and it would reward careful attention. It should be noted that the current pymatgen code does use some custom Cython, however.
Otherwise, I think if I needed a "good-enough" solution for a project, I might use one of the kernels offered by dscribe. These are very fast, but are not as strict (due to the distance cut-offs used, etc.), nor as battle-tested (at least for our specific use cases), so I imagine there might be some unfavorable edge cases. That said, overall, I like them a lot and recommend them frequently.
Finally, another strategy I've used for coarse graining is to compare by space group first when I want to structure match (e.g., one calculated with a fine and a looser symprec, and if either matches...), but I'm not wholly convinced by this either, unfortunately! It is fast though :)
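For concreteness, a sketch of that kind of space-group pre-screen (pymatgen's SpacegroupAnalyzer; the two symprec values are arbitrary placeholders):

```python
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def spacegroups_compatible(s1, s2, symprecs=(0.01, 0.1)):
    """Coarse pre-screen: treat two structures as possible matches if their
    space group numbers agree at either a fine or a loose symprec."""
    for symprec in symprecs:
        n1 = SpacegroupAnalyzer(s1, symprec=symprec).get_space_group_number()
        n2 = SpacegroupAnalyzer(s2, symprec=symprec).get_space_group_number()
        if n1 == n2:
            return True
    return False
```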
Actually -- some additional notes :)
I think it's important to consider the use case carefully. Materials Project relies on StructureMatcher heavily as it essentially codifies our definition of a "material". Therefore it's very important that it's correct, consistent, and reliable -- more important than speed. But for many applications, this is not the priority.
I also wanted to note AFLOW-XtalFinder. I have not evaluated this myself, but it's a newer effort in this arena, and I would be interested to know if it's competitive.
@mkhorton really appreciate your comments!
Faster StructureMatching would be a dream. It's unclear whether a better strategy is to optimize the pymatgen code (I suspect it could probably be optimized significantly), or find an alternative method (e.g., structural fingerprinting, for a more heuristic-based approach). This has certainly been the limiting factor in some of my own projects.
This is really helpful feedback, as it seems like a common issue.
Last time I had a chance to look at this myself I created this PR: materialsproject/pymatgen#1654 but was not sufficiently convinced I could improve it -- or even sure, if I'm honest, where the largest bottlenecks are without doing additional profiling.
Profiling the function does seem like a good next step. Thanks for linking this!
I suspect someone confident in writing Cython and/or using numba could write a much faster version, and it would reward careful attention. It should be noted that the current pymatgen code does use some custom Cython, however.
I'm tempted to numba-fy StructureMatcher and expose it in the dist-matrix package with CPU (and possibly GPU) compatibility. JAX comes to mind as well. The GPU-compatible numba implementation was a lot of work, but I ended up with well over a 100x speedup for pairwise distance matrices compared to scipy.stats.wasserstein_distance: a $10000 \times 10000$ distance matrix takes on the order of 10 seconds.
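Roughly the pattern I have in mind, as a minimal numba sketch (not the dist-matrix implementation): for equal-length, pre-sorted 1D samples with uniform weights, the Wasserstein distance reduces to the mean absolute difference of the sorted values, which parallelizes trivially.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def pairwise_wasserstein(X, Y):
    """Pairwise 1D Wasserstein distances between rows of X and rows of Y.
    Assumes each row is sorted and all rows have the same length with uniform
    weights; dist-matrix handles the general weighted Earth Mover's Distance."""
    n, m = X.shape[0], Y.shape[0]
    D = np.empty((n, m))
    for i in prange(n):
        for j in range(m):
            D[i, j] = np.mean(np.abs(X[i] - Y[j]))
    return D

# Usage sketch: D = pairwise_wasserstein(np.sort(A, axis=1), np.sort(B, axis=1))
```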
Otherwise, I think if I needed a "good-enough" solution for a project, I might use one of the kernels offered by dscribe. These are very fast, but are not as strict (due to the distance cut-offs used, etc.), nor as battle-tested (at least for our specific use cases), so I imagine there might be some unfavorable edge cases. That said, overall, I like them a lot and recommend them frequently.
I'm glad you mentioned these. I think I've come across this, but it carries a lot more weight having a recommendation for it.
Finally, another strategy I've used for coarse graining is to compare by space group first when I want to structure match (e.g., one calculated with a fine and a looser symprec, and if either matches...), but I'm not wholly convinced by this either, unfortunately! It is fast though :)
I think I'm with you there. Fast and relatively intuitive, but not as rigorous.
I think it's important to consider the use case carefully. Materials Project relies on StructureMatcher heavily as it essentially codifies our definition of a "material". Therefore it's very important that it's correct, consistent, and reliable -- more important than speed. But for many applications, this is not the priority.
Great point. I think the rigor of StructureMatcher is a big appeal, but like you mentioned, for many applications the rigor doesn't necessarily justify the computational cost. Will give the numba implementation some more thought.
I also wanted to note AFLOW-XtalFinder. I have not evaluated this myself, but it's a newer effort in this arena, and I would be interested to know if it's competitive.
Will take a look!
From the AFLOW-XtalFinder manuscript:
Crystallographic symmetry is neglected in Structure Matcher
Any comments on what they mean by this?
EDIT: I found this from the manuscript:
Comparisons with Structure Matcher are less accurate — and at times qualitatively incorrect — due to conversions of structures to an “average lattice”, matching significantly differing lattices with no penalty on the rms value. For example, Se (ICSD #104187, space group #229) and Se (ICSD #57181, space group #166) are classified as distinct by XtalFinder because the lattices are considerably dissimilar (ϵlatt = 0.15), consistent with the space groups. Despite having different symmetries, Structure Matcher inaccurately finds rms = 0 between the structures.
Also, using group_structures would probably be more efficient than my current implementation (unless I'm misinterpreting something), although it would require some refactoring to replace (or rather, reconstruct) the match matrix.
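For reference, the grouping-based alternative would look roughly like the sketch below (a sketch only, with StructureMatcher defaults left untouched and the train/generated bookkeeping omitted):

```python
from pymatgen.analysis.structure_matcher import StructureMatcher

def group_all(train_structures, generated_structures):
    """Group train + generated structures by StructureMatcher equivalence.
    Each returned group is a list of mutually matching structures; conceptually,
    a generated structure is 'novel' if its group contains no training structure
    and 'unique' if it is the only generated structure in its group (tracking
    which list each structure came from is omitted in this sketch)."""
    sm = StructureMatcher()  # tolerances / comparator left at defaults for illustration
    return sm.group_structures(train_structures + generated_structures)
```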
Pymatgen StructureMatcher.group_structures ==> give it a group of CIFs and it groups them by similarity
- Pro: More flexible, can associate structures which are very similar but with small, symmetry-breaking distortions
- Con: Because of the aforementioned flexibility, matching corresponding Wyckoff sites between structures is nontrivial
- Con: Doesn't assign unique/stable identifiers to the structure groups
- Con: This doesn't fit the purpose described by James very well unless you already have a set of all the known structures on hand
AFLOW XtalFinder --compare2database or --compare2prototypes ==> checks a structure against various databases of isopointal prototypes (there's a Python wrapper for it too)
- Con: Less flexible; associated structures must be in the same space group and have the same occupied Wyckoff sites
- Pro: Wyckoff site equivalencies within any given structural prototype group are built-in
- Pro: Unique, stable identifiers (prototypes) are assigned to the structure group
Pro: More flexible, can associate structures which are very similar but with small, symmetry-breaking distortions
For generative models, especially when symmetry isn't directly imposed, this seems like an important feature. Related: https://github.com/qzhu2017/PyXtal/issues/199. Forcing models to respect a given symmetry seems like a better route but a lot less straightforward. I don't think even CDVAE, which I consider state-of-the-art for crystal generative models, enforces symmetry explicitly.
Con: Because of the aforementioned flexibility, matching corresponding Wyckoff sites between structures is nontrivial
Is the main issue here that it's computationally expensive?
Con: Doesn't assign unique/stable identifiers to the structure groups
Is the idea here that we get nice groupings of structures similar to how we can request all chemical formulas from a given chemical system? Is there any benefit in terms of speed-up or is the main benefit provenance and consistency?
Con: This doesn't fit the purpose described by James very well unless you already have a set of all the known structures on hand
For the current implementation of matbench-genmetrics, we take a train/test split and a set of generated structures, and compare train vs. generated (novelty), test vs. generated (coverage), and generated vs. generated (uniqueness).
Con: Less flexible; associated structures must be in the same space group and have the same occupied Wyckoff sites
This is really good to know. I don't think I would have caught that nuance immediately. Kind of goes back to @mkhorton's https://github.com/sparks-baird/matbench-genmetrics/issues/9#issuecomment-1199754866.
I found this note about speed comparisons by digging into the XtalFinder manuscript. IIUC, XtalFinder will be a lot faster for the use case of comparing large sets of structures with each other, because pairwise comparisons are generally ~3x faster, far fewer comparisons are needed for large sets, and it has built-in facilities for parallelization. Naturally, this comes at the expense of not matching structures with the small symmetry-breaking distortions mentioned earlier.
For a 1,494 pairwise comparison test set, XtalFinder averaged 282 milliseconds/comparison, Structure Matcher averaged 689 milliseconds/comparison, and XTALCOMP averaged 33 milliseconds/comparison. The test set is comprised of ICSD unaries (original geometries) with 1-105 atoms per unit cell and varying symmetries; along with a mix of equivalent and inequivalent structures. XTALCOMP is the fastest, but at the cost of limited scope and functionality: XTALCOMP does not scale volumes and quits immediately if the volumes/lattice vectors are different sizes. Therefore, XTALCOMP finds fewer matches (18), while XtalFinder and Structure Matcher find more (approximately 450). For large, skewed cells, XtalFinder can be slower since it does not convert to Minkowski, Niggli, and/or primitive cells by default to preserve the input representations (unlike Structure Matcher and XTALCOMP). To increase speed in XtalFinder, lattice transformations are available with the relevant options for Minkowski (--minkowski), Niggli (--niggli), and primitive (--primitive) reductions. ... While all benchmarks were performed serially, XtalFinder routines are parallelized and users can specify the number of threads for the analyses (--np=x), offering additional speed over other packages for large-scale automatic comparisons requiring little or no user input. Therefore, XtalFinder will be more performant, especially when comparing more structures.
Note about scalability:
For multiple comparisons, XtalFinder scales more efficiently with the number of compounds when compared to other software. XtalFinder groups structures into near isoconfigurational sets via symmetry and local atomic geometries (both calculated internally), eliminating unnecessary mapping comparisons between dissimilar structures. All other codes do not use symmetry or local geometry analyses to optimize groupings. Structure Matcher — and other straightforward extensions to pairwise comparisons — only groups by composition and executes more mapping procedures. For an ensemble of 600 structures modeling a disordered 5-metal carbide61,62, XtalFinder partitions immediately into 54 groups via the symmetry and local geometry comparisons, while Structure Matcher puts all 600 structures into one large group. Consequently, XtalFinder executes 546 structure mappings, and Structure Matcher performs 17,640 mapping attempts before arriving at the same solution.
Didn't realize that AFLOW is only directly available on macOS and Linux. I've still been holding out and developing packages that support Windows users, despite much grief.
Yes, I was intrigued reading that too -- but I haven't had time to review the code or test it at scale, so I balance that with cautious skepticism. Note that the parallelization is neither here nor there, since StructureMatcher is trivially parallelized too, as is the pre-grouping by symmetry which we do regularly, so I'm not sure the comparison is made appropriately.
To put it another way, if you just use StructureMatcher.group_structures() on a list of, say, 600 structures, without any pre-filtering or grouping, of course there will be many redundant and unneeded comparisons.
I agree about it being trivially parallelized, e.g. via RayTune. Perhaps the easiest form of pre-filtering is via comparing reduced chemical formulas. IIUC, use of ElementComparator in StructureMatcher means that if the reduced chemical formula doesn't match, there's no way for the structures to match. Since most of the chemical formulas won't match, the number of comparisons can be dramatically reduced. Seem sound?
EDIT: looks like that's already taken care of in StructureMatcher:
https://github.com/materialsproject/pymatgen/blob/baf62b77788fc43387e15b1e0b60094132815c47/pymatgen/analysis/structure_matcher.py#L599-L602
if not self._subset and self._comparator.get_hash(struct1.composition) != self._comparator.get_hash(
struct2.composition
):
return False
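For what it's worth, a sketch of the trivially parallel pairwise version (multiprocessing here rather than Ray; as the snippet above shows, StructureMatcher.fit already short-circuits when the composition hashes differ):

```python
from itertools import product
from multiprocessing import Pool

from pymatgen.analysis.structure_matcher import ElementComparator, StructureMatcher

sm = StructureMatcher(comparator=ElementComparator())

def _fit_pair(pair):
    s1, s2 = pair
    return sm.fit(s1, s2)  # returns False early if the composition hashes differ

def match_matrix(structures_a, structures_b, processes=4):
    """Pairwise match matrix computed across `processes` workers.
    Call this under an `if __name__ == "__main__":` guard on spawn-based platforms."""
    pairs = list(product(structures_a, structures_b))
    with Pool(processes) as pool:
        flat = pool.map(_fit_pair, pairs)
    m = len(structures_b)
    return [flat[i * m:(i + 1) * m] for i in range(len(structures_a))]
```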
Otherwise, I think if I needed a "good-enough" solution for a project, I might use one of the kernels offered by dscribe. These are very fast, but are not as strict (due to the distance cut-offs used, etc.), nor as battle-tested (at least for our specific use cases), so I imagine there might be some unfavorable edge cases. That said, overall, I like them a lot and recommend them frequently.
@mkhorton were you referring to e.g. AverageKernel or e.g. Coulomb matrix? I figured you were referring to the options of AverageKernel or REMatchKernel, but I wanted to clarify since Coulomb matrix and MBTR are mentioned at the beginning.
EDIT: also noticing that dscribe is also only supported on UNIX platforms, similar to XtalFinder
EDIT: also noticing that dscribe is also only supported on UNIX platforms, similar to XtalFinder
Ah, I hadn't noticed this either.
@mkhorton were you referring to e.g. [AverageKernel](https://singroup.github.io/dscribe/latest/tutorials/similarity_analysis/kernels.html#average-kernel) or e.g. Coulomb matrix? [...]
I've tried AverageKernel and REMatchKernel with the SOAP vectors; I would probably start with the latter. There are a lot of variables to tune with both the SOAP vectors themselves and the kernels, however.
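For anyone following along, the rough dscribe pattern looks like the sketch below (the bulk Cu structures, SOAP settings, and kernel parameters are placeholders, and parameter names such as r_cut/n_max/l_max vs. rcut/nmax/lmax depend on the dscribe version):

```python
from ase.build import bulk
from dscribe.descriptors import SOAP
from dscribe.kernels import REMatchKernel
from sklearn.preprocessing import normalize

# Placeholder structures and settings; species/r_cut/n_max/l_max need tuning per dataset.
structures = [bulk("Cu", "fcc", a=3.6), bulk("Cu", "bcc", a=2.9)]
soap = SOAP(species=["Cu"], r_cut=5.0, n_max=8, l_max=6, periodic=True)

features = [normalize(soap.create(s)) for s in structures]  # per-atom SOAP vectors
kernel = REMatchKernel(metric="rbf", gamma=1.0, alpha=1.0, threshold=1e-6)
similarity = kernel.create(features)  # entries near 1 suggest very similar structures
print(similarity)
```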
Perhaps the easiest form of pre-filtering is via comparing reduced chemical formulas. IIUC, use of ElementComparator in StructureMatcher means that if the reduced chemical formula doesn't match, there's no way for the structures to match. Since most of the chemical formulas won't match, the number of comparisons can be dramatically reduced. Seem sound?
Yep, I've also been using this in the past (and also tried to supplement it with heuristics based on Kabsch RMSD and density): https://github.com/kjappelbaum/structure_comp/blob/e903be460f9ed609814240a8b2b7c1702e232a48/structure_comp/remove_duplicates.py#L453. However, now I mostly fall back to the Weisfeiler-Lehman hashes (of the undirected quotient graph), which you can precompute. This gives you a small set of potential matches that you can then check with refined methods.
@kjappelbaum I think I'll try out the Weisfeiler-Lehman hashes.
@kjappelbaum I think I'll try out the Weisfeiler-Lehman hashes.
Happy to help out! Do we have a good test suite for this? Shall we go with the one for StructureMatcher?
@kjappelbaum that would be great! Please do. I think you have access through the xtal2png team. The tests are thorough, mainly in checking that outputs match hardcoded values for each of the three classes.
The tests are thorough, mainly in checking that outputs match hardcoded values for each of the three classes.
Cool. Currently I test on MOFs (that's my use case ;) -- see the tests here: https://github.com/kjappelbaum/mofchecker/blob/4cae2a27cb9cd5e953b4509d7329af83baa5e31b/tests/test_hash.py#L20
I know of some edge cases there; it would be good to find similar ones in the rest of materials space. The edge cases I currently know of mostly boil down to structure graphs that differ because of the heuristics used to compute them.
For now, going with compositional and structural fingerprint-based comparisons with thresholds is relatively fast (~15 min for 100 generated structures and 5 folds, compared with ~4 hrs from earlier). Planning to address one last bottleneck by loading precalculated space group numbers for the training and test structures instead of recalculating repeatedly.
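Roughly what the fingerprint route looks like, as a sketch (matminer's ElementProperty and CrystalNNFingerprint/SiteStatsFingerprint are one option, not necessarily what matbench-genmetrics ships, and the distance thresholds are arbitrary placeholders):

```python
import numpy as np
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.site import CrystalNNFingerprint
from matminer.featurizers.structure import SiteStatsFingerprint

comp_featurizer = ElementProperty.from_preset("magpie")
struct_featurizer = SiteStatsFingerprint(
    CrystalNNFingerprint.from_preset("ops"), stats=("mean", "std_dev")
)

def fingerprint_match(s1, s2, comp_tol=0.1, struct_tol=0.25):
    """Call two structures a match if both fingerprint distances fall below the
    (placeholder) thresholds. Fingerprints for the train/test sets can be
    precomputed and cached, which is where most of the speedup comes from."""
    c1, c2 = (np.array(comp_featurizer.featurize(s.composition)) for s in (s1, s2))
    v1, v2 = (np.array(struct_featurizer.featurize(s)) for s in (s1, s2))
    return (np.linalg.norm(c1 - c2) < comp_tol) and (np.linalg.norm(v1 - v2) < struct_tol)
```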
Really interesting discussion here, thanks for bringing me in @sgbaird. I have a few comments and clarifications.
StructureMatcher
Pro: More flexible, can associate structures which are very similar but with small, symmetry-breaking distortions For generative models, especially when symmetry isn't directly imposed, this seems like an important feature. Related: qzhu2017/PyXtal#199. Forcing models to respect a given symmetry seems like a better route but a lot less straightforward. I don't think even CDVAE, which I consider state-of-the-art for crystal generative models, enforces symmetry explicitly.
Yes, definitely flexibility can be a boon for generative models but I think the specificity/restrictiveness (isopointal grouping) of XtalFinder can be a strong advantage for use cases involving symmetry-dependent properties (topology-related phenomena, symmetry-breaking transitions, etc.).
XtalFinder
Con: Less flexible; associated structures must be in the same space group and have the same occupied Wyckoff sites
This is really good to know. I don't think I would have caught that nuance immediately. Kind of goes back to @mkhorton's #9 (comment).
I should also note here, despite what I said above, that XtalFinder's symmetry determination uses an adaptive tolerance (described in the AFLOW-SYM paper), which provides a small amount of built-in flexibility for small distortions. This means that an isopointal structure grouping generated by XtalFinder might actually consist of structures which would yield different space groups if one used spglib with the same absolute tolerance across the whole group. However, it seems like XtalFinder decides on a uniform space group to assign to these structures and essentially resymmetrizes them to fit in the isopointal grouping.
Con: Because of the aforementioned flexibility, matching corresponding Wyckoff sites between structures is nontrivial
Is the main issue here that it's computationally expensive?
Here I was just thinking that a group of structures with different symmetries would not have a nice 1-1 correspondence between Wyckoff sites for every structure. Of course, one could find a way around this, e.g., by resymmetrizing every structure to the highest symmetry, or by using group-subgroup relations to connect each high-symmetry parent Wyckoff site to its corresponding lower-symmetry children sites in the lower-symmetry structures (spglib could probably be used for this), but that might be computationally expensive.
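A sketch of how one might pull a coarse isopointal-style label (space group plus occupied Wyckoff letters) with pymatgen/spglib; this only extracts the labels and does not attempt the group-subgroup mapping described above, and the symprec is a placeholder:

```python
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def wyckoff_fingerprint(structure, symprec=0.1):
    """(space group number, sorted occupied Wyckoff symbols) as a coarse isopointal label."""
    sga = SpacegroupAnalyzer(structure, symprec=symprec)
    sym_struct = sga.get_symmetrized_structure()
    return sga.get_space_group_number(), tuple(sorted(sym_struct.wyckoff_symbols))
```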
Con: Doesn't assign unique/stable identifiers to the structure groups
Is the idea here that we get nice groupings of structures similar to how we can request all chemical formulas from a given chemical system? Is there any benefit in terms of speed-up or is the main benefit provenance and consistency?
I think you're spot on here. It's for provenance and consistency, not speed-up. The former is important for AFLOW's prototype encyclopedia.
For now, going with compositional and structural fingerprint-based comparisons with thresholds is relatively fast (~15 min for 100 generated structures and 5 folds, compared with ~4 hrs from earlier). Planning to address one last bottleneck by loading precalculated space group numbers for the training and test structures instead of recalculating repeatedly.
That speed-up is impressive. That's comparing to zero pre-screening, not even compositional, right? I'd be curious how much speed-up you get between compositional pre-screening and compositional+structural pre-screening.
I'd also love to look at how you use the Weisfeiler-Lehman hashes. Is the structure->graph transformation code publicly available? I've been grappling with the question of how to determine 'bonded' neighbors lately.
@kyledmiller thanks for clarifying! Really useful info.
That speed-up is impressive. That's comparing to zero pre-screening, not even compositional, right? I'd be curious how much speed-up you get between compositional pre-screening and compositional+structural pre-screening.
Correct, StructureMatcher without pre-screening (note that the StructureMatcher method would be maybe 2x faster if I had used group_structures instead: https://github.com/materialsproject/pymatgen/issues/2593#issuecomment-1205218522). However, this is with the caveat that the compositional and structural fingerprint featurization of the ~40k compounds was computed ahead of time (~3 wall-clock hours on 6 physical CPU cores / 12 threads; see figshare).
I'd also love to look at how you use the Weisfeiler-Lehman hashes. Is the structure->graph transformation code publicly available? I've been grappling with the question of how to determine 'bonded' neighbors lately.
@kjappelbaum gave an example of what that might look like for this repo. If the fingerprint method is sufficiently fast and robust/valid, we may not incorporate that here unless we switch back to StructureMatcher, XtalFinder, or similar.
pymatgen's structure matcher is relatively slow. It might make sense to "pre-screen" (or at least give the option to do so) based on some cheaper heuristic. If you consider identity to be a match of structure graphs, one might implement such a cheaper check via graph hashes (I did this, for instance, here: https://github.com/kjappelbaum/mofchecker, because for MOF unit cells the Structure Matcher is painfully slow). However, there are no theoretical guarantees: you rely on the bond heuristic, and the implementation is not 100% fool-proof at periodic boundaries (but this can be improved). Anyhow, it might also be interesting to have some finer-grained resolution on the matching:
https://github.com/sparks-baird/matbench-genmetrics/blob/5f26ff369ca4e2e6d13a8803d5835f551b6025fb/src/matbench_genmetrics/core.py#L69