Hello @CompRhys,
We've implemented a caching strategy that stores the most recent Voronoi tessellations so we need not repeat them across featurizers. If you use the MultipleFeaturizer to compute features (example with an old version of matminer here: link), it'll loop over entries in such a way that you use the cache.
As a heads-up, it still may take an hour or more to compute features for all 300k.
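For illustration, a minimal sketch of the MultipleFeaturizer usage described above, assuming a pandas DataFrame `df` with a `structure` column of pymatgen Structures; the featurizer selection and the Ward PRB 2017 preset name are illustrative and may differ between matminer versions:

```python
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.structure import (
    ChemicalOrdering,
    MaximumPackingEfficiency,
    SiteStatsFingerprint,
    StructuralHeterogeneity,
)

# Grouping the featurizers in a single MultipleFeaturizer means each
# structure's Voronoi tessellation is computed once and then served from
# the cache for the remaining featurizers.
featurizer = MultipleFeaturizer([
    SiteStatsFingerprint.from_preset("CoordinationNumber_ward-prb-2017"),
    StructuralHeterogeneity(),
    ChemicalOrdering(),
    MaximumPackingEfficiency(),
])

df = featurizer.featurize_dataframe(df, col_id="structure")
```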
Oops! I didn't find that.
Currently it says that it will take closer to 300 hours of wall time, which is why I wanted to look at whether it could be sped up. I will just have to look for more CPUs that I can get access to.
Profiling the code, it does look as though only 100 calls are made to get_all_voronoi_polyhedra, so it appears that there's no scope for speeding it up without making changes in pymatgen:
73142916 function calls (69961005 primitive calls) in 138.403 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
858 24.021 0.028 55.424 0.065 local_env.py:672(get_voronoi_polyhedra)
197629/2529 22.805 0.000 25.235 0.010 local_env.py:423(_get_nn_shell_info)
1710 16.913 0.010 32.574 0.019 local_env.py:802(_extract_cell_info)
99 4.457 0.045 24.723 0.250 local_env.py:734(get_all_voronoi_polyhedra)
3209254/1660813 3.720 0.000 26.915 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
97 3.558 0.037 5.486 0.057 structure.py:1358(get_all_neighbors)
152044 3.036 0.000 9.637 0.000 numeric.py:1452(cross)
1200164 2.831 0.000 2.831 0.000 {built-in method numpy.array}
2422174 2.414 0.000 2.414 0.000 structure.py:140(__getitem__)
100 2.401 0.024 57.729 0.577 order.py:202(<listcomp>)
969570 1.972 0.000 1.972 0.000 structure.py:100(__init__)
858 1.820 0.002 5.734 0.007 structure.py:1142(get_sites_in_sphere)
912652 1.786 0.000 2.697 0.000 numeric.py:1308(normalize_axis_tuple)
551068 1.751 0.000 1.751 0.000 {method 'reduce' of 'numpy.ufunc' objects}
456326 1.697 0.000 5.182 0.000 numeric.py:1371(moveaxis)
164086 1.611 0.000 4.078 0.000 numeric.py:2274(within_tol)
164086 1.210 0.000 9.593 0.000 numeric.py:2197(isclose)
2810931 1.123 0.000 1.123 0.000 sites.py:82(species)
24552 1.074 0.000 8.504 0.000 local_env.py:1897(solid_angle)
4174607 1.043 0.000 1.359 0.000 {method 'get' of 'dict' objects}
348518 1.033 0.000 4.563 0.000 composition.py:165(__eq__)
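For reference, a profile like the one above can be generated with something along these lines; the toy structure here is only a stand-in for a real list of pymatgen Structures, and the featurizer choice is illustrative:

```python
import cProfile
import pstats

from matminer.featurizers.structure import StructuralHeterogeneity
from pymatgen.core import Lattice, Structure

# Placeholder input: in practice this would be ~100 distinct structures.
cscl = Structure(Lattice.cubic(4.2), ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
structures = [cscl] * 100

featurizer = StructuralHeterogeneity()

profiler = cProfile.Profile()
profiler.enable()
for strc in structures:
    featurizer.featurize(strc)
profiler.disable()

# Sort by internal time, matching the table above.
pstats.Stats(profiler).sort_stats("tottime").print_stats(25)
```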
@CompRhys do you have any ideas about how to implement this in pymatgen? Or is there already a ticket open on the pmg repo?
It's just that I missed that matminer was already caching. I think there's not actually anything that can be done to speed this up, apart from getting a better scipy implementation for periodic Voronoi diagrams (see this issue I've had open for a while: https://github.com/scipy/scipy/issues/11858#issue-600141315).
I normally use matminer on my dual-core laptop from 2015, which is why it seemed unbearably slow and made me think it must be recomputing, hence the issue. In the end I moved it over to the cluster and booked ~250 cores; it only took ~2 hours of wall time, which was fine.
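For anyone else hitting this: matminer featurizers can also fan out over local cores via the BaseFeaturizer.set_n_jobs API. A minimal sketch (the featurizer choice is again illustrative):

```python
import os

from matminer.featurizers.structure import StructuralHeterogeneity

featurizer = StructuralHeterogeneity()

# Spread featurization over all available local cores; matminer uses
# multiprocessing under the hood when n_jobs > 1.
featurizer.set_n_jobs(os.cpu_count())

# df = featurizer.featurize_dataframe(df, col_id="structure")
```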
I am trying to compare a new model to the Ward Voronoi features on ~300k materials. Looking into the code, it appears that each sub-featurizer of the feature set recalculates the Voronoi tessellation itself. This means that the code runs ~5x slower than it might otherwise do.
I might be missing something, but is there any way that this can be restructured to allow for pre-computation? i.e. avoid calling this
nn = get_nearest_neighbors(VoronoiNN(weight=self.weight), strc, idx)
repeatedly, by calculating it once and then passing around the nn object to the featurizers? @WardLT
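Roughly what I have in mind, as a sketch only: compute the Voronoi neighbor info once per structure with pymatgen's VoronoiNN.get_all_nn_info and hand it to each featurizer. The featurize_precomputed method below is hypothetical, not part of the current API:

```python
from pymatgen.analysis.local_env import VoronoiNN

def featurize_all(structure, featurizers, weight="solid_angle"):
    # Compute the Voronoi tessellation once per structure...
    nn = VoronoiNN(weight=weight).get_all_nn_info(structure)
    # ...and pass the precomputed neighbor info to every featurizer,
    # instead of each one re-tessellating internally.
    # NOTE: featurize_precomputed is hypothetical; today's featurizers
    # only expose featurize(structure).
    return [f.featurize_precomputed(structure, nn) for f in featurizers]
```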