hackingmaterials / matminer

Data mining for materials science
https://hackingmaterials.github.io/matminer/
Other
482 stars 194 forks source link

Recomputing the Voronoi Tessalation Repeatedly #708

Closed CompRhys closed 3 years ago

CompRhys commented 3 years ago

I am trying to compare a new model to the ward-voronoi features on ~300k materials. Looking into the code it appears that each sub-featurizer of the feature set recalculated the voronoi tessalation itself. This means that the code runs ~5x slower than it might otherwise do.

I might be missing something but is there anyway that this can be restructured to allow or pre-computation? i.e. avoid calling this nn = get_nearest_neighbors(VoronoiNN(weight=self.weight), strc, idx) repeatedly by calculating it once and then passing around the nn object to the featurizers?

@WardLT

WardLT commented 3 years ago

Hello @CompRhys ,

We've implemented a caching strategy that stores the last Voronoi tessellations so we need not repeat overfeatures. If you use the MultipleFeaturizer to compute features (example with an old version of matminer here: link), it'll loop over entries in such a way that you use the cache.

As a heads-up, it still may take an hour or more to compute features for all 300k.

CompRhys commented 3 years ago

opps! didn't find that.

Currently it says that it will take closer to 300 hours wall time which is why I wanted to look at whether it could be sped up. I will just have to look for more CPUs that I can get access to.

CompRhys commented 3 years ago

profiling the code it does look as though only 100 calls are made to get_all_voronoi_polyhedra so it appears that's there's not scope for speeding it up without making changes in pymatgen.

         73142916 function calls (69961005 primitive calls) in 138.403 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      858   24.021    0.028   55.424    0.065 local_env.py:672(get_voronoi_polyhedra)
197629/2529   22.805    0.000   25.235    0.010 local_env.py:423(_get_nn_shell_info)
     1710   16.913    0.010   32.574    0.019 local_env.py:802(_extract_cell_info)
       99    4.457    0.045   24.723    0.250 local_env.py:734(get_all_voronoi_polyhedra)
3209254/1660813    3.720    0.000   26.915    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       97    3.558    0.037    5.486    0.057 structure.py:1358(get_all_neighbors)
   152044    3.036    0.000    9.637    0.000 numeric.py:1452(cross)
  1200164    2.831    0.000    2.831    0.000 {built-in method numpy.array}
  2422174    2.414    0.000    2.414    0.000 structure.py:140(__getitem__)
      100    2.401    0.024   57.729    0.577 order.py:202(<listcomp>)
   969570    1.972    0.000    1.972    0.000 structure.py:100(__init__)
      858    1.820    0.002    5.734    0.007 structure.py:1142(get_sites_in_sphere)
   912652    1.786    0.000    2.697    0.000 numeric.py:1308(normalize_axis_tuple)
   551068    1.751    0.000    1.751    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   456326    1.697    0.000    5.182    0.000 numeric.py:1371(moveaxis)
   164086    1.611    0.000    4.078    0.000 numeric.py:2274(within_tol)
   164086    1.210    0.000    9.593    0.000 numeric.py:2197(isclose)
  2810931    1.123    0.000    1.123    0.000 sites.py:82(species)
    24552    1.074    0.000    8.504    0.000 local_env.py:1897(solid_angle)
  4174607    1.043    0.000    1.359    0.000 {method 'get' of 'dict' objects}
   348518    1.033    0.000    4.563    0.000 composition.py:165(__eq__)
ardunn commented 3 years ago

@CompRhys do you have any ideas about how to implement this in pymatgen? or is there already a ticket open on pmg repo?

CompRhys commented 3 years ago

I just that I missed that matminer was already caching, I think there's not actually anything that can be done to speed this up apart from getting a better scipy implementation for periodic voronoi diagrams (see this issue I've had for a while https://github.com/scipy/scipy/issues/11858#issue-600141315).

I normally use matminer on my dual core laptop from 2015 which is why it seemed unbearably slow which made me think it must be recomputing hence the issue. In the end I moved it over to the cluster and booked ~250 cores it only took ~ 2 real hours which was fine.