openforcefield / smarty

Chemical perception tree automated exploration tool.
http://openforcefield.org
MIT License

Potential speed up for smirky by removing chemical environment tracking #251

Open bannanc opened 7 years ago

bannanc commented 7 years ago

I will start by saying I'm not sure if this is worth it at this point in the project, but this is definitely worth thinking about in the long run.

Right now, smirky keeps a list of chemical environments for the parameter it is sampling. It occurred to me yesterday that this is probably fairly memory intensive and unnecessary, since it is easy to go from SMIRKS to ChemicalEnvironments and back to SMIRKS. Would removing the chemical environment storage reduce the computational time? We could instead store only the SMIRKS strings (and typenames) in the parameter list. Then the create_new_environment function could take a SMIRKS string and typename, generate an environment, make changes, and return a new SMIRKS string and typename. It seems like this might speed things up significantly: right now we keep a ChemicalEnvironment for every parameter, which number in the hundreds for torsions, and each environment is a complicated NetworkX graph carrying a lot of information that could be tracked as just a SMIRKS string.
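
To make the proposal concrete, here is a minimal sketch of string-only storage. `ToyEnvironment`, its `from_smirks`/`as_smirks` methods, the `modify` callback, and the `_child` naming are all hypothetical stand-ins for illustration; they are not the real ChemicalEnvironment API or smirky's actual `create_new_environment` signature.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for a ChemicalEnvironment (not the real class)."""
    atoms: list = field(default_factory=list)

    @classmethod
    def from_smirks(cls, smirks):
        # Toy parser: only handles atoms joined by the "any bond" symbol "~".
        return cls(atoms=smirks.split("~"))

    def as_smirks(self):
        return "~".join(self.atoms)


def create_new_environment(smirks, typename, modify):
    """Build a temporary environment, change it, and return only strings."""
    env = ToyEnvironment.from_smirks(smirks)
    modify(env)  # the proposed change, e.g. one move of the sampler
    # The environment object is discarded here; only strings are kept.
    return env.as_smirks(), typename + "_child"  # placeholder naming scheme


def add_atom(env):
    # Illustrative move: attach one more generic atom to the end.
    env.atoms.append("[*]")


# The parameter list holds (SMIRKS, typename) pairs instead of graph objects.
parameters = [("[*:1]~[*:2]~[*:3]~[*:4]", "generic")]
new_smirks, new_name = create_new_environment(*parameters[0], add_atom)
parameters.append((new_smirks, new_name))
```

The trade-off is that every proposed move pays for one parse and one serialization, in exchange for not holding a graph object per parameter between moves.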

@davidlmobley

davidlmobley commented 7 years ago

Is there anything quick you can do to test how much this would affect speed? Naively it seems like it would not be worth it at this point for SMIRKY for the paper, but will be helpful down the line. But I guess that depends on how much additional simulation you still need to do and how much the speed difference is.

bannanc commented 7 years ago

The fastest test I can foresee is writing two simple scripts that store lists over many steps, where the list holds either SMIRKS strings or chemical environments, and then seeing whether there is a substantial time difference. I'm not sure this is worth the effort now, but it seems like we could potentially be doing A LOT more smirky simulations unless we find out why smarty isn't getting 100%.
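
A test along these lines might look roughly like the sketch below. It is not the script that was actually run: the nested-dict `make_toy_environment` is a hypothetical stand-in for a graph-based environment, and the choices to snapshot the list with `copy.deepcopy` at every step and to add an object every 10 steps are assumptions about what "storing lists for many steps" means.

```python
import copy
import time


def make_toy_environment(n_atoms=4):
    """Hypothetical stand-in for a graph-based environment (nested dicts)."""
    return {
        "atoms": {i: {"or_types": ["#6", "#7"], "and_types": ["X4"]}
                  for i in range(n_atoms)},
        "bonds": {(i, i + 1): {"or_types": ["-"], "and_types": []}
                  for i in range(n_atoms - 1)},
    }


def time_storage(make_item, n_start, n_iterations, add_every=10):
    """Time keeping a per-step snapshot of a growing list; returns minutes."""
    current = [make_item() for _ in range(n_start)]
    history = []
    start = time.perf_counter()
    for step in range(n_iterations):
        if step % add_every == 0:  # sometimes add a new object
            current.append(make_item())
        history.append(copy.deepcopy(current))  # assumed storage pattern
    return (time.perf_counter() - start) / 60.0


# Starting list sizes follow the generic/short/long cases (1, 10, 82 torsions).
for label, n_start in [("generic", 1), ("short", 10), ("long", 82)]:
    t_str = time_storage(lambda: "[*:1]~[*:2]~[*:3]~[*:4]", n_start, 100)
    t_env = time_storage(make_toy_environment, n_start, 100)
    print(f"{label:>20}    {t_str:.2e}    {t_env:.2e}    {t_env - t_str:.2e}")
```

Absolute timings from a stand-in like this say little about real ChemicalEnvironments; only the trend with list length and iteration count is meaningful.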

davidlmobley commented 7 years ago

@bannanc - I think it's worth trying to test the speed difference if you can do it without a huge investment of time.

bannanc commented 7 years ago

I forgot to add my notes here. Below I have data for a list where I sometimes added a new object. The first column is the time for storing SMIRKS strings, the second is the time for storing chemical environments, and the last is the difference (environment column minus string column). All times are in minutes. For generic, I started the simulation with only a generic torsion; short starts with 10 torsions, and long starts with 82 torsions from a recent smirky run.

------------------------------  2 Iterations  ------------------------------
               short    1.97e-05    6.54e-05    4.57e-05
                long    1.93e-05    4.58e-04    4.39e-04
             generic    1.34e-05    1.82e-05    4.84e-06

------------------------------  10 Iterations  ------------------------------
               short    7.12e-05    1.16e-04    4.53e-05
                long    8.27e-05    5.40e-04    4.58e-04
             generic    6.60e-05    6.47e-05    -1.23e-06

------------------------------  100 Iterations  ------------------------------
               short    6.19e-04    7.01e-04    8.20e-05
                long    7.44e-04    1.36e-03    6.12e-04
             generic    5.49e-04    6.28e-04    7.92e-05

------------------------------  1000 Iterations  ------------------------------
               short    7.59e-03    1.73e-02    9.76e-03
                long    8.42e-03    2.10e-02    1.26e-02
             generic    6.89e-03    1.61e-02    9.20e-03

------------------------------  10000 Iterations  ------------------------------
               short    8.89e-02    1.09e+00    9.98e-01
                long    9.37e-02    1.17e+00    1.08e+00
             generic    7.18e-02    1.12e+00    1.05e+00

------------------------------  30000 Iterations  ------------------------------
               short    3.61e-01    1.04e+01    1.00e+01
                long    4.51e-01    1.08e+01    1.04e+01
             generic    3.13e-01    1.01e+01    9.76e+00

In this example I only let the list of SMIRKS/environments get longer. The time difference doesn't seem so terrible on the time scale we use for smirky (10,000 iterations, with both adding and removing). I looked a little at time/iteration: with strings it is pretty consistently 5-10E-06 minutes. With chemical environments there is less consistency, but time/iteration seems to grow with the number of iterations, which probably means that time/iteration is more sensitive to the length of the list when it is storing chemical environments.
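
The time/iteration comparison can be reproduced from the table above. The snippet below is only a worked-arithmetic sketch: it copies the "short" rows and divides each total by its iteration count, and the variable names are illustrative.

```python
# Total times in minutes for the "short" starting list, copied from the table:
# iterations -> (SMIRKS strings, chemical environments)
short_times = {
    2: (1.97e-05, 6.54e-05),
    10: (7.12e-05, 1.16e-04),
    100: (6.19e-04, 7.01e-04),
    1000: (7.59e-03, 1.73e-02),
    10000: (8.89e-02, 1.09e+00),
    30000: (3.61e-01, 1.04e+01),
}

# Divide each total by the number of iterations to get minutes per iteration.
per_iteration = {
    n: (t_str / n, t_env / n) for n, (t_str, t_env) in short_times.items()
}

for n, (s, e) in per_iteration.items():
    print(f"{n:>6} iterations: strings {s:.2e} min/iter, "
          f"environments {e:.2e} min/iter")
```

For strings the per-iteration cost stays near 1e-05 minutes or below across all rows, while for environments it rises from roughly 7e-06 at 100 iterations to above 3e-04 at 30000.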

I don't think it is worth the effort to re-code smirky now, but I think it is a reasonable thing to remember as we move forward and we may need to store information about the chemical perception for longer while sampling both the SMIRKS patterns and the parameters.

bannanc commented 7 years ago

I have the Jupyter notebook I used in my personal documents on Google Drive, but I can put it somewhere public or in the utilities here if we want.

davidlmobley commented 7 years ago

Thanks for checking this, @bannanc . I think for the record you want to share your notebook so that when someone revisits this, they can pick up from where you left off from this info.

bannanc commented 7 years ago

@davidlmobley should I just put that notebook in utilities here?

davidlmobley commented 7 years ago

Sounds good.