Removing time bottlenecks for Desmond parsing

mrshirts commented 7 years ago

Seems like match_pairs, match_angles, and match_dihedrals are really slowing things down.
Looking at a large file, running cProfile, we get at the top.

     112706236 function calls (108430061 primitive calls) in 1582.537 seconds

  Ordered by: internal time

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  71934  693.931    0.010  693.931    0.010 moleculetype.py:35(_match_two_atoms)
  32632  518.931    0.016  518.931    0.016 moleculetype.py:53(_match_four_atoms)
  31051  245.683    0.008  245.683    0.008 moleculetype.py:44(_match_three_atoms)
  1780064   19.714    0.000   21.152    0.000 shlex.py:120(read_token)
 1305782   14.588    0.000   15.055    0.000 {eval}
 4439553/856244    8.307    0.000   19.878    0.000 copy.py:145(deepcopy)
 3164000    3.763    0.000    5.607    0.000 quantity.py:97(__init__)
 1652604    3.199    0.000    4.765    0.000 unit.py:300(is_compatible)
 15908286    2.959    0.000    2.962    0.000 unit.py:195(__hash__)
 2888806    2.944    0.000    3.451    0.000 copy.py:267(_keep_alive)
 1    2.889    2.889   35.511   35.511 desmond_parser.py:1686(write)

So _match_two_atoms, _match_three_atoms, _match_four_atoms take 1459/1583 = 92% of the time. _match_two_atoms, most of the time is in match_pairs instead of match_bonds, match_angles is _match_three_atoms, match_dihedrals is _match_four_atoms.

Not sure what the solution is, but just reporting.

ctk3b commented 7 years ago

Yeah this is just being called way too many times to be implemented as a double for-loop. I think we can pretty easily reorganize the data so that this becomes a set lookup or the atoms are keys in a dict.

mrshirts commented 7 years ago

What would the pseudocode look for this for two bonds? We explicitly add to each force the list of bonds or pairs? I can take a pass if you have explicit ideas.

ctk3b commented 7 years ago

I think the easiest way will be to build up an intermediate bond dictionary in the desmond parser that is keyed by a tuple of the atoms (probably by indices).

So specifically, here load them into a dict keyed by those indices and then a few lines down when you're finding matches you just do old_bond = intermediate_bond_dict[(new_bond.atom1.index, new_bond.atom2.index)]

After all are parsed and filtered, add the dict values to the bond_force set

mrshirts commented 7 years ago

Lists aren't hashable, though? Or am I misunderstanding something?

mrshirts commented 7 years ago

I guess could be done as tuple. . .

ctk3b commented 7 years ago

Yes, the keys would be a tuple of indices

shirtsgroup / InterMol

Removing time bottlenecks for Desmond parsing #339