Extensions to Topology for getting number of atoms, residues, and chains

jchodera commented 8 years ago

It would be useful to extend Topology to retrieve the number of atoms, residues, and chains.

The current idioms are

nchains = sum(1 for c in topology.chains())
nres = sum(1 for r in topology.residues())
natom = sum(1 for a in topology.atoms())

The most OpenMM-like extension would be to add

nchains = topology.getNumChains()
nres = topology.getNumResidues()
natom = topology.getNumAtoms()

but it would also be useful to add the following @property attributes:

nchains = topology.nchains
nres = topology.nres
natom = topology.natoms

These are all easy for me to add in a PR, but I wanted to ask to see if there was some utility in matching an existing interface for an OpenMM-like Topology, such as the one in MDTraj or another tool.

Any thoughts?

rmcgibbo commented 8 years ago

I don't have a strong opinion, but this seems like a fine idea. My weak preference would be for the getNumChains-style spelling.

jchodera commented 8 years ago

Great.

My thought was to add both the topology.getNumChains()style getter and topology.nchains style property.

I'll create a PR unless there are objections.

rmcgibbo commented 8 years ago

I would just go with one. Redundancy like that is confusing.

swails commented 8 years ago

My thought was to add both the topology.getNumChains()style getter and topology.nchains style property.

-1 on both:

bash-4.3$ python -m this:

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it. <----
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

I like the getNumChains() better as well -- it's not as Pythonic, but it's more OpenMM-like.

jchodera commented 8 years ago

I'm not sure there is any harm to be done by adding more functionality that is more Pythonic on top of the unpythonic OpenMM-style getters, but perhaps that is best left for a more global pythonic overhaul of the Python API (where we don't have to remove functionality to add more pythonic mechanisms of data access).

jchodera commented 8 years ago

Redundancy like that is confusing.

Isn't numpy an example of an incredibly useful library where there are like six ways to do anything via the API?

swails commented 8 years ago

Isn't numpy an example of an incredibly useful library where there are like six ways to do anything via the API?

A couple observations:

Just because a library is useful doesn't make all of its design decisions good ones.
What you're proposing here is introducing synonymous functionality at the same time in which the only purpose of each function is exactly the same. The examples in numpy are typically complex, added at different times, and often expose different ways of accomplishing the same goals because mathematics itself has redundancies.

numpy development has a long and rich history, and they often need to keep an "inferior" way of doing things around to maintain backwards compatibility, even when a better alternative exists (with a DeprecationWarning). There are also frequently performance reasons for implementing specific "versions" of doing the same thing (e.g., diagonalizing a sparse vs. dense vs. symmetric matrix).

rmcgibbo commented 8 years ago

Sure, but they are also trying to remove a lot of the needless redundancy that they've accumulated over the years -- backward compatibility is a countervailing factor though :). I'm not trying to propose some overarching rule, I'm just ¯(ツ)/¯ on adding two identical-but-differently-spelled methods/properties on simtk.openmm.app.Topology.

rmcgibbo commented 8 years ago

If ya'll havn't seen it, this is an interesting discussion https://mail.scipy.org/pipermail/numpy-discussion/2015-July/073205.html that involves Nathaniel coming up with crazy ctypes interpeter hacks just to be able to add deprecation warnings to module-level globals.

swails commented 8 years ago

I'm not sure there is any harm to be done by adding more functionality that is more Pythonic on top of the unpythonic OpenMM-style getters, but perhaps that is best left for a more global pythonic overhaul of the Python API (where we don't have to remove functionality to add more pythonic mechanisms of data access).

I think this would do a disservice to many and OpenMM as a whole. One of the things that has helped me personally with OpenMM is that I learned OpenMM in Python, and was immediately able to start being productive in C++ for something that necessitated the speedup. Why make people learn two different APIs for the same package when there's already one that's identical?

rmcgibbo commented 8 years ago

swails commented 8 years ago

Note if this is only referring to the app layer (and everything SWIG-generated is left set in stone), my opposition is not as strong.

I still think PEP 8 and the Zen are reasonably definitive in favor of consistency and no-simple-synonyms, but as long as Python code can be translated almost exactly to C++ wherever the APIs overlap, the most important places to keep things consistent are preserved.

jchodera commented 8 years ago

Why make people learn two different APIs for the same package when there's already one that's identical?

That's kind of like saying "Why make people learn Python when there's already C++?" :)

The answer, of course, is that certain tasks are easier to accomplish in one than in another.

For example, I'd love a way to access information about atom index in a more pythonic way:

name = top.atoms[index].name

versus the current simtk.openmm.app.Topology way:

counter = 0
name = None
for atom in top.atoms():
   counter += 1
   if counter == index:
      name = atom.name
      break

or maybe

atoms = [ atom for atom in top.atoms() ] # why do I have to build this myself every time?
name = atoms[index].name

There are sometimes strong arguments to be made for having multiple ways to do something because it can save a lot of time down the road, not just in writing an implementation, but in figuring out which idiom is appropriate. I'm not saying that this is one of those cases in which a strong argument can be made---just that "there's already one way to do it, so why would you want another?" seems like a needlessly self-limiting philosophy.

Because of the disagreement here, I've created a PR that just implements the API calls everyone can agree on.

jchodera commented 8 years ago

The numpy stuff is fascinating, though! I had no idea...

swails commented 8 years ago

That's kind of like saying "Why make people learn Python when there's already C++?" :)

No it's not.

The answer, of course, is that certain tasks are easier to accomplish in one than in another.

And this is why :). top.nchains is not easier or more user-friendly than top.getNumChains() in a library that clearly does not care about the 7 extra keystrokes (see, e.g., AmoebaGeneralizedKirkwoodForce).

For example, I'd love a way to access information about atom index in a more pythonic way:
name = top.atoms[index].name
versus the current simtk.openmm.app.Topology way:
counter = 0
name = None
for atom in top.atoms():
   counter += 1
   if counter == index:
      name = atom.name
      break
or maybe
atoms = [ atom for atom in top.atoms() ] # why do I have to build this myself every time?
name = atoms[index].name

But this is completely different! These are different functionalities, with very different behaviors. And they solve real problems. Iterating through from the beginning to find the desired atom because you need to negotiate a generator can be highly inefficient -- that's O(N^2) when selecting an unordered set of atoms in a naive way compared to O(N) if it was a flat list (or even an indexable generator like range). The workarounds that are needed to achieve that benefit effectively require caching the lists, which carries with them all the risks of relying on a cache that doesn't know when it's out-of-date and needs to be regenerated. That's why linked lists and arrays both exist, even though you can do the same things with both.

Compare that to the proposal here:

assert topology.getNumChains() == topology.nchains

Because one looks "prettier" without the camelCase, "get", and the obvious function call indicated by (). Both have the same complexity and are implemented in exactly the same way -- it's just that the function call for the descriptor is buried inside the object implementation. Literally the only difference is that the property approach eliminates the possible mistake of leaving out the ().

The reason this (bikeshedding) issue is so important to me is that learning OpenMM is still a fairly recent experience (for me). I learned it by reading all kinds of code. Modifying existing code. Playing with it. While not "Pythonic", the API is clean and simple, which made it easy to learn. It would not have been any easier to learn for me had getters and setters been replaced by descriptors -- it would've just given me a warmer, fuzzier feeling inside ;). I see duplicity like this as one of the first steps toward a ballooning complexity that impedes learning-by-reading-source-code (after all, the whole mantra of Python is that source code is read more often than written).

There are times when forces combine to make overlapping functionality in different parts of the API unavoidable (see numpy, for example), but if we don't eliminate obvious redundancies like t.getNumChains() and t.nchains, I don't see it ever happening.

peastman commented 8 years ago

Although that way may not be obvious at first unless you're Dutch.

Huh?

atoms = [ atom for atom in top.atoms() ] # why do I have to build this myself every time?

Or more concisely,

atoms = list(top.atoms())

swails commented 8 years ago

Although that way may not be obvious at first unless you're Dutch. Huh?

GvR is Dutch. I think it's a clever way of saying that it really only matters what GvR thinks is obvious since he's BDFL.

swails commented 8 years ago

Of course, in jest, not in seriousness (although he's the tiebreaking vote, I would presume) -- he doesn't strike me as one to presume he knows best for everything.

peastman commented 8 years ago

Ah, I get it!

peastman commented 8 years ago

This was implemented in #1210.

openmm / openmm

Extensions to Topology for getting number of atoms, residues, and chains #1207