markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

Suggestion: Coordinates API #167

Closed · clonker closed this issue 9 years ago

clonker commented 9 years ago

Suggestion about the API design: since the API is what non-developers use, it should be easy to understand and leave little room for mistakes. A suggestion is to have a construction like:

featurizer = Featurizer("test.pdb")
featurizer.add_angles(indices)

pipeline = Pipeline("input.xtc")
# actually sets the featurizer of the reader, just moved to top-level api
pipeline.set_featurizer(featurizer)

pipeline.add(api.transform.TICA, lag=2)
pipeline.add(api.clustering.kmeans, ncenters=1000)

pipeline.perform()

cov = pipeline.tica.cov
centers = pipeline.kmeans.cluster_centers

pipeline.add(api.msm.hmm, states=5)
pipeline.perform()

transition_matrix = pipeline.hmm.transition_matrix
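To make the proposed add()/perform() behaviour concrete, here is a minimal sketch of what such a Pipeline could look like internally. Everything here is illustrative only; the fit_transform protocol and the input handling are assumptions, not existing PyEMMA code.

# Illustrative sketch only: fit_transform and the input handling are assumed
# for the sake of discussion, not taken from the existing PyEMMA code base.
class Pipeline:
    def __init__(self, files, featurizer=None):
        # accept a single file name or a list of files, mirroring load()
        self.files = [files] if isinstance(files, str) else list(files)
        self.featurizer = featurizer
        self._stages = []

    def set_featurizer(self, featurizer):
        self.featurizer = featurizer

    def add(self, transformer_cls, **kwargs):
        # store the class and its kwargs; instantiation is deferred until perform()
        self._stages.append((transformer_cls, kwargs))

    def perform(self):
        data = self._read_input()
        for cls, kwargs in self._stages:
            stage = cls(**kwargs)
            data = stage.fit_transform(data)            # assumed transformer protocol
            setattr(self, cls.__name__.lower(), stage)  # e.g. pipeline.tica, pipeline.kmeans

    def _read_input(self):
        # placeholder: read self.files and apply self.featurizer
        raise NotImplementedError

With this shape, the attributes pipeline.tica and pipeline.kmeans used above simply fall out of the setattr() call in perform().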
franknoe commented 9 years ago

Please find my comments inline:

featurizer = Featurizer("test.pdb")
featurizer.add_angles(indices)
# FN: I agree

pipeline = Pipeline("input.xtc")
pipeline.set_featurizer(featurizer)
# FN: I like the idea of initializing the Pipeline with the input files. But we could pass the
# featurizer with the constructor. Arguments should be identical to the load() function
# (see my API suggestion notebook), i.e. the following constructs should work:
# p = coor.pipeline('in.xtc')     # one xtc trajectory with the default featurizer (plain coords)
# p = coor.pipeline('in.xtc', featurizer)     # one xtc trajectory with a featurizer
# p = coor.pipeline(['1.xtc','2.xtc'], featurizer)    # multiple trajectories
# p = coor.pipeline(['1.dat','2.dat'])    # read features from tabulated ascii files

pipeline.add(api.transform.TICA, lag=2)
pipeline.add(api.clustering.kmeans, ncenters=1000)
# FN: I like the idea. Avoids the step of separately creating objects. Also very general
# The downside is you have to do more in the next step when retrieving results. If you 
# had created an object named tica, you could directly access it after the perform() 
# command. So I'm not sure if it pays off.
# If we do it like this: to make clear how to use it, we should make all arguments of the
# in-memory functions and object initializations kwargs, and then allow identical kwargs
# here (just pass them on).
# We could also have keywords for the native objects 'tica', 'cluster_kmeans', e.g.:
# pipeline.add('tica', lag=2)
# This would make things more dynamic, i.e. a user could more easily pass arguments
# to a script in order to keep e.g. the clustering method flexible.

pipeline.perform()
# FN: I find the name confusing because it only fits the parameters, but does not do the 
# mapping. What about fit() or parametrize()? perform() is unclear.

cov = pipeline.tica.cov
centers = pipeline.kmeans.cluster_centers
# FN: so this means you are, upon add(), dynamically creating a variable with the name
# of the class, right? This is a very nice idea. But it should be unambiguous which variable
# names to expect. For example, class name TICA and variable name tica is confusing.
# An alternative would be that the user can define a name when adding, and that could
# be either mapped to a variable name or to a dict key, e.g. after
# pipeline.add('tica', api.transform.TICA, lag=2) one accesses with
# pipeline.get('tica').cov, or pipeline['tica'].cov

pipeline.add(api.msm.hmm, states=5)
# FN: not sure if hmm/msm should be pipeline objects, because they might have a fit(), 
# but not a map(), and they differ from transformer objects in that they don't take
# ndarrays as input.
# But it's certainly possible to do it that way. We can also enable things to be 
# pipeline objects later.

pipeline.perform()
# FN: see above

transition_matrix = pipeline.hmm.transition_matrix

# FN: Open question: How do we get actual data out? 
# For example, I want the projected tica coordinates, and I don't simply want to
# create a writer but get the result in memory.
# example: pipeline.tica.get_mapped() or .get_output()
# Note that I am using the get_ prefix here because this is not an attribute but a (heavy) function.
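
To make the two suggestions above concrete (user-defined stage names and a heavy call for in-memory output), here is a small hypothetical sketch; fit_transform is again an assumed transformer protocol and none of these names are fixed API:

# Hypothetical sketch: stages are registered under user-chosen names, the fitted
# objects are reachable via get() or item access, and outputs stay available in memory.
class NamedPipeline:
    def __init__(self, data):
        self._data = data      # input features, e.g. a list of ndarrays
        self._stages = []      # (name, class, kwargs) in insertion order
        self._fitted = {}
        self._outputs = {}

    def add(self, name, transformer_cls, **kwargs):
        self._stages.append((name, transformer_cls, kwargs))

    def parametrize(self):     # one of the names suggested above instead of perform()
        data = self._data
        for name, cls, kwargs in self._stages:
            stage = cls(**kwargs)
            data = stage.fit_transform(data)   # assumed transformer protocol
            self._fitted[name] = stage
            self._outputs[name] = data

    def get(self, name):         # pipeline.get('tica').cov
        return self._fitted[name]

    __getitem__ = get            # pipeline['tica'].cov

    def get_output(self, name):  # heavy call: the mapped data of the named stage
        return self._outputs[name]

Usage would then read p.add('tica', api.transform.TICA, lag=2); p.parametrize(); cov = p['tica'].cov; Y = p.get_output('tica').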
marscher commented 9 years ago

branch "refactor_discretizer" is the place where we will implement new stuff related to the api.

franknoe commented 9 years ago

Thanks! Working on it now.


franknoe commented 9 years ago

Here we go:

franknoe commented 9 years ago
marscher commented 9 years ago

We should think of a new name for the "input" API function, since it shadows a Python builtin.
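
A tiny generic illustration of the shadowing problem (the function below is a stand-in, not PyEMMA code):

# Generic illustration only: a module-level function called input() shadows the
# builtin for anyone working inside that module or using a star import.
def input(files):
    # stand-in for the API function under discussion
    return list(files) if isinstance(files, (list, tuple)) else [files]

reader = input('traj.xtc')            # calls the API function, as intended
answer = input('Continue? [y/n] ')    # silently wraps the prompt string instead of asking the user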

franknoe commented 9 years ago

I am closing this issue because it is done, apart from a few small, specific points that are now spelled out explicitly in separate issues. Thanks guys!!!