markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

Suggestion: Coordinates API #167

Closed · clonker closed this issue 9 years ago

clonker commented 9 years ago

Suggestion about the API design: since the API is what non-developers use, it should be easy to understand and leave little room for mistakes. A suggestion is to have a construction like:

featurizer = Featurizer("test.pdb")
featurizer.add_angles(indices)

pipeline = Pipeline("input.xtc")
# actually sets the featurizer of the reader, just moved to top-level api
pipeline.set_featurizer(featurizer)

pipeline.add(api.transform.TICA, lag=2)
pipeline.add(api.clustering.kmeans, ncenters=1000)

pipeline.perform()

cov = pipeline.tica.cov
centers = pipeline.kmeans.cluster_centers

pipeline.add(api.msm.hmm, states=5)
pipeline.perform()

transition_matrix = pipeline.hmm.transition_matrix
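To make the proposed add()/perform() behaviour concrete, here is a minimal sketch of what such a Pipeline could look like internally. Everything here is illustrative only; the fit_transform protocol and the input handling are assumptions, not existing PyEMMA code.

# Illustrative sketch only: fit_transform and the input handling are assumed
# for the sake of discussion, not taken from the existing PyEMMA code base.
class Pipeline:
    def __init__(self, files, featurizer=None):
        # accept a single file name or a list of files, mirroring load()
        self.files = [files] if isinstance(files, str) else list(files)
        self.featurizer = featurizer
        self._stages = []

    def set_featurizer(self, featurizer):
        self.featurizer = featurizer

    def add(self, transformer_cls, **kwargs):
        # store the class and its kwargs; instantiation is deferred until perform()
        self._stages.append((transformer_cls, kwargs))

    def perform(self):
        data = self._read_input()
        for cls, kwargs in self._stages:
            stage = cls(**kwargs)
            data = stage.fit_transform(data)            # assumed transformer protocol
            setattr(self, cls.__name__.lower(), stage)  # e.g. pipeline.tica, pipeline.kmeans

    def _read_input(self):
        # placeholder: read self.files and apply self.featurizer
        raise NotImplementedError

With this shape, the attributes pipeline.tica and pipeline.kmeans used above simply fall out of the setattr() call in perform().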
franknoe commented 9 years ago

Please find my comments inline:

featurizer = Featurizer("test.pdb")
featurizer.add_angles(indices)
# FN: I agree

pipeline = Pipeline("input.xtc")
pipeline.set_featurizer(featurizer)
# FN: I like the idea of initializing the Pipeline with the input files. But we could pass the
# featurizer with the constructor. Arguments should be identical to the load() function
# (see my API suggestion notebook), i.e. the following constructs should work:
# p = coor.pipeline('in.xtc')     # one xtc trajectory with the default featurizer (plain coords)
# p = coor.pipeline('in.xtc', featurizer)     # one xtc trajectory with a featurizer
# p = coor.pipeline(['1.xtc','2.xtc'], featurizer)    # multiple trajectories
# p = coor.pipeline(['1.dat','2.dat'])    # read features from tabulated ascii files

pipeline.add(api.transform.TICA, lag=2)
pipeline.add(api.clustering.kmeans, ncenters=1000)
# FN: I like the idea. Avoids the step of separately creating objects. Also very general
# The downside is you have to do more in the next step when retrieving results. If you 
# had created an object named tica, you could directly access it after the perform() 
# command. So I'm not sure if it pays off.
# If we do it like this: to make clear how to use it, we should make all arguments of the
# in-memory functions and object initializations kwargs, and then allow identical kwargs
# here (just pass them on).
# We could also have keywords for the native objects 'tica', 'cluster_kmeans', e.g.:
# pipeline.add('tica', lag=2)
# This would make things more dynamic, i.e. a user could more easily pass arguments
# to a script in order to keep e.g. the clustering method flexible.

pipeline.perform()
# FN: I find the name confusing because it only fits the parameters, but does not do the 
# mapping. What about fit() or parametrize()? perform() is unclear.

cov = pipeline.tica.cov
centers = pipeline.kmeans.cluster_centers
# FN: so this means you are, upon add(), dynamically creating a variable with the name
# of the class, right? This is a very nice idea. But it should be unambiguous which variable
# names to expect. For example, class name TICA and variable name tica is confusing.
# An alternative would be that the user can define a name when adding, and that could
# be either mapped to a variable name or to a dict key, e.g. after
# pipeline.add('tica', api.transform.TICA, lag=2) one accesses with
# pipeline.get('tica').cov, or pipeline['tica'].cov

pipeline.add(api.msm.hmm, states=5)
# FN: not sure if hmm/msm should be pipeline objects, because they might have a fit(), 
# but not a map(), and they differ from transformer objects in that they don't take
# ndarrays as input.
# But it's certainly possible to do it that way. We can also enable things to be 
# pipeline objects later.

pipeline.perform()
# FN: see above

transition_matrix = pipeline.hmm.transition_matrix

# FN: Open question: How do we get actual data out? 
# For example, I want the projected tica coordinates, and I don't simply want to
# create a writer but get the result in memory.
# example: pipeline.tica.get_mapped() or .get_output()
# Note that I am using the get_ prefix here because this is not an attribute but a (heavy) function.
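
To make the two suggestions above concrete (user-defined stage names and a heavy call for in-memory output), here is a small hypothetical sketch; fit_transform is again an assumed transformer protocol and none of these names are fixed API:

# Hypothetical sketch: stages are registered under user-chosen names, the fitted
# objects are reachable via get() or item access, and outputs stay available in memory.
class NamedPipeline:
    def __init__(self, data):
        self._data = data      # input features, e.g. a list of ndarrays
        self._stages = []      # (name, class, kwargs) in insertion order
        self._fitted = {}
        self._outputs = {}

    def add(self, name, transformer_cls, **kwargs):
        self._stages.append((name, transformer_cls, kwargs))

    def parametrize(self):     # one of the names suggested above instead of perform()
        data = self._data
        for name, cls, kwargs in self._stages:
            stage = cls(**kwargs)
            data = stage.fit_transform(data)   # assumed transformer protocol
            self._fitted[name] = stage
            self._outputs[name] = data

    def get(self, name):         # pipeline.get('tica').cov
        return self._fitted[name]

    __getitem__ = get            # pipeline['tica'].cov

    def get_output(self, name):  # heavy call: the mapped data of the named stage
        return self._outputs[name]

Usage would then read p.add('tica', api.transform.TICA, lag=2); p.parametrize(); cov = p['tica'].cov; Y = p.get_output('tica').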
marscher commented 9 years ago

branch "refactor_discretizer" is the place where we will implement new stuff related to the api.

franknoe commented 9 years ago

Thanks! Working on it now.


franknoe commented 9 years ago

Here we go:

franknoe commented 9 years ago
marscher commented 9 years ago

We should think of a new name for the "input" API function, since it shadows a Python builtin.
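
A tiny generic illustration of the shadowing problem (the function below is a stand-in, not PyEMMA code):

# Generic illustration only: a module-level function called input() shadows the
# builtin for anyone working inside that module or using a star import.
def input(files):
    # stand-in for the API function under discussion
    return list(files) if isinstance(files, (list, tuple)) else [files]

reader = input('traj.xtc')            # calls the API function, as intended
answer = input('Continue? [y/n] ')    # silently wraps the prompt string instead of asking the user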

franknoe commented 9 years ago

I am closing this issue because it is done, apart from a few small, specific points that are now spelled out explicitly in separate issues. Thanks guys!!!