jmborr / idpflex

Analysis of intrinsically disordered proteins by comparing MD simulations to Small Angle Scattering experiments
http://idpflex.readthedocs.io/en/latest/
MIT License

Allow arbitrary mathematical models for each property when fitting #103

Open ConnorPigg opened 5 years ago

ConnorPigg commented 5 years ago

Generalized Fitting Models

Project: idpflex
Author: Connor Pigg
Date: 25-June-2019

Summary

The proposed change is to the fitting interface, which should allow an arbitrary model to be applied to each property during fitting. Currently, only linear models with an optional centering parameter are supported. The change should make it possible to apply exponential, Gaussian, and other models to a particular property. Generalizing the current interface also means that the set of available models can later be extended easily, either within idpflex or by users.

Goals

Provide a utility for fitting idpflex data structures using arbitrary sets of lmfit models. The solution should be simpler than what a user can achieve by interacting directly with lmfit. This can be achieved since idpflex can leverage details of data structures which may be unfamiliar to users (tree, PropertyDict, Parameters, etc.). Example use cases include fitting a tree of properties or fitting to an arbitrary set of property groups.

Non Goals

The solution should not implement a new parameter, model, or fitting interface or structure. It should not hide its internal structure or process; instead, it should make an effort to expose the model, parameters, and fitting to the user. The solution for multiple structures will be a linear combination of the structures. The models applied to properties are expected to be independent of the other properties. Multiple properties will be concatenated to create a feature vector.

Proposed Design

The design will be layered. First, a generic solution for creating a model of (potentially) multiple properties will be developed. Second, a multi-structure model can be made by linearly combining the multi-property models. Finally, a function for modelling and fitting a tree will be described.

Model Requirements

For consistency, every model applied to a property should take a keyword argument prop during initialization and should have a single independent variable named x. This somewhat conflicts with the non-goal of not changing the model interface, since it places restrictions on models. An alternative approach would be nice. An example model is below.

import lmfit

def line(x, slope, intercept, prop=None):
    # x is the required independent variable; the profile itself comes from prop.y
    return slope*prop.y + intercept
LinearModel = lmfit.Model(line)

Multiple Property Fitting

The first aspect of the proposed design is to have a function that will accept a PropertyDict and a dictionary/list of lmfit models. This will provide a simple interface for fitting multiple properties of a single structure with arbitrary models. The container of models should have a model for each property. In the case of a single model instead of a container of models, the same model will be applied to each property. The function will apply the appropriate model to each property and create a composite model of the concatenation of these models. The composite model will have a complete set of parameters for all of the sub-models which will be prefixed by the property name.
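The handling of the model container described above could look roughly like the sketch below. It is only an illustration: `models_by_property` is a hypothetical helper name, and plain strings stand in for lmfit model objects.

```python
def models_by_property(property_names, models):
    """Map each property name to its model.

    models may be a dict keyed by property name, a list/tuple given in
    property order, or a single model that is reused for every property.
    """
    if isinstance(models, dict):
        missing = [n for n in property_names if n not in models]
        if missing:
            raise ValueError(f'no model given for properties: {missing}')
        return {n: models[n] for n in property_names}
    if isinstance(models, (list, tuple)):
        if len(models) != len(property_names):
            raise ValueError('need exactly one model per property')
        return dict(zip(property_names, models))
    # A single model: apply the same one to every property.
    return {n: models for n in property_names}


names = ['sans', 'saxs']
from_list = models_by_property(names, ['linear', 'constant'])
from_single = models_by_property(names, 'linear')
```

Normalizing to a dictionary keyed by property name up front would keep the rest of the composite-model code independent of how the user supplied the models.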

The example below minimizes the following residuals, with optional weights: sans_ws*((sans_slope*sansProp + sans_intercept) - exp_sansProp) and saxs_ws*(saxs_c - exp_saxsProp).

properties = PropertyDict([sansProp, saxsProp])
exp_properties = PropertyDict([exp_sansProp, exp_saxsProp])
models = [LinearModel, ConstantModel]
multiproperty_model = create_model_from_property_group(properties, models, ws=None)
# Yielding
# multiproperty_model.make_params() == Parameters([Parameter('sans_slope', ...),
#                                                  Parameter('sans_intercept', ...),
#                                                  Parameter('saxs_c', ...)])
# The parameter constraints, values, etc could then be changed.
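params = multiproperty_model.make_params()
for name, param in params.items():  # Parameters is dict-like; iterating it directly yields names
    if 'intercept' in name:
        pass  # e.g. param.set(min=0) or param.set(vary=False)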
# Which can be fit using
multiproperty_fit = multiproperty_model.fit(exp_properties.feature_vector,
                                            x=exp_properties.feature_domain,
                                            weights=None,  # could be changed
                                            method='leastsq',  # could be changed
                                            params=params)
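To make the prefixing and concatenation concrete, here is a minimal plain-Python sketch of what such a composite could do internally. It deliberately avoids lmfit so that it runs on its own; `Prop`, `compose`, and the list-based model functions are hypothetical stand-ins, not idpflex or lmfit API.

```python
import inspect
from collections import namedtuple

# Hypothetical stand-in for an idpflex property: a name plus a computed profile y.
Prop = namedtuple('Prop', ['name', 'y'])


def line(x, slope, intercept, prop=None):
    """Scale and shift the profile stored in prop (x is unused here)."""
    return [slope * v + intercept for v in prop.y]


def constant(x, c, prop=None):
    """Return a flat level with the same length as the profile."""
    return [c for _ in prop.y]


def compose(props, funcs):
    """Return (param_names, evaluate) for the concatenated model.

    Each function's parameters are prefixed with '<property name>_' so
    that names stay unique across the sub-models.
    """
    layout = []
    for prop, func in zip(props, funcs):
        names = [n for n in inspect.signature(func).parameters
                 if n not in ('x', 'prop')]
        layout.append((prop, func, names))
    param_names = [f'{p.name}_{n}' for p, _, names in layout for n in names]

    def evaluate(params, x=None):
        out = []
        for prop, func, names in layout:
            kwargs = {n: params[f'{prop.name}_{n}'] for n in names}
            out.extend(func(x, prop=prop, **kwargs))  # concatenate outputs
        return out

    return param_names, evaluate


sans = Prop('sans', [1.0, 2.0, 3.0])
saxs = Prop('saxs', [5.0, 6.0])
param_names, evaluate = compose([sans, saxs], [line, constant])
values = evaluate({'sans_slope': 2.0, 'sans_intercept': 1.0, 'saxs_c': 4.0})
```

The output has one entry per point of every property, in property order, which is the shape the experimental feature vector would need to match.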

Multiple Structure Fitting

The above will be used as an internal building block for multi-structure fitting. The goal will be to provide a function that will take a list of property structures and a container of models. The container of models will be directly passed to the function described above. The result will be a single model that is composed of the output of the above. The parameters that are in common across structures will be linked to have the same shared value. Additionally, the structures will be linearly combined using probabilities that sum to one.

Note: lmfit requires every sub-model to have unique parameter names, which is why this approach creates a large number of redundant parameters.
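As a rough illustration of the linear combination, the sketch below combines per-structure feature vectors with probabilities that are forced to sum to one by deriving the last probability from the others. `combine_structures` is a hypothetical name and the vectors are made-up numbers. In lmfit, a similar effect could presumably be obtained with a constraint expression on the last probability parameter, but that is an assumption rather than part of the proposal.

```python
def combine_structures(vectors, free_probs):
    """Linearly combine per-structure feature vectors.

    free_probs holds the probabilities of all structures but the last;
    the last one is derived so that the probabilities sum to one.
    """
    if len(free_probs) != len(vectors) - 1:
        raise ValueError('expected one free probability fewer than structures')
    probs = list(free_probs) + [1.0 - sum(free_probs)]
    length = len(vectors[0])
    combined = [sum(p * v[i] for p, v in zip(probs, vectors))
                for i in range(length)]
    return probs, combined


# Feature vectors that two structures' composite models might have produced.
struct_0 = [1.0, 2.0, 3.0]
struct_1 = [3.0, 2.0, 1.0]
probs, combined = combine_structures([struct_0, struct_1], [0.25])
```

Deriving one probability from the rest leaves one fewer free parameter than there are structures, which is one way to keep the sum-to-one condition without a separate penalty term.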

An example which minimizes the following equations using optional weights.

Tree Fitting

Tree fitting will use multi-structure fitting at each depth in the same procedure as currently available in idpflex. The function will take a tree filled with PropertyDict and a list of models, one for each property. It will output a list of multi-structure models described above. A utility fitting function (the same as currently available) can be used to fit every depth of the tree.

Additional Considerations

Parameter Initializations

At which step should parameters be initialized, bounded, etc.? Probabilities can be initialized to equal values for all structures. What responsibility does idpflex have for initializing parameters? (lmfit leaves it to the user or model creator.) Initial values can (should?) be given as defaults in the function definition, e.g. def func(x, slope=1, intercept=0, prop=None):.
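One possible answer is sketched below: starting values are read from the numeric keyword defaults of the model function, and probabilities start out equal. `initial_values` and `equal_probabilities` are hypothetical helpers written with the standard library only. lmfit, as far as I understand it, already treats numeric keyword defaults in a similar way when it builds parameters.

```python
import inspect


def func(x, slope=1, intercept=0, prop=None):
    return slope * prop.y + intercept


def initial_values(function, skip=('x', 'prop')):
    """Collect numeric keyword defaults of a model function as starting values."""
    values = {}
    for name, par in inspect.signature(function).parameters.items():
        if name in skip:
            continue
        default = par.default
        # Arguments without a default, or with a non-numeric one, are left out.
        if isinstance(default, (int, float)) and not isinstance(default, bool):
            values[name] = default
    return values


def equal_probabilities(n_structures):
    """Give every structure the same starting probability."""
    return [1.0 / n_structures] * n_structures


start = initial_values(func)
probs = equal_probabilities(4)
```

Parameters without a numeric default are simply left out here, so the user (or a guess method) would still have to supply them.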

Should all models be required to implement a guess method? This could potentially simplify initialization but increases the difficulty for users to create custom models.

Parameter Adjustments

These methods can create complex models with large, complicated sets of parameters. It may be useful to provide utility functions for working with these parameters, similar to the utility fitting function which maps over models. For example, mapping over all parameters and setting the min value of slopes to 0, or setting all constants to not vary:

new_params = idpflex.bayes.apply_to_constants(params, vary=False)

This is likely unnecessary and tricky to generalize. A user should be able to iterate over the parameters themselves. Instead, examples, a tutorial, or documentation can be provided somewhere to demonstrate.
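A documentation example for this kind of iteration might look like the sketch below. `apply_to_params` is a hypothetical helper, and `SimpleNamespace` objects stand in for lmfit Parameter objects so the snippet runs without lmfit. The assumption is that the real parameters expose settable `min` and `vary` attributes and that the container supports `.items()`.

```python
from types import SimpleNamespace


def apply_to_params(params, predicate, **attrs):
    """Set attributes on every parameter whose name satisfies predicate."""
    for name, par in params.items():
        if predicate(name):
            for key, value in attrs.items():
                setattr(par, key, value)
    return params


# Stand-ins for lmfit Parameter objects (only the attributes used here).
params = {
    'sans_slope': SimpleNamespace(value=1.0, min=float('-inf'), vary=True),
    'sans_intercept': SimpleNamespace(value=0.0, min=float('-inf'), vary=True),
    'saxs_c': SimpleNamespace(value=0.0, min=float('-inf'), vary=True),
}
apply_to_params(params, lambda n: n.endswith('_slope'), min=0)  # slopes >= 0
apply_to_params(params, lambda n: n.endswith('_c'), vary=False)  # fix constants
```

Selecting by name suffix relies on the property-name prefix convention described earlier, which is part of why a fully general utility is hard to design.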

Model Creation

Since the proposed model interface requires an independent variable 'x' and a keyword argument 'prop', the models in lmfit.models cannot be used directly. This could be remedied by duplicating them in idpflex in compatible forms, but that would inflate the codebase and the duplicates are unlikely to be widely used. Instead, a handful of property-compatible models (Linear, Constant, Gaussian, etc.) can be provided by idpflex and serve as examples of model creation.

The model creation methods could also accept functions or lambdas directly, to avoid the pattern below.

def func(x, ..., prop=None):
  return ...

NewModel = lmfit.Model(func)
create_model = create_model_from_property_group(properties, NewModel)

Instead, the following would be allowed.

def func(x, ..., prop=None):
  return ...

create_model = create_model_from_property_group(properties, func)
# or
create_model = create_model_from_property_group(properties, lambda x, ..., prop: ...)
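Internally, accepting either form could come down to a small normalization step like the one sketched here. The `Model` class is a minimal stand-in for lmfit.Model so the snippet runs on its own, and `as_model` is a hypothetical helper name.

```python
class Model:
    """Minimal stand-in for lmfit.Model; it only stores the wrapped function."""

    def __init__(self, func):
        self.func = func


def as_model(model_or_func, model_class=Model):
    """Return a model, wrapping bare functions and lambdas on the fly."""
    if isinstance(model_or_func, model_class):
        return model_or_func  # already a model: pass it through untouched
    if callable(model_or_func):
        return model_class(model_or_func)
    raise TypeError('expected a model or a callable, got '
                    f'{type(model_or_func).__name__}')


def line(x, slope, intercept, prop=None):
    return slope * prop.y + intercept


existing = Model(line)
wrapped = as_model(line)
unchanged = as_model(existing)
from_lambda = as_model(lambda x, c, prop=None: c)
```

Checking for an existing model before checking for a callable keeps the helper safe even if model objects happen to be callable themselves.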

Structure Combination

The proposed solution exclusively supports combining structures linearly, each weighted by a variable 'probability'. There may be other desired methods for combining structures (quadratic?), but these are outside the scope of the proposed solution. They could be achieved in a hack-y fashion by creating "new" structures in the desired fashion and running those through the multi-structure fitting. Furthermore, the probability parameters are exposed to the user, allowing customization.

ConnorPigg commented 5 years ago

@jmborr Here is my outlined proposal for approaching the generic modeling. Do you have any comments or suggestions for the implementation or final interfaces?