jmborr / idpflex

Analysis of intrinsically disordered proteins by comparing MD simulations to Small Angle Scattering experiments
http://idpflex.readthedocs.io/en/latest/
MIT License

Ribosome Fitting #104

Closed ConnorPigg closed 5 years ago

ConnorPigg commented 5 years ago

Fitting the data associated with the ribosome has been challenging. This issue collects the discussion, problems, tasks, and solutions in one place so the information is easy to find. It also documents the information and updates as an example of working with idpflex.

TODO

ConnorPigg commented 5 years ago

Attempting to address robustness by manually changing a model's residual function to penalize probability totals that differ from 1. This will be implemented by removing the bounds on struct0_prob_c and adding diff *= np.exp(10*abs(1 - ptotal)) to the existing residual function.
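A minimal sketch of such a penalized residual, for illustration only. The function name `penalized_residual`, its arguments, and the error-weighted form of `diff` are assumptions and are not taken from idpflex; only the multiplicative factor comes from the comment above.

```python
import numpy as np


def penalized_residual(model_y, exp_y, exp_e, probabilities, strength=10.0):
    """Error-weighted residual, inflated when the probabilities do not sum to 1.

    All names here are hypothetical; the weighted difference is an assumed
    form of the existing residual.
    """
    diff = (np.asarray(model_y) - np.asarray(exp_y)) / np.asarray(exp_e)
    ptotal = float(np.sum(probabilities))
    # Factor from the comment above: equals 1 when ptotal == 1 and grows
    # exponentially as the total drifts away from 1.
    return diff * np.exp(strength * abs(1.0 - ptotal))


# Illustrative call with made-up numbers
residual = penalized_residual([2.0, 3.0], [1.0, 1.0], [1.0, 2.0], [0.5, 0.6])
```

Because the factor multiplies every residual element, any deviation of the total from 1 raises the whole chi-square rather than adding a separate term.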

If changing the residual function alone does not solve the robustness issue, then fitting can be performed repeatedly, adjusting the initial parameters between fits. The parameters can be updated using something like the following.

import random

# Draw each probability from the budget that remains, so the total never exceeds 1
upper_bound = 1
for prob in probability_params:
    new_val = random.uniform(0, upper_bound)
    upper_bound -= new_val
    prob.set(value=new_val)

ConnorPigg commented 5 years ago

The Pearson correlation coefficient between the probabilities and the "Rg VMD filled" values, with the fit using both SANS and SAXS data, was very low (0.0637 using the vacant experimental fit, with similar values for the structured and linear fits), indicating that the fitting is not simply choosing the largest Rg value.

jmborr commented 5 years ago

ummm How about only the X-ray data? It has much better resolution at low-Q which is the area important for Rg determination...

ConnorPigg commented 5 years ago

For X-ray only, it was again very low (0.06367), and for SANS only it was 0.09762, again using "Rg VMD filled" from the spreadsheet. This would seem to indicate that differences in theoretical Rg do not strongly determine the fit.

It may be interesting to use the integer ranking instead to perform the calculation. This would provide more spread to the probability data, as opposed to having many values clumped near 0 and a few outliers that are orders of magnitude larger.

Edit 1: Using the integer ranking brings the correlation up to about 0.23. This is still small, but it identifies a larger impact than the probabilities would suggest.

Edit 2: Using integer ranking for both Rg and the probabilities brings the correlation up to 0.34. Once again, this is small, but it compares the orderings, which is what was initially desired.
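One possible reading of "integer ranking" is to replace each value by its position in sorted order before computing the Pearson coefficient. When both variables are ranked this way, the result is Spearman's rank correlation. The sketch below assumes that reading; the helper names and the numbers are made up for illustration and are not the ribosome data.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def integer_rank(values):
    """Rank 0 for the smallest value; ties broken by order of appearance."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(len(order))
    return ranks


# Made-up example: probabilities clumped near 0 with one large outlier
rg = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
prob = np.array([1e-6, 1e-5, 1e-4, 1e-3, 0.99])

raw = pearson(rg, prob)                                 # raw values
ranked = pearson(integer_rank(rg), integer_rank(prob))  # orderings only
```

Ranking discards magnitudes, so a few probabilities that are orders of magnitude larger than the rest no longer dominate the coefficient.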

jmborr commented 5 years ago

"integer ranking" ?

ConnorPigg commented 5 years ago

Rescaling

Rescaling was tested on a single leaf node. The test involved:

  1. fitting without scaling
  2. fitting by scaling the experimental profile and errors
  3. fitting by scaling the theoretical profile

The scaling was done by using a factor scale = max(leaf.y)/max(exp.y).

# The experiment was rescaled by
exp.y *= scale
exp.e *= scale
# while the model was rescaled by
leaf.y /= scale

The fits were then performed to find the slope arguments. The slope from scaling the experiment was divided by the scaling factor and used as the slope in the eval function of the unscaled fit (1). Similarly, the slope from scaling the model was multiplied by the scaling factor and used in the eval function of (1). These slopes were all close in value, resulting in similar residuals. The reduced chisqr values for (2) and (3) were the same to several digits and ended up being smaller than the reduced chisqr found without scaling. Scaling the experimental data is simpler and is the method chosen for future fitting.

The scaling factor was then generalized to account for more than one structure: scale = sum(max(struct.y) for struct in structs_to_be_fit)/(len(structs_to_be_fit) * max(exp.y)), that is, the average maximum structure value divided by the maximum experimental value.
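A minimal sketch of the generalized scaling factor, applied to the experimental profile and its errors. The function names are hypothetical and plain NumPy arrays stand in for the profile objects, so this is not the idpflex API. With a single structure the factor reduces to max(leaf.y)/max(exp.y).

```python
import numpy as np


def scale_factor(struct_ys, exp_y):
    """Average of the per-structure maxima divided by the experimental maximum."""
    return sum(np.max(y) for y in struct_ys) / (len(struct_ys) * np.max(exp_y))


def rescale_experiment(exp_y, exp_e, scale):
    """Return scaled copies of the experimental intensities and errors."""
    return np.asarray(exp_y) * scale, np.asarray(exp_e) * scale


# Made-up profiles for illustration
struct_ys = [np.array([2.0, 4.0]), np.array([1.0, 8.0])]
exp_y = np.array([1.0, 3.0])
exp_e = np.array([0.1, 0.3])

scale = scale_factor(struct_ys, exp_y)
scaled_y, scaled_e = rescale_experiment(exp_y, exp_e, scale)
```

Scaling the errors by the same factor keeps the relative uncertainty of each point unchanged.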