rs-station / careless

Merge X-ray diffraction data with Wilson's priors, variational inference, and metadata

Feedback on your preprint #4

Closed · biochem-fan closed this 1 year ago

biochem-fan commented 3 years ago

Hi, I really enjoyed your preprint! An homage to AIMLESS is also nice :)

It is impressive to see that a neural network and variational inference framework can "learn" a scaling model that performs almost as well as existing programs with hand-implemented physical models. I am looking forward to seeing whether it can perform even better in the future.

Below are my comments.

Introduction (Section 1)

scale parameters are learned per image

This is not necessarily true. Scale factors might depend not only on time (i.e. image number = goniometer angle) but also on the position on the detector, for example.

AIMLESS uses spherical harmonics and smoothly time-varying functions.

XSCALE decomposes the scaling function into a product of three functions defined on 2D grids (see the sketch after this list): http://xds.mpimf-heidelberg.mpg.de/html_doc/xscale_parameters.html#CORRECTIONS=

  1. DECAY: image number vs resolution
  2. MODULATION: spot position on the detector (X vs Y)
  3. ABSORPTION: image number vs 13 detector positions
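
For concreteness, a toy sketch of this product-of-grids idea in Python. The grid shapes and lookup scheme are my assumptions for illustration, not XSCALE's actual implementation:

```python
import numpy as np

# Grids for the three corrections; shapes are illustrative.
n_images, n_res_bins = 100, 20       # DECAY: image number x resolution bin
nx, ny = 32, 32                      # MODULATION: detector X x detector Y
n_surface = 13                       # ABSORPTION: image number x 13 positions

decay = np.ones((n_images, n_res_bins))
modulation = np.ones((nx, ny))
absorption = np.ones((n_images, n_surface))

def total_scale(image, res_bin, det_x, det_y, surface):
    """Total scale factor = product of the three corrections."""
    return (decay[image, res_bin]
            * modulation[det_x, det_y]
            * absorption[image, surface])

k = total_scale(image=5, res_bin=3, det_x=10, det_y=12, surface=7)
```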

Model complexity

How did you decide the structure of the neural network, such as the number of layers? Is it by cross-validation? Should it be adjusted depending on the size and complexity of the dataset (e.g. a single sweep, a few crystals, or tens of thousands of crystals)?

Choice of metrics

CC1/2

Was the network trained independently for each half set or not? In cryo-EM, it is common to split half sets at the beginning and optimize parameters independently, while in crystallography this is not necessarily done. It is worth mentioning your choice.

Anomalous half dataset correlation

This metric is very sensitive to systematic bias. Since you have an atomic model, anomalous correlation between scaled intensities and intensities calculated from the model might be better.
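
A minimal sketch of this suggested metric, assuming the model-derived intensities are computed elsewhere (everything here is illustrative):

```python
import numpy as np

def anomalous_cc_to_model(i_plus_obs, i_minus_obs, i_plus_calc, i_minus_calc):
    """Pearson correlation between observed and model anomalous differences.

    All four arrays hold intensities for the same set of acentric
    reflections; the *_calc values are assumed to come from an atomic
    model (computing them is outside this sketch).
    """
    d_obs = np.asarray(i_plus_obs) - np.asarray(i_minus_obs)
    d_calc = np.asarray(i_plus_calc) - np.asarray(i_minus_calc)
    return np.corrcoef(d_obs, d_calc)[0, 1]
```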

Comparison with existing approaches

Can you compare your results with those from existing programs? You might not have Precognition at hand but running cxi.merge with and without post-refinement and partiality correction on the XFEL dataset should be trivial.

Model interpretation (Section 4.1)

Can you interrogate the neural network by for example plotting its output when position on the detector is changed while other metadata are kept fixed?
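
A minimal sketch of such a probe, assuming a trained `scale_model` callable and a hypothetical metadata column layout (image number, resolution, detector X, detector Y):

```python
import numpy as np
import matplotlib.pyplot as plt

def probe_detector_response(scale_model, fixed_metadata, nx=64, ny=64):
    """Sweep detector X/Y over a grid while holding all other metadata
    fixed, and plot the predicted scale as a heatmap."""
    xs, ys = np.meshgrid(np.linspace(0.0, 1.0, nx), np.linspace(0.0, 1.0, ny))
    meta = np.tile(np.asarray(fixed_metadata), (nx * ny, 1))
    meta[:, 2] = xs.ravel()  # hypothetical detector X column
    meta[:, 3] = ys.ravel()  # hypothetical detector Y column
    scales = np.asarray(scale_model(meta)).reshape(ny, nx)
    plt.imshow(scales, origin="lower")
    plt.colorbar(label="predicted scale")
    plt.show()
```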

kmdalton commented 3 years ago

Welcome @biochem-fan, and thanks for some seriously helpful feedback! We're looking to do a revised version of the manuscript shortly, so your comments are timely and useful. It's also neat to be getting feedback through GitHub which is definitely a first for me.

Introduction (Section 1)

I will try to beef up this section by including more details such as the spherical harmonic absorption corrections. This is decidedly a weak point. Thanks!

Model complexity

This part has been rather ad hoc. The neural net that careless uses is a very simple multi-layer perceptron. I haven't tried to do anything fancy with the architecture yet, although that is something I am actively working on as I attempt to extend the model to address larger datasets. Regardless, I determined early on that 20 layers was sufficient to scale all of my test data sets, which vary dramatically in terms of size and experiment type. I have tried larger nets, and anecdotally they do not seem prone to overfitting. Although overfitting is a legitimate concern, I haven't encountered it in practice yet. I will think about adding a section which explores the number of layers vs. cross-validation performance. I don't think that should be too difficult to include in a future version.
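
For illustration, a minimal TensorFlow sketch of the kind of plain MLP described here. The depth of 20 echoes the text above; the width, activation, and two-parameter output head are my guesses, not the actual careless architecture:

```python
import tensorflow as tf

def make_scale_mlp(n_metadata, depth=20, width=32):
    """A plain multi-layer perceptron from reflection metadata to the
    parameters of a per-observation scale distribution."""
    layers = [tf.keras.layers.InputLayer(input_shape=(n_metadata,))]
    layers += [tf.keras.layers.Dense(width, activation="relu")
               for _ in range(depth)]
    layers += [tf.keras.layers.Dense(2)]  # e.g. location and raw scale
    return tf.keras.Sequential(layers)

model = make_scale_mlp(n_metadata=5)
```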

CC1/2

Our CC1/2 values are computed by first dividing the data in two and training the model on each half independently. In that sense they are likely to be more pessimistic than the classic CC1/2 values you get from conventional scaling and merging programs.
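
Schematically, the procedure looks like this, with `merge_fn` a hypothetical stand-in for the entire scaling-and-merging pipeline:

```python
import numpy as np
import pandas as pd

def half_dataset_cc12(obs, merge_fn, seed=0):
    """CC1/2 computed as described above: randomly split the unmerged
    observations in two, run the full scaling+merging procedure on each
    half independently, then correlate the merged intensities.

    `obs` is a DataFrame of unmerged observations; `merge_fn` returns a
    DataFrame indexed by hkl with an intensity column "I".
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(len(obs)) < 0.5
    half_a = merge_fn(obs[mask])
    half_b = merge_fn(obs[~mask])
    joined = half_a.join(half_b, how="inner", lsuffix="_a", rsuffix="_b")
    return joined["I_a"].corr(joined["I_b"])
```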

Anomalous half dataset correlation

This is simple enough to add. I'll have a go. Thanks!

Comparison with existing approaches

I am planning to run cxi.merge and/or prime before the next draft. I wouldn't be shocked if these sorts of domain-expert programs outperform careless. It will be nice to understand that, though.

Regarding Precognition, we do currently have a license and might explore running it. Because it is basically unheard of to run Precognition without an I/Sigma cutoff, I'd worry about how to make a fair comparison with careless. I'd be happy to entertain any thoughts on the matter.

Model interpretation (Section 4.1)

I could imagine playing some games like that. It'd be a decent amount of work to implement right now, so it's not super high priority for me. In the long term, I am less interested in trying to "open the black box" and understand what the multi-layer perceptron is doing and more interested in replacing it with more transparent physical models.

Thanks again! Have a nice weekend.

biochem-fan commented 3 years ago

I have tried larger nets, and anecdotally they do not seem prone to overfitting. Although overfitting is a legitimate concern, I haven't encountered it in practice yet.

This is an interesting observation.

A related note, not about overfitting but on the uniqueness of the solution: scaling has many equivalent (degenerate) solutions. For example, you can multiply all scaling factors by a constant, or apply a B factor. This only changes the scale of the merged intensities, and all such solutions are equally valid. To make the solution unique, traditional scaling programs arbitrarily fix the scale and B factor of one input (e.g. the first frame or the first crystal). Does CARELESS have this degeneracy?

In other applications of neural networks, it is often said/observed (but not proven?) that SGD is only a local optimizer but finds one of many similarly good solutions. So this probably does not matter in practice, but it is worth keeping in mind when comparing maps or anomalous peak heights. Maps look very different at different B factors (consider blurring a 1.0 Å map with a B factor of 100; it will look like a 3 Å map). Anomalous peak height (sigma level) is a real-space metric, so it also depends on the B factor. High B factors usually lead to lower sigma levels (because of peak blurring) but sometimes increase them if they suppress noisy high-resolution components.
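
A toy numeric check of this degeneracy: rescaling all scale factors by a constant and a B factor, with a compensating change to the merged intensities, leaves the predicted observations untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
s2 = rng.uniform(0.0, 0.25, n)        # sin^2(theta)/lambda^2 per observation
I = rng.exponential(1.0, n)           # merged intensities ("solution" A)
k = rng.uniform(0.5, 2.0, n)          # scale factors ("solution" A)

c, B = 3.0, 20.0                      # arbitrary constant and B factor
k2 = c * np.exp(-2.0 * B * s2) * k    # rescaled scales ("solution" B)
I2 = np.exp(2.0 * B * s2) / c * I     # compensating intensities ("solution" B)

# Both "solutions" predict exactly the same observations:
assert np.allclose(k * I, k2 * I2)
```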

In the long term, I am less interested in trying to "open the black box" and understand what the multi-layer perceptron is doing and more interested in replacing it with more transparent physical models.

Yes, your variational inference framework naturally allows combining physical models with black-box (neural network) models. Considering what is better implemented as a physical model and what should be left to be learned from the data is a very interesting question for the future.

kmdalton commented 3 years ago

Because we impose Wilson's prior on the structure factors, we avoid the degeneracy inherent in simultaneously optimizing the merged structure factors along with the scale function. Therefore, careless only has degeneracy to the extent that neural networks have degenerate local minima. Global optimization of non-convex objectives is a hard problem, and in the case of neural nets it is an active area of research. I'm unsure what the current "state of the art" perspective is on the optimization landscape of feed-forward neural nets. Regardless, it is not de rigueur to fix any parameters during optimization of MLPs. My preference is to wait for the theory to settle down before making any strong claims about optimality.
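
For reference, the acentric Wilson prior in its standard textbook form (not copied from the careless source):

```latex
% Acentric Wilson prior on the structure factor amplitude F_h;
% Sigma_h is the expected intensity at the resolution of h.
P(F_h) = \frac{2 F_h}{\Sigma_h} \exp\!\left(-\frac{F_h^2}{\Sigma_h}\right)
```

Because this density is not invariant under rescaling F_h, a compensating rescaling of the scales and structure factors changes the posterior, which is what pins the overall scale.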

That being said, I don't believe that the scaling loss function, even in the least squares case, has ever been shown to be convex. So, there have never been any guarantees of global optimality even with the parameters carefully fixed to avoid degeneracy.

When processing serial data for the preprint, I did find it necessary to add per-image scale parameters. These are implemented in the classical way with one of the scales arbitrarily fixed to one. However, it is also possible to break the degeneracy by imposing a prior distribution over scale factors. This is an experimental feature which is currently under development. In the future, I hope to provide a flexible set of priors on image scale distributions which beamline scientists can tailor for each end station by cross-validation.
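
A schematic of the two conventions just described (illustrative only, not the careless implementation):

```python
import numpy as np

n_images = 100

# Convention 1: fix one scale. Optimize log_k[1:] only, keeping
# log_k[0] = 0 so the first image has scale exactly 1.
log_k = np.zeros(n_images)

# Convention 2: leave all scales free and break the degeneracy with a
# prior instead, e.g. a normal prior on log-scales centered on scale 1;
# sigma is the kind of knob that could be tuned per end station by
# cross-validation.
def scale_log_prior(log_k, sigma=0.5):
    return np.sum(-0.5 * (log_k / sigma) ** 2
                  - np.log(sigma * np.sqrt(2.0 * np.pi)))
```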