jjhelmus / nmrglue

A module for working with NMR data in Python
BSD 3-Clause "New" or "Revised" License
211 stars 86 forks source link

Dubious nmrglue Integration examples. #44

Closed kfritzsc closed 8 years ago

kfritzsc commented 8 years ago

The two examples showing how to do integration in nmrglue don't actually do integration. They simply sum the data around the peak. The dx part is important...

In 1D spectrum (with equally spaced bins) if you sum sections of the data and compare those values the result is very 'integration like' and correct. The idea however is still incorrect. Practically speaking, this approach is problimatic If you start comparing data between multiple data sets (think zero-filing problems). It also makes error analysis more difficult. To fix the problem you simply have to actually integrate: multiply the sum with the bin width:

Example

import numpy as np
​

def gauss(x, x0, s=1):
    """Normalized Gaussian."""
    return 1/(s * np.sqrt(2*np.pi)) * np.exp(-0.5 * ((x-x0)/s)**2)
​
space = 0.2
x = np.arange(-10, 10, space)
y = gauss(x, 0)
​
space1 = 0.1
x1 = np.arange(-10, 10, space1)
y1 = gauss(x1, 0)
​
print('Sum: ',  y.sum(), y1.sum())
print('Integral: ', y.sum() * space, y1.sum() * space1)

Output. ('Sum: ', 5.0000000000000178, 10.000000000000036) ('Integral: ', 1.0000000000000036, 1.0000000000000036)

I would have just fixed the examples and submitted a pull request but wanted to consider two things:

1) What is 'process.proc_base.integ' being used for? Now it is just an aliases for np.cumsum(data, axis=-1). 2) I think there should be a function to do integration in nmrglue. It is something that most users will need. At minimum I think the function needs x (or some proxy), and y data and two x limits. Something like this would work...

def integral_ppm(data, unit_conv, lower, upper):
    """
    Integral of data within lower and upper limits defined in ppm.
    """
    Parameters
    ----------
    data : ndarray
       Array of NMR data.
    data : unit_converter
    lower: float
        lower limit of integration range in ppm
    upper: float
        upper limit of integration range in ppm
    Returns
    -------
    integral : float

Ideas?

neuroglobin commented 8 years ago

Hi, Simple sum is a routine method of doing "integration". For example, NMRPipe measures peak volumes in this way, Sparky also provides "Sum over box" and "Sum over eclipse" modes although they are not default options. I think nmrglue just does the same thing.

It seems to me that your integration method normalizes peak intensities to 1.0, which will help quantitatively comparison between spectra collected/processed differently. Is my understanding of this correct? Will you correct/scale your integrated areas/volumes by the peak heights (that are not affected significantly by zero-filling or binning)?

Thanks.

kfritzsc commented 8 years ago

I agree that the simple sum method is common and works within a single spectrum.

I propose no re-normalization. I just purpose using the definition of an integral.

If integration is done properly spectra with different bin widths (as a result of different zero-filling or dwell times, ect.) can be compared quantitatively. I do not purpose to do anything with the max intensity in the range (i.e. peak hight).

I would get the same, correct, result by printing the spectrum on paper, cutting out the peaks with scissors and then weighing the peaks with an analytical balance.

atomman commented 8 years ago

I think it is a good point. For me I would prefer the unit_conv simply to be the ppm scale for the data. I realize this It is a bit hacky to produce this scale at the moment, especially for bruker data, but that is a separate problem

kfritzsc commented 8 years ago

Thank you for the support! It would be pretty simple to write a function that takes either the unit_converter or the ppm scale as an array, one try / except AttributeError block would do.

I hope PR #43 fixes some of the ppm scale problems for Bruker data.

jjhelmus commented 8 years ago

@kfritzsc Very interesting discussion, I agree that some additional functions on this topic could be added to nmrglue.

What is and isn't "integration" very quickly descends into semantics which I do not wish to spend to much time arguing one way or the other. More practically, finding the intensity of a peak by summing over a box or rectangular region is often used in NMR to analyse a set of spectra collected with identical parameters, for example in the fitting t1 data example. In such cases, the underlying spacing need not be considered. From email I get about nmrglue and the description of how the package is used in journal articles, these "integration" examples seem to be exactly what many scientists are looking for and are quite often used with only minor modifications.

As for the more specific questions:

1) What is 'process.proc_base.integ' being used for? Now it is just an aliases for np.cumsum(data, axis=-1).

This performs "intergration" in the same manner as NMRPipe's INTEG function, by cumulative sum. The units of the integral are points (where dx is 1) which are used in nearly all the function in proc_base. This could be more explicitly explained in the docstring. Using this for analysis is probably not best, but this type of function is sometimes used to process data.

2) I think there should be a function to do integration in nmrglue. It is something that most users will need. At minimum I think the function needs x (or some proxy), and y data and two x limits. Something like this would work...

Agreed, having this function somewhere in the analysis folder is probably best. You may want to think about including support for units other than PPM such as hertz and seconds as well as finding a way to support multi-dimensional integration. The unit_conversion class might be helpful here as it is designed to convert various NMR units to an index which can be used to slice the data:

>> >uc = ng.pipe.make_uc(dic, data)
>>> print(uc("100 ppm"))
411
>>> uc("650 Hz")
768
>>> print(uc("50%"))
750
>>> print(uc("5000 us")  # time units are not too helpful in the frequency domain but can be used...
250
>>> print(uc("5 ms"))
250
jjhelmus commented 8 years ago

Closed by #45