qzhu2017 / XRD

X-ray diffraction calculations
MIT License
7 stars · 6 forks

Validation Dataset #10

Closed · qzhu2017 closed this issue 4 years ago

qzhu2017 commented 4 years ago

@sayred1

After you implemented the similarity function, I suggest you collect a database for systematic validation. Similar to what you have done for NaCl, select each representative structure from space group 1-230. Collect the experimental PXRD and compute the PXRD from Mercury and our own code. And then compute the similarity.

I will tell you more details on Friday.

qzhu2017 commented 4 years ago

@sayred1 For the first step, I suggest you do the following

I suppose we should get something close to 1 for each value. Let's see.

qzhu2017 commented 4 years ago

@sayred1

http://rruff.geo.arizona.edu/AMS/amcsd.php

We will worry about the profiling function later.
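In case it helps later, the profiling step amounts to broadening the stick pattern into a continuous curve. A minimal sketch, assuming a Gaussian profile and a user-supplied FWHM in degrees 2θ (the function name and defaults are illustrative, not from the repo):

```python
import numpy as np

def gaussian_profile(two_thetas, intensities, fwhm=0.9, x=None):
    """Broaden a stick pattern into a continuous profile.

    Each reflection becomes a Gaussian centered at its 2theta position,
    with the given full width at half maximum (degrees).
    """
    if x is None:
        x = np.linspace(0.0, 180.0, 5000)
    # convert FWHM to a Gaussian standard deviation
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    y = np.zeros_like(x)
    for t, inten in zip(two_thetas, intensities):
        y += inten * np.exp(-((x - t) ** 2) / (2.0 * sigma ** 2))
    return x, y
```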

sayred1 commented 4 years ago

@qzhu2017 I've finished calculating the similarity over the validation set. In total there are 288/370 structures (not the full set, either because of partial occupancy or because of the website issue I ran into).

Here is the histogram of similarities for all structures:

[screenshot: Screen Shot 2020-01-17 at 5 08 16 PM]

I've also sorted through the structures and classified them into the 7 geometric classes, based on their cell parameters (a, b, c, alpha, beta, gamma). None of the structures from the site were monoclinic, and I've grouped the trigonal and hexagonal structures together.

[screenshot: Screen Shot 2020-01-17 at 5 10 57 PM]

At first glance these plots look promising, since the distribution of similarities in each class is concentrated near 1.0. I can look further into this if need be.
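The bucketing into geometric classes described above can be sketched directly from the cell parameters. This is an illustrative stand-in, not the repo's script: the tolerance is arbitrary, and strictly the crystal system is fixed by the space group, so a metrically pseudo-symmetric cell can fool a purely metric test:

```python
def crystal_system(a, b, c, alpha, beta, gamma, tol=1e-3):
    """Guess the crystal system from cell parameters (angles in degrees).

    Illustrative only: the true system is determined by the space group;
    a metric coincidence (e.g. a pseudo-cubic cell) can mislead this test.
    """
    eq = lambda x, y: abs(x - y) < tol
    right = [eq(ang, 90.0) for ang in (alpha, beta, gamma)]
    if eq(a, b) and eq(b, c):
        if all(right):
            return 'cubic'
        if eq(alpha, beta) and eq(beta, gamma):
            return 'trigonal'        # rhombohedral setting
    if eq(a, b):
        if all(right):
            return 'tetragonal'
        if right[0] and right[1] and eq(gamma, 120.0):
            return 'hexagonal'       # also trigonal in the hexagonal setting
    if all(right):
        return 'orthorhombic'
    if right[0] and right[2]:
        return 'monoclinic'          # unique axis b
    return 'triclinic'
```

With this convention, trigonal cells in the hexagonal setting land in the same bucket as hexagonal ones, matching the grouping used above.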

sayred1 commented 4 years ago

The above was with FWHM = 0.9. When we decrease it, say to 0.2, this is what the distributions look like:

[screenshot: Screen Shot 2020-01-17 at 5 52 14 PM]

Shortly I'll upload some of my scripts. For now I'll begin looking into that other profiling function, until you open the new issue we discussed during our meeting yesterday.

qzhu2017 commented 4 years ago

@sayred1 There is a serious bug in your similarity script. I tried to analyze the 1st structure, and it returned a similarity value of 0.19:

[image]

If you compare them by eye, they look very similar; however, the code returned 0.19! Then I tried to plot the following:

[image]

Clearly, there is something wrong with g2_thetas

[image]

Your similarity code assumes that the g2thetas arrays will be identical for both patterns, but they are not in this case.
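One way to drop that assumption, sketched here with scipy (names are illustrative), is to resample both patterns onto the overlap of their 2θ ranges before correlating:

```python
import numpy as np
from scipy.interpolate import interp1d

def to_common_grid(f2thetas, fpeaks, g2thetas, gpeaks, N=1000):
    """Resample two patterns onto one shared 2theta grid.

    The grid spans the overlap of the two x-ranges, so neither
    pattern needs to be extrapolated.
    """
    lo = max(f2thetas.min(), g2thetas.min())
    hi = min(f2thetas.max(), g2thetas.max())
    x = np.linspace(lo, hi, N)
    fy = interp1d(f2thetas, fpeaks, kind='cubic')(x)
    gy = interp1d(g2thetas, gpeaks, kind='cubic')(x)
    return x, fy, gy
```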

I hope next time you can find such bugs by yourself.

sayred1 commented 4 years ago

Okay, so I've preprocessed one of the compared peaks before calculating the similarity:

[screenshot: Screen Shot 2020-01-18 at 4 49 03 PM]

This first structure now gives S = 0.9914625113433583 after fixing the bug. Running the script through the whole dataset gives the following histograms:

[screenshot: Screen Shot 2020-01-18 at 4 58 43 PM]
[screenshot: Screen Shot 2020-01-18 at 4 58 55 PM]

The orthorhombic peak that returns S < 0.5 looks like this:

[screenshot: Screen Shot 2020-01-18 at 5 03 37 PM]

Apparently pxrd didn't do a very good job for this one structure. I'll explore this further.

qzhu2017 commented 4 years ago

@sayred1 Did you update the code?

qzhu2017 commented 4 years ago

I just checked this structure Acmm.cif. It is an unusual structure. You can ignore it.

sayred1 commented 4 years ago

Sorry, I forgot to press send on this comment.

Yes, all the scripts are now uploaded.

qzhu2017 commented 4 years ago

@sayred1 Why not put the interpolate function into the Similarity class?

sayred1 commented 4 years ago

@qzhu2017 this has been updated

qzhu2017 commented 4 years ago

@sayred1 I am not happy with the similarity code. As I said, the code should be as general as possible. The current preprocess code looks very confusing to me. You need to consider that this is a general utility to compute the similarity between two spectra. The two spectra may have different sizes and different ranges of x values. Then you need to determine a common x range and resample both spectra onto it.

I am attaching a prototypical code below, please do the new code based on this version.

import numpy as np
import scipy.integrate as integrate
from scipy import interpolate

class Similarity(object):
    """
    Class to compute the similarity between two diffraction patterns
    """

    def __init__(self, f, g, x_range=None, N=None, weight={'function': 'triangle', 'params': 0.6}):

        """
        Args:

        f: spectra1 (2D array)
        g: spectra2 (2D array)
        x_range: the range of x values used to compute similarity ([x_min, x_max])
        N: number of sampling points for the processed spectra
        weight: weight function used to compute the similarity (dictionary)
        """

        self.f2thetas, self.fpeaks = f
        self.g2thetas, self.gpeaks = g
        self.N = N
        self.x_range = x_range
        self.weight = weight

        #self.scaling = scaling

        # QZ: I don't know why you make the following statement
        # It is not really necessary that these two must have the same dimension.
        #try:
        #    self.fpeaks.shape[0] == self.gpeaks.shape[0]
        #except:
        #    print("Patterns are not the same shape")
        # 
        # self.N = fpeaks.shape[0]
        # self.r = np.linspace(-1, 1, self.N) 

        self.preprocess()
        function = self.weight['function'] 
        if function == 'triangle':
            self.triangleFunction()
        else:
            msg = function + ' is not supported'
            raise NotImplementedError(msg)

    def calculate(self):
        """
        Compute the similarity
        """

        fpeaks_r = self.fpeaks
        f2thetas_r = self.f2thetas + self.r
        gpeaks_r = self.gpeaks
        g2thetas_r = self.g2thetas + self.r

        x_fg = np.concatenate((self.f2thetas, g2thetas_r))
        x_fg = np.linspace(x_fg[0],x_fg[-1],self.N)
        x_ff = np.concatenate((self.f2thetas, f2thetas_r))
        x_ff = np.linspace(x_ff[0],x_ff[-1],self.N)
        x_gg = np.concatenate((self.g2thetas, g2thetas_r))
        x_gg = np.linspace(x_gg[0],x_gg[-1],self.N)

        xCorrfg = integrate.trapz(self.fpeaks*gpeaks_r,x_fg)
        aCorrff = integrate.trapz(self.fpeaks*fpeaks_r,x_ff)
        aCorrgg = integrate.trapz(self.gpeaks*gpeaks_r,x_gg)

        xCorrfg_w = integrate.trapz(self.w*xCorrfg, self.r)
        aCorrff_w = integrate.trapz(self.w*aCorrff, self.r)
        aCorrgg_w = integrate.trapz(self.w*aCorrgg, self.r)

        return xCorrfg_w / np.sqrt(aCorrff_w * aCorrgg_w)

    def preprocess(self):
        """
        preprocess the input f and g spectra
        we do two steps
        1, determine the range of x values (if not specified, use the overlap between f_x and g_x)
        2, transform f/g to f_new/g_new with the same x values
        """
        inter = interpolate.interp1d(self.f2thetas,self.fpeaks,'cubic',fill_value="extrapolate")
        self.f2thetas = np.linspace(np.min(self.g2thetas),np.max(self.g2thetas),self.N)
        self.fpeaks = inter(self.f2thetas)

    def triangleFunction(self):
        w = np.zeros((self.N))
        l = self.weight['params']
        for i in range(self.r.shape[0]):
            r = np.abs(self.r[i])
            if r < l:
                # triangle weight: 1 - r/l inside the window, zero outside
                w[i] = 1 - r / l
            else:
                w[i] = 0
        self.w = w
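
For reference, the weighted cross-correlation similarity this class is aiming at can also be written standalone. A simplified sketch, assuming both spectra already share one x grid; the triangle weight and the default r range of [-1, 1] follow the prototype above:

```python
import numpy as np

def similarity(x, f, g, l=0.6, n_r=201):
    """Weighted cross-correlation similarity of two spectra on one grid x."""
    r_vals = np.linspace(-1.0, 1.0, n_r)
    w = np.clip(1.0 - np.abs(r_vals) / l, 0.0, None)  # triangle weight

    def cross(u, v):
        # c_uv(r) = integral of u(x) * v(x + r) dx, for each offset r
        return np.array([np.trapz(u * np.interp(x + r, x, v, left=0.0, right=0.0), x)
                         for r in r_vals])

    c_fg = np.trapz(w * cross(f, g), r_vals)
    c_ff = np.trapz(w * cross(f, f), r_vals)
    c_gg = np.trapz(w * cross(g, g), r_vals)
    return c_fg / np.sqrt(c_ff * c_gg)
```

Two identical spectra give exactly 1; patterns whose peaks are offset by more than the weighted r window score much lower.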
sayred1 commented 4 years ago

@qzhu2017 thanks for the insight, I'll make these corrections.

sayred1 commented 4 years ago

I noticed that you commented out self.r = np.linspace(-1, 1, self.N) in def __init__(). Was this an accident?

qzhu2017 commented 4 years ago

@sayred1 Maybe there should be a parameter called r_range with a default value of [-1, 1]. This way, user can conveniently modify the r values.
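A sketch of that change, showing only the relevant part of __init__ (r_range is the proposed parameter name):

```python
import numpy as np

class Similarity(object):
    """Skeleton showing only the proposed r_range parameter."""

    def __init__(self, N=100, r_range=(-1.0, 1.0)):
        self.N = N
        # user-adjustable offset range, defaulting to [-1, 1]
        self.r = np.linspace(r_range[0], r_range[1], self.N)
```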

sayred1 commented 4 years ago

@qzhu2017 I've generalized similarity.py, let me know if there is something that I left out.

I've also changed the member variable of XRD.get_profile(). Instead of storing self.gpeaks = gpeaks and self.thetas2 = thetas2, it now stores self.pattern = np.vstack((gpeaks, thetas2)).

qzhu2017 commented 4 years ago

Excellent. I think it looks good now.