qzhu2017 / XRD

X-ray diffraction calculations
MIT License
7 stars · 6 forks

Validation Dataset #10

Closed · qzhu2017 closed this issue 4 years ago

qzhu2017 commented 4 years ago

@sayred1

After you implemented the similarity function, I suggest you collect a database for systematic validation. Similar to what you have done for NaCl, select each representative structure from space group 1-230. Collect the experimental PXRD and compute the PXRD from Mercury and our own code. And then compute the similarity.

I will tell you more details on Friday.

qzhu2017 commented 4 years ago

@sayred1 For the first step, I suggest you do the following

I suppose we should get something close to 1 for each value. Let's see.

qzhu2017 commented 4 years ago

@sayred1

http://rruff.geo.arizona.edu/AMS/amcsd.php

We will worry about the profiling function later.
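In case it helps later, the profiling step amounts to broadening the stick pattern into a continuous curve. A minimal sketch, assuming a Gaussian profile and a user-supplied FWHM in degrees 2θ (the function name and defaults are illustrative, not from the repo):

```python
import numpy as np

def gaussian_profile(two_thetas, intensities, fwhm=0.9, x=None):
    """Broaden a stick pattern into a continuous profile.

    Each reflection becomes a Gaussian centered at its 2theta position,
    with the given full width at half maximum (degrees).
    """
    if x is None:
        x = np.linspace(0.0, 180.0, 5000)
    # convert FWHM to a Gaussian standard deviation
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    y = np.zeros_like(x)
    for t, inten in zip(two_thetas, intensities):
        y += inten * np.exp(-((x - t) ** 2) / (2.0 * sigma ** 2))
    return x, y
```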

sayred1 commented 4 years ago

@qzhu2017 I've finished calculating the similarity over the validation set. In total there are 288/370 structures (not the full set, either because of partial occupancy or because of the website issue I ran into).

Here is the histogram of similarities for all structures:

[screenshot: Screen Shot 2020-01-17 at 5 08 16 PM]

I've also sorted through the structures and classified them into the 7 geometric classes, based on their cell parameters (a, b, c, alpha, beta, gamma). None of the structures from the site were monoclinic, and I've grouped the trigonal and hexagonal structures together.

[screenshot: Screen Shot 2020-01-17 at 5 10 57 PM]

At first glance these plots look promising, since the distribution of similarities in each class is concentrated near 1.0. I can look further into this if need be.
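The bucketing into geometric classes described above can be sketched directly from the cell parameters. This is an illustrative stand-in, not the repo's script: the tolerance is arbitrary, and strictly the crystal system is fixed by the space group, so a metrically pseudo-symmetric cell can fool a purely metric test:

```python
def crystal_system(a, b, c, alpha, beta, gamma, tol=1e-3):
    """Guess the crystal system from cell parameters (angles in degrees).

    Illustrative only: the true system is determined by the space group;
    a metric coincidence (e.g. a pseudo-cubic cell) can mislead this test.
    """
    eq = lambda x, y: abs(x - y) < tol
    right = [eq(ang, 90.0) for ang in (alpha, beta, gamma)]
    if eq(a, b) and eq(b, c):
        if all(right):
            return 'cubic'
        if eq(alpha, beta) and eq(beta, gamma):
            return 'trigonal'        # rhombohedral setting
    if eq(a, b):
        if all(right):
            return 'tetragonal'
        if right[0] and right[1] and eq(gamma, 120.0):
            return 'hexagonal'       # also trigonal in the hexagonal setting
    if all(right):
        return 'orthorhombic'
    if right[0] and right[2]:
        return 'monoclinic'          # unique axis b
    return 'triclinic'
```

With this convention, trigonal cells in the hexagonal setting land in the same bucket as hexagonal ones, matching the grouping used above.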

sayred1 commented 4 years ago

The above was with FWHM = 0.9. When we decrease it, say to 0.2, this is what the distributions look like:

[screenshot: Screen Shot 2020-01-17 at 5 52 14 PM]

Shortly I'll upload some of my scripts. For now I'll begin looking into that other profiling function, until you open the new issue we discussed during our meeting yesterday.

qzhu2017 commented 4 years ago

@sayred1 There is a serious bug in your similarity script. I tried to analyze the 1st structure, and it returned a similarity value of 0.19:

[image]

If you compare them by eye, they look very similar; however, the code returned 0.19! Then I tried to plot the following:

[image]

Clearly, there is something wrong with g2_thetas

[image]

Your similarity code assumes that the g2thetas arrays will be identical for both patterns, but they are not in this case.
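One way to drop that assumption, sketched here with scipy (names are illustrative), is to resample both patterns onto the overlap of their 2θ ranges before correlating:

```python
import numpy as np
from scipy.interpolate import interp1d

def to_common_grid(f2thetas, fpeaks, g2thetas, gpeaks, N=1000):
    """Resample two patterns onto one shared 2theta grid.

    The grid spans the overlap of the two x-ranges, so neither
    pattern needs to be extrapolated.
    """
    lo = max(f2thetas.min(), g2thetas.min())
    hi = min(f2thetas.max(), g2thetas.max())
    x = np.linspace(lo, hi, N)
    fy = interp1d(f2thetas, fpeaks, kind='cubic')(x)
    gy = interp1d(g2thetas, gpeaks, kind='cubic')(x)
    return x, fy, gy
```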

I hope next time you can find such bugs by yourself.

sayred1 commented 4 years ago

Okay, so I've preprocessed one of the compared peaks before calculating the similarity:

[screenshot: Screen Shot 2020-01-18 at 4 49 03 PM]

This first structure now gives S = 0.9914625113433583 after fixing the bug. Running the script through the whole dataset gives the following histograms:

[screenshot: Screen Shot 2020-01-18 at 4 58 43 PM]
[screenshot: Screen Shot 2020-01-18 at 4 58 55 PM]

The orthorhombic peak that returns S < 0.5 looks like this:

[screenshot: Screen Shot 2020-01-18 at 5 03 37 PM]

Apparently pxrd didn't do a very good job for this one structure. I'll explore this further.

qzhu2017 commented 4 years ago

@sayred1 Did you update the code?

qzhu2017 commented 4 years ago

I just checked this structure Acmm.cif. It is an unusual structure. You can ignore it.

sayred1 commented 4 years ago

Sorry, I forgot to press send on this comment.

Yes, all the scripts are now uploaded.

qzhu2017 commented 4 years ago

@sayred1 Why not put the interpolate function into the Similarity class?

sayred1 commented 4 years ago

@qzhu2017 this has been updated

qzhu2017 commented 4 years ago

@sayred1 I am not happy with the similarity code. As I said, the code should be as general as possible. The current preprocess code looks very confusing to me. You need to consider that this is a general utility to compute the similarity between two spectra. The two spectra may have different sizes and different ranges of x values. Then you need to determine a common x range and resample both spectra onto it.

I am attaching a prototypical code below, please do the new code based on this version.

import numpy as np
import scipy.integrate as integrate
from scipy import interpolate

class Similarity(object):
    """
    Class to compute the similarity between two diffraction patterns
    """

    def __init__(self, f, g, x_range=None, N=None, weight={'function': 'triangle', 'params': 0.6}):

        """
        Args:

        f: spectra1 (2D array)
        g: spectra2 (2D array)
        x_range: the range of x values used to compute similarity ([x_min, x_max])
        N: number of sampling points for the processed spectra
        weight: weight function used to compute the similarity (dictionary)
        """

        self.f2thetas, self.fpeaks = f
        self.g2thetas, self.gpeaks = g
        self.N = N
        self.x_range = x_range
        self.weight = weight

        #self.scaling = scaling

        # QZ: I don't know why you make the following statement
        # It is not really necessary that these two must have the same dimension.
        #try:
        #    self.fpeaks.shape[0] == self.gpeaks.shape[0]
        #except:
        #    print("Patterns are not the same shape")
        # 
        # self.N = fpeaks.shape[0]
        # self.r = np.linspace(-1, 1, self.N) 

        self.preprocess()
        function = self.weight['function'] 
        if function == 'triangle':
            self.triangleFunction()
        else:
            msg = function + ' is not supported'
            raise NotImplementedError(msg)

    def calculate(self):
        """
        Compute the similarity
        """

        fpeaks_r = self.fpeaks
        f2thetas_r = self.f2thetas + self.r
        gpeaks_r = self.gpeaks
        g2thetas_r = self.g2thetas + self.r

        x_fg = np.concatenate((self.f2thetas, g2thetas_r))
        x_fg = np.linspace(x_fg[0],x_fg[-1],self.N)
        x_ff = np.concatenate((self.f2thetas, f2thetas_r))
        x_ff = np.linspace(x_ff[0],x_ff[-1],self.N)
        x_gg = np.concatenate((self.g2thetas, g2thetas_r))
        x_gg = np.linspace(x_gg[0],x_gg[-1],self.N)

        xCorrfg = integrate.trapz(self.fpeaks*gpeaks_r,x_fg)
        aCorrff = integrate.trapz(self.fpeaks*fpeaks_r,x_ff)
        aCorrgg = integrate.trapz(self.gpeaks*gpeaks_r,x_gg)

        xCorrfg_w = integrate.trapz(self.w*xCorrfg, self.r)
        aCorrff_w = integrate.trapz(self.w*aCorrff, self.r)
        aCorrgg_w = integrate.trapz(self.w*aCorrgg, self.r)

        return xCorrfg_w / np.sqrt(aCorrff_w * aCorrgg_w)

    def preprocess(self):
        """
        preprocess the input f and g spectra
        we do two steps
        1, determine the range of x values (if not specified, use the overlap between f_x and g_x)
        2, transform f/g to f_new/g_new with the same x values
        """
        inter = interpolate.interp1d(self.f2thetas,self.fpeaks,'cubic',fill_value="extrapolate")
        self.f2thetas = np.linspace(np.min(self.g2thetas),np.max(self.g2thetas),self.N)
        self.fpeaks = inter(self.f2thetas)

    def triangleFunction(self):
        w = np.zeros((self.N))
        l = self.weight['params']
        for i in range(self.r.shape[0]):
            r = np.abs(self.r[i])
            if r < l:
                # triangle weight: 1 - r/l inside the window, zero outside
                w[i] = 1 - r / l
            else:
                w[i] = 0
        self.w = w
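
For reference, the weighted cross-correlation similarity this class is aiming at can also be written standalone. A simplified sketch, assuming both spectra already share one x grid; the triangle weight and the default r range of [-1, 1] follow the prototype above:

```python
import numpy as np

def similarity(x, f, g, l=0.6, n_r=201):
    """Weighted cross-correlation similarity of two spectra on one grid x."""
    r_vals = np.linspace(-1.0, 1.0, n_r)
    w = np.clip(1.0 - np.abs(r_vals) / l, 0.0, None)  # triangle weight

    def cross(u, v):
        # c_uv(r) = integral of u(x) * v(x + r) dx, for each offset r
        return np.array([np.trapz(u * np.interp(x + r, x, v, left=0.0, right=0.0), x)
                         for r in r_vals])

    c_fg = np.trapz(w * cross(f, g), r_vals)
    c_ff = np.trapz(w * cross(f, f), r_vals)
    c_gg = np.trapz(w * cross(g, g), r_vals)
    return c_fg / np.sqrt(c_ff * c_gg)
```

Two identical spectra give exactly 1; patterns whose peaks are offset by more than the weighted r window score much lower.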
sayred1 commented 4 years ago

@qzhu2017 thanks for the insight, I'll make these corrections.

sayred1 commented 4 years ago

I noticed that you commented out self.r = np.linspace(-1, 1, self.N) in def __init__(). Was this an accident?

qzhu2017 commented 4 years ago

@sayred1 Maybe there should be a parameter called r_range with a default value of [-1, 1]. This way, user can conveniently modify the r values.
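A sketch of that change, showing only the relevant part of __init__ (r_range is the proposed parameter name):

```python
import numpy as np

class Similarity(object):
    """Skeleton showing only the proposed r_range parameter."""

    def __init__(self, N=100, r_range=(-1.0, 1.0)):
        self.N = N
        # user-adjustable offset range, defaulting to [-1, 1]
        self.r = np.linspace(r_range[0], r_range[1], self.N)
```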

sayred1 commented 4 years ago

@qzhu2017 I've generalized similarity.py, let me know if there is something that I left out.

I've also changed the member variable of XRD.get_profile(). Instead of storing self.gpeaks = gpeaks and self.thetas2 = thetas2, it now stores self.pattern = np.vstack((gpeaks, thetas2)).

qzhu2017 commented 4 years ago

Excellent. I think it looks good now.