Investigate single-crystal peak finding with CNNs and simulated data

RichardWaiteSTFC commented 10 months ago

Describe the outcome that is desired. See if CNN can be trained on (fairly realistic) simulated WISH data and then applied to real data.

Does the outcome relate directly to a problem? Please describe. Want robust method to find single-crystal peaks, but unrealistic to be able to get ~1000s of labelled datasets.

Describe any solutions you are considering Simulate data with known UB so we know exactly which peaks are present and where.

Here is some code I have knocked up to do this (using a WISH dataset of a powder sample, that according to the title was taken with same resolution settings as typical single-crystal measurement) - no warranty included!

from mantid.simpleapi import *
import matplotlib.pyplot as plt
import numpy as np
from IntegratePeaksSkew import InstrumentArrayConverter, PeakData
from mantid.api import FunctionFactory

# powder taken with MR-SF settings (same typically used for single-crystal)
ws = Load(Filename='/archive/ndxwish/Instrument/data/cycle_23_4/WISH00056081.raw', OutputWorkspace='WISH00056081')
ws = ConvertUnits(InputWorkspace=ws, OutputWorkspace=ws.name(), Target='dSpacing')

# set a UB matrix (taken from PdCrO2 - could use any!)
ub = np.array([[ 0.08749108,  0.37636055, -0.00169743],
               [-0.01788245,  0.00531973,  0.05517389],
               [ 0.383845  ,  0.11677283,  0.00295732]])
SetUB(ws, UB=ub)
# set goniometer (roatets UB matrix by 5 degrees around vertical axis - can add more generate a series of datasets!)
SetGoniometer(ws, Axis0="5,0,1,0,-1")

# predict peaks
peaks = PredictPeaks(InputWorkspace=ws, WavelengthMin=0.8, WavelengthMax=10, MinDSpacing=0.8, MaxDSpacing=15, 
                     ReflectionCondition='Primitive', OutputWorkspace='peaks')

# loop over predicted peaks and simulate data
array_converter = InstrumentArrayConverter(ws)
minI, maxI = 5,50
nfwhm=5
sig_rows = 1
sig_cols = 2
np.random.seed(1)
for ipk, pk in enumerate(peaks):
    print("ipk = ", ipk)
    # get data array in window around peak region
    peak_data = array_converter.get_peak_data(pk, pk.getDetectorID(), peaks.column("BankName")[ipk], 
                                              15, 15, nrows_edge=1, ncols_edge=1)
    ispecs = np.array(ws.getIndicesFromDetectorIDs([int(d) for d in peak_data.detids.flatten()])).reshape(peak_data.detids.shape)
    # get arrays representing distance of each pixel from peak position 
    rows, cols = np.meshgrid(np.arange(ispecs.shape[1])-peak_data.icol, np.arange(ispecs.shape[0])-peak_data.irow)  # cols, rows
    gaussian = (1/(2*np.pi*sig_rows*sig_cols))*np.exp(-0.5*(rows/sig_rows)**2 -0.5*(cols/sig_cols)**2 )
    # estimate parameters
    func = FunctionFactory.Instance().createPeakFunction("BackToBackExponential")
    func.setParameter("X0", pk.getDSpacing())  # set centre
    func.setMatrixWorkspace(ws, int(ispecs[peak_data.irow, peak_data.icol]), 0.0, 0.0)
    func_callable = FunctionWrapper(func)
    # loop over all spectra and simulate peak in given d-spacing range
    fwhm = func.fwhm()
    dmin = pk.getDSpacing() - nfwhm*fwhm
    dmax = pk.getDSpacing() + nfwhm*fwhm
    intensity = minI + np.random.rand()*(maxI-minI)
    for irow in range(ispecs.shape[0]):
        for icol in range(ispecs.shape[1]):
            ispec = int(ispecs[irow, icol])
            # get d-spacing range to evaluate peak
            istart = ws.yIndexOfX(dmin, ispec)
            iend = ws.yIndexOfX(dmax, ispec)
            # get d-spacing values
            dspac = ws.readX(ispec)
            dspac = 0.5*(dspac[istart:iend] + dspac[istart+1:iend+1])  # convert to bin centers from edgess
            # simulate data and add to workspace
            func_callable.setParameter('I', gaussian[irow,icol]*intensity)
            ws.dataY(ispec)[istart:iend] += np.random.poisson(lam=func_callable(dspac))

# convert back to TOF
ConvertUnits(InputWorkspace=ws, OutputWorkspace=ws.name(), Target='TOF')

Peaks are simulated using a BackToBackExponential with realistic parameters (that do depend on d-spacing and scattering angle) with a Gaussian distribution on the detector (currently same width and height for all peaks). Note the peak is simulated in d-spacing and then transformed to TOF - so this will account for peak centre being different in adjacent pixels due to changing scattering angle and L2. The integrated intensity of the peak is randomly generated between two limits. Poisson noise is applied to the simulated data before adding to the workspace.

This can be made more realistic if:

Intensities based on structure factor of real crystals and scaled by incident flux
Width and height of peaks should vary with scattering angle (currently the same for all detectors)

But lets give it a go as is, and see how it performs for real data!

Note additional data sets can be simulated by:

Changing the UB matrix used
Rotating the UB (using SetGoniometer)
Changing the random seed (will only change intensities not peak positions)
Add more noise to the peaks and backgroud
Using a different workspace as a starting point (e.g. powder lines in different places, different background level)
Change the centering (ReflectionCondition) in PredictPeaks (see algorithm documentation)

Once the data has been simulated the workspace can be converted to a numpy array for input to various machine learning codes using this code:

https://github.com/mantidproject/scriptrepository/blob/1712adf4119800aaf771ccd9ca1385c443355f6f/diffraction/WISH/bragg-detect/process_input_and_output_WISH_ML_peak_finding.py

The above file also has code to convert from indexing the array to spectrum index (note in WISH there are 4 detectors for every spectrum - so you should use the spectrum index instead of detector ID when comparing peaks). This will be required when training the CNN to check whether the found peaks exist of not.

Additional context Details of UB matrices and indexing can be found here https://docs.mantidproject.org/v6.1.0/concepts/Lattice.html

RichardWaiteSTFC commented 9 months ago

UB matrices (saved as .mat) - can be loaded onto a workspace by using LoadISawUB algorithm UBs_for_Waruna.zip

RichardWaiteSTFC commented 9 months ago

Additional files to use as background: Vanadium runs (no powder lines) taken with same resolution settings as single-crystal, but they have unusually good stats

55766
56083

Powder runs counted for ~10 muAh, taken with different resolution (HR-SF) compared typical single-crystal experiments (MR-SX). But should still be worth using.

55859 (big sample - so probably quite strong powder lines)
55986 (small sample probably quite weak powder lines)

You can ask WISH team for more runs if you need them - but you can create 1000s of simulations just using the run you have and changing goniometer rotations, UB matrices random seeds, reflection condition/centering as described in the issue

RichardWaiteSTFC commented 9 months ago

To label the data sets you have 2-3 options:

Reverse engineer the functions findSpectrumIndex and createPeaksWorkspaceFromIndices in the script I mentioned above

https://github.com/mantidproject/scriptrepository/blob/1712adf4119800aaf771ccd9ca1385c443355f6f/diffraction/WISH/bragg-detect/process_input_and_output_WISH_ML_peak_finding.py

to go the opposite way (i.e. take information in the peak table - namely DetID and TOF) and convert it to indices in the array. I.e. convert the detector ID to a spectrum index and convert this index to the array indices in the first two dimensions (see findSpectrumIndex) and then use the TOF to get the index along the third dimension.

Or you can simulate the dataset with/without the background run and use a boolean mask (intensity > 0 or small threshold) on the simulation without background to label the data (in this case categorise/classify elements in the array as being peak or non-peak). This may not work as well if the peaks are overlapping.

To get the simulated data only with 0 background, replace this line ws.dataY(ispec)[istart:iend] += np.random.poisson(lam=func_callable(dspac)) with something like

y = np.zeros(ws.dataY(ispec).shape)
y[istart:iend] += np.random.poisson(lam=func_callable(dspac))
ws.setY(ispec, y)

be careful to set the random seed as this might give different answers, or have two workspace (background and non-background) that you add the same simulated data to.

You would need to train the CNN on the simulation with the background, but you could get your labels from the simulation without the background.

Can use (2) but then get the position of the maxima in each contiguous peak region by a combination of scipy.ndimage.label and 'scipy.ndimage.maximum_position`

RichardWaiteSTFC commented 9 months ago

If you want to try this on real data to find peaks, use runs 42730-42733

RichardWaiteSTFC commented 9 months ago

I have adjusted the code to generate the labels at the point the peaks is simulated. It produces a csv file with each row corresponding to a row in the peak table (it contains spectrum index, index of TOF bin/channel for peak centre, min and max). This should be sufficient to train the algorithm to detect peaks in 2D detector images at constant TOF as discussed.

In addition I call Rebunch on the data that increases the bin width by a factor 3 (i.e. there are only 1540 TOF bins) - we should still have multiple TOF bins per peak.

If you want to further reduce the data size you can treat each half of the WISH instrument separately (i.e. panels 1-5 and panels 6-10).

# import mantid algorithms, numpy and matplotlib
from mantid.simpleapi import *
import matplotlib.pyplot as plt
import numpy as np
from IntegratePeaksSkew import InstrumentArrayConverter, PeakData
from mantid.api import FunctionFactory

# powder taken with MR-SF settings (same typically used for single-crystal)
ws = Load(Filename='/archive/ndxwish/Instrument/data/cycle_23_4/WISH00056081.raw', OutputWorkspace='WISH00056081')
# reduce size of data and improve stats - increase TOF bin-width by factor 3 (typically fine for peak finding on WISH)
ws = Rebunch(InputWorkspace=ws, NBunch=3, OutputWorkspace=ws.name())
ws = ConvertUnits(InputWorkspace=ws, OutputWorkspace=ws.name(), Target='dSpacing')

# set a UB matrix (taken from PdCrO2 - could use any!)
ub = np.array([[ 0.08749108,  0.37636055, -0.00169743],
               [-0.01788245,  0.00531973,  0.05517389],
               [ 0.383845  ,  0.11677283,  0.00295732]])
SetUB(ws, UB=ub)
# set goniometer (roatets UB matrix by 5 degrees around vertical axis - can add more generate a series of datasets!)
SetGoniometer(ws, Axis0="5,0,1,0,-1")

# predict peaks
peaks = PredictPeaks(InputWorkspace=ws, WavelengthMin=0.8, WavelengthMax=10, MinDSpacing=0.8, MaxDSpacing=15, 
                     ReflectionCondition='Primitive', OutputWorkspace='peaks')

# loop over predicted peaks and simulate data
# calculate structure factor
array_converter = InstrumentArrayConverter(ws)
minI, maxI = 5,50
nfwhm=6
sig_rows = 1
sig_cols = 2
np.random.seed(1)
# each row in labels
peak_labels = np.zeros((peaks.getNumberPeaks(), 4)) # each row contains [ispec, TOF index of peak cen, TOF index of peak min, TOF index of peak max]
frac_cutoff = 0.1  # cutoff in integrated intensity used to find peak extents for labelling
for ipk, pk in enumerate(peaks):
    print("ipk = ", ipk)
    # get data array in window around peak region
    peak_data = array_converter.get_peak_data(pk, pk.getDetectorID(), peaks.column("BankName")[ipk], 
                                              15, 15, nrows_edge=1, ncols_edge=1)
    ispecs = np.array(ws.getIndicesFromDetectorIDs([int(d) for d in peak_data.detids.flatten()])).reshape(peak_data.detids.shape)
    # get arrays representing distance of each pixel from peak position on the detector/instrument view
    rows, cols = np.meshgrid(np.arange(ispecs.shape[1])-peak_data.icol, np.arange(ispecs.shape[0])-peak_data.irow)  # cols, rows
    gaussian = (1/(2*np.pi*sig_rows*sig_cols))*np.exp(-0.5*(rows/sig_rows)**2 -0.5*(cols/sig_cols)**2 )
    # estimate parameters
    func = FunctionFactory.Instance().createPeakFunction("BackToBackExponential")
    func.setParameter("X0", pk.getDSpacing())  # set centre
    func.setMatrixWorkspace(ws, int(ispecs[peak_data.irow, peak_data.icol]), 0.0, 0.0)
    func_callable = FunctionWrapper(func)
    # loop over all spectra and simulate peak in given d-spacing range
    fwhm = func.fwhm()
    dmin = pk.getDSpacing() - nfwhm*fwhm
    dmax = pk.getDSpacing() + nfwhm*fwhm
    intensity = minI + np.random.rand()*(maxI-minI)
    for irow in range(ispecs.shape[0]):
        for icol in range(ispecs.shape[1]):
            ispec = int(ispecs[irow, icol])
            # get d-spacing range to evaluate peak
            istart = ws.yIndexOfX(dmin, ispec) if dmin > ws.readX(ispec)[0] else 0
            iend = ws.yIndexOfX(dmax, ispec) if dmax < ws.readX(ispec)[0] else ws.blocksize()
            # get d-spacing values
            dspac = ws.readX(ispec)
            dspac = 0.5*(dspac[istart:iend] + dspac[istart+1:iend+1])  # convert to bin centers from edgess
            # simulate data and add to workspace
            func_callable.setParameter('I', gaussian[irow,icol]*intensity)
            ypk = func_callable(dspac)
            if irow == peak_data.irow and icol == peak_data.icol:
                # spectrum corresponding to predicted peak position
                # find peak extents for labeling - use cutoff in integrated intensity
                icen = ws.yIndexOfX(pk.getDSpacing(), ispec)
                y_cdf = np.cumsum(ypk)  # cumulative distribtuion function
                y_cdf /= y_cdf[-1]
                ilo = np.clip(istart + np.argmin(abs(y_cdf - frac_cutoff)), a_min=0, a_max=icen)
                ihi = np.clip(istart + np.argmin(abs(y_cdf - (1-frac_cutoff))), a_min=icen, a_max=ws.blocksize())
                peak_labels[ipk] = [ispec, icen, ilo, ihi]
            ws.dataY(ispec)[istart:iend] += np.random.poisson(lam=ypk)

# convert back to TOF
ConvertUnits(InputWorkspace=ws, OutputWorkspace=ws.name(), Target='TOF')
# save labels to csv
np.savetxt(r"/mnt/babylon/Public/RWaite/labels.txt", peak_labels, fmt='%d', delimiter=",")

warunawickramasingha commented 9 months ago

iend = ws.yIndexOfX(dmax, ispec) if dmax < ws.readX(ispec)[0] else ws.blocksize()

Thank you very much for this code Richard!

Having debugged it, I guess this line

iend = ws.yIndexOfX(dmax, ispec) if dmax < ws.readX(ispec)[0] else ws.blocksize()

should be corrected as below with [-1] index if I am not mistaken?

iend = ws.yIndexOfX(dmax, ispec) if dmax < ws.readX(ispec)[-1] else ws.blocksize()

RichardWaiteSTFC commented 9 months ago

Yep indeed, sorry for typo was in a rush!

RichardWaiteSTFC commented 4 months ago

This was very successful, great work! I'm closing this issue and have put a new issue in for the next steps #37527

mantidproject / mantid

Investigate single-crystal peak finding with CNNs and simulated data #36590