Incorrect reading of two openly available test datasets in .ang file format

hakonanes commented 1 year ago

See #411 for issues with .ang files being read incorrectly into a CrystalMap:

Scan units should be "um", not "nm"
Phase IDs of the AF96 dataset are incorrectly read

argerlt commented 1 year ago

Here's the code snippet for loading the AF96 datasets correctly. I don't know enough about the TSL OIM software used to collect this data to know whether ALL EBSD scans from TSL can be read like this, or whether there are user choices that change the ordering of columns. I'm also not sure how best to make orix determine the correct phase data, or whether that should be left to orix users to add.

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 10 10:26:48 2022

@author: agerlt
"""

import numpy as np
import glob
from orix.quaternion import Rotation
from diffpy.structure import Atom, Lattice, Structure
from orix.crystal_map import CrystalMap, PhaseList
from orix import io
import os

# Create the output directory, or clear out the files from the last run
os.makedirs("AF96", exist_ok=True)
for old in glob.glob("AF96/AF96_Large*.h5"):
    os.remove(old)

# Sort the file names so the output numbering is reproducible
angs = sorted(glob.glob("AF96_Large*.ang"))
xmaps = []

for ang in angs:

    # Load the data
    e1, e2, e3, x, y, image_quality, confidence_index, phase, indexed, \
        fit_parameter = np.loadtxt(ang, unpack=True)

    # Make an orix .h5 file.
    eu = np.column_stack((e1, e2, e3))
    rots = Rotation.from_euler(eu)
    properties = dict(iq=image_quality.astype(np.float32),
                      ci=confidence_index.astype(np.float32),
                      fit_parameter=fit_parameter.astype(np.float32))
    # Create unit cells of the phases
    structures = [
        Structure(
            title="austenite",
            atoms=[Atom("fe", [0] * 3)],
            lattice=Lattice(0.360, 0.360, 0.360, 90, 90, 90),
        ),
        Structure(
            title="ferrite",
            atoms=[Atom("fe", [0] * 3)],
            lattice=Lattice(0.287, 0.287, 0.287, 90, 90, 90),
        ),
    ]

    phase_list = PhaseList(
        names=["austenite", "ferrite"],
        point_groups=["m-3m", "m-3m"],
        space_groups=[225, 229],
        structures=structures,
    )
    # Create a CrystalMap instance
    xmap = CrystalMap(
        rotations=rots,
        phase_id=phase.astype(np.int32),
        x=x.astype(np.float32),
        y=y.astype(np.float32),
        phase_list=phase_list,
        prop=properties,
    )
    xmaps.append(xmap)

print("Saving everything as orix .h5 files...")
# Save every map that was loaded, rather than assuming there are exactly five
for i, xmap in enumerate(xmaps, start=1):
    io.save("AF96/AF96_Large_{}.h5".format(i), xmap)
print("Done!")

argerlt commented 1 year ago

Also, attached is all the information on the collection software, taken from the following paper https://doi.org/10.1016/j.matchar.2019.109835

hakonanes commented 1 year ago

Thanks for pointing me to these test datasets, @argerlt. I've fixed the identified issues in #416, and hope to release it in a 0.10.3 patch next week.

In your code snippet above, you assume the following column names for the .ang file data:

e1, e2, e3, x, y, image_quality, confidence_index, phase, indexed, fit_parameter

I'm not sure about the ninth column, "indexed". What do you base this name on? In the file I've tested, Field of view 1_EBSD data_Raw.ang, this column contains only ones. However, in all other .ang files I've read before, un-indexed points are identified by a confidence index (CI) of -1 and a pattern fit of 180 degrees. 29 points in the mentioned .ang file have a CI of -1, i.e. they are identified as un-indexed. Thus, I believe the ninth column contains some other data. But I have no idea what, so I've named the data "unknown1" in the returned CrystalMap.prop dictionary.
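A minimal sketch of the convention described above, where un-indexed points carry a CI of -1. The rows are made up for illustration (they are not taken from the AF96 files), and the helper name `count_unindexed` is hypothetical rather than part of orix.

```python
# Hypothetical rows in the ten-column order discussed in this thread:
# phi1, Phi, phi2, x, y, iq, ci, phase, ninth column, fit
rows = [
    (0.10, 0.20, 0.30, 0.0, 0.0, 120.0, 0.85, 1, 1, 0.95),
    (0.00, 0.00, 0.00, 0.5, 0.0, 15.0, -1.0, 0, 1, 180.0),
    (1.10, 0.70, 2.30, 1.0, 0.0, 98.0, 0.40, 2, 1, 1.40),
]

CI_COLUMN = 6  # zero-based index of the confidence index


def count_unindexed(rows, ci_column=CI_COLUMN):
    """Count rows whose confidence index equals -1 (un-indexed points)."""
    return sum(1 for row in rows if row[ci_column] == -1)


n_unindexed = count_unindexed(rows)

# Collect the distinct values found in the ninth column
ninth_column_values = {row[8] for row in rows}
```

In this toy data the ninth column is 1 in every row, including the un-indexed one, which is the same pattern that rules out "indexed" as its name.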

argerlt commented 1 year ago

You're right, that was a mistake on my part. The correct name is either "SEM signal", "detector signal", or just "sem"; the value is left as 1 if no corresponding SEM data is included.

Looking inside MTEX's .ang reader, I believe this case lines up with their description of "version 5" (line 113):

  % we need to guess one of the following conventions
  % Euler 1 Euler 2 Euler 3 X Y IQ CI Phase SEM_signal Fit
  % Euler 1 Euler 2 Euler 3 X Y IQ CI Fit phase
  % Euler 1 Euler 2 Euler 3 X Y IQ CI Fit unknown1 unknown2 phase
  % most important is the position of the phase

  % for future reference:
  % the following is taken from a recent .ang file - some new files might 
  % actually state the version in the header
  %
  % # NOTES: Start
  % # Version 1: phi1, PHI, phi2, x, y, iq (x*=0.1 & y*=0.1)
  % # Version 2: phi1, PHI, phi2, x, y, iq, ci
  % # Version 3: phi1, PHI, phi2, x, y, iq, ci, phase
  % # Version 4: phi1, PHI, phi2, x, y, iq, ci, phase, sem
  % # Version 5: phi1, PHI, phi2, x, y, iq, ci, phase, sem, fit
  % # Version 6: phi1, PHI, phi2, x, y, iq, ci, phase, sem, fit, PRIAS Bottom Strip, PRIAS Center Square, PRIAS Top Strip, Custom Value
  % # Version 7: phi1, PHI, phi2, x, y, iq, ci, phase, sem, fit. PRIAS, Custom, EDS and CMV values included if valid
  % # Phase index: 0 for single phase, starting at 1 for multiphase
  % # CMV = Correlative Microscopy value
  % # EDS = cumulative counts over a specific range of energies
  % # SEM = any external detector signal but usually the secondary electron detector signal
  % # NOTES: End
  %

My two cents: Asking around in my lab, it seems the TSL .ang file format has changed somewhat over the years, as has Oxford's. Thus, when trying to write a generic EBSD_loader, it seems the best practice would be to make a list of all possible formats, then pare it down based on the number of columns, whether columns contain integer or float data, etc.
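A rough sketch of that paring-down idea, under a few assumptions: the candidate layouts are copied from the MTEX comments quoted above, the labels and function names are invented for this example, and the only data-type check is that the phase column must hold whole numbers.

```python
# Columns shared by every candidate layout
BASE = ("phi1", "Phi", "phi2", "x", "y", "iq", "ci")

# (label, column names); labels are made up for this sketch
CANDIDATES = [
    ("version 3", BASE + ("phase",)),
    ("version 4", BASE + ("phase", "sem")),
    ("fit-then-phase", BASE + ("fit", "phase")),
    ("version 5", BASE + ("phase", "sem", "fit")),
]


def is_integer_column(rows, index):
    """Return True if every value in the given column is a whole number."""
    return all(float(row[index]).is_integer() for row in rows)


def pare_down(rows):
    """Return the labels of all layouts consistent with the data rows."""
    n_columns = len(rows[0])
    survivors = []
    for label, names in CANDIDATES:
        if len(names) != n_columns:
            continue  # wrong number of columns
        if not is_integer_column(rows, names.index("phase")):
            continue  # phase IDs must be integers
        survivors.append(label)
    return survivors


# Nine columns where the eighth holds a float pattern fit
rows_fit_phase = [
    (0.1, 0.2, 0.3, 0.0, 0.0, 120.0, 0.85, 0.95, 1),
    (1.1, 0.7, 2.3, 0.5, 0.0, 98.0, 0.40, 1.40, 2),
]

# Nine columns where the last two are both whole numbers
rows_ambiguous = [
    (0.1, 0.2, 0.3, 0.0, 0.0, 120.0, 0.85, 1, 1),
    (1.1, 0.7, 2.3, 0.5, 0.0, 98.0, 0.40, 2, 1),
]
```

With `rows_fit_phase` only the "fit-then-phase" layout survives. With `rows_ambiguous` both nine-column layouts survive, since an SEM column of ones looks just like a phase column, so a heuristic of this kind can narrow the list without always settling on one answer.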

That said, creating a comprehensive "if/then/else" tree for every oddball format sounds exhausting, and so far for me, saying "if 10 columns, assume phi1, Phi, phi2, x, y, iq, ci, phase_id, detector_signal, pattern_fit" has yet to fail, so such a function might only need to be included if and when a test case is found that orix mishandles.

hakonanes commented 1 year ago

[...] it seems the best practice would be to make a list of all possible formats, then pare it down based on column number

This is what the current reader does. See the updated list of possible columns in #416. Since ASTAR, EMsoft and orix have unique footprints in their .ang file headers, if none of these footprints are found, we assume the file was written by EDAX TSL. Then, we determine the column names based on the number of columns available. Reading EDAX TSL .ang files with 10 or 15 columns should now work. The reader will fail as in #411 if a file with another number of columns is read. But it should not fail silently (as demonstrated), so we can improve it further when that happens.
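The logic described above could be sketched roughly as follows. This is not the orix implementation: the footprint strings are placeholders, the function names are hypothetical, and only the 10-column layout is listed because the 15-column names are not spelled out in this thread.

```python
# Header substrings used to recognise non-EDAX writers. These strings are
# placeholders for this sketch, not the footprints orix actually checks.
FOOTPRINTS = {
    "astar": "ACOM",
    "emsoft": "EMsoft",
    "orix": "Column names:",
}

# Column names follow the "version 5" note quoted from MTEX; "sem" is the
# name suggested in this thread for the ninth column.
TSL_COLUMNS = {
    10: ("phi1", "Phi", "phi2", "x", "y", "iq", "ci", "phase_id", "sem", "fit"),
}


def guess_vendor(header_lines):
    """Return the first vendor whose footprint is in the header, else 'tsl'."""
    for vendor, footprint in FOOTPRINTS.items():
        if any(footprint in line for line in header_lines):
            return vendor
    return "tsl"


def tsl_column_names(n_columns):
    """Look up column names by column count, failing loudly if unknown."""
    try:
        return TSL_COLUMNS[n_columns]
    except KeyError:
        raise ValueError(
            f"No known EDAX TSL layout with {n_columns} columns"
        ) from None
```

Raising an error for an unknown column count, rather than guessing, is what keeps a new file layout from being misread silently.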

I consider this fixed once #416 is merged.

argerlt commented 1 year ago

Ah, you are right.

In that case, my feedback is just "I believe unknown1 should be changed to sem". Apologies for the long walk to a short answer.

"If I had had more time I would have written a shorter letter"

Mark Twain

pyxem / orix

Incorrect reading of two openly available test datasets in .ang file format #413