EnumerateStructureTransformation does not return structures with correct stoichiometries

ardunn commented 6 years ago

Describe the bug Using EnumerateStructureTransformation with the max_disordered_sites parameter specified does not return structures with correct stoichometry. For example, with the attached cif file for Cu2.8Te2 the output structures should be Cu7Te5; however, the output structures are Cu3Te2.

To Reproduce

from pymatgen import Structure
from pymatgen.transformations.advanced_transformations import EnumerateStructureTransformation
from pymatgen.alchemy.transmuters import StandardTransmuter

s = Structure.from_file("cu7te5.cif")
print("Original Structure")
print(s)

orderer = EnumerateStructureTransformation(max_cell_size=None, max_disordered_sites=100)
stm = StandardTransmuter.from_structures([s], [orderer], extend_collection=10)
structures = stm.transformed_structures
for s in structures:
    print(s.final_structure)

The original structure is:

Original Structure
Full Formula (Cu2.8 Te2)
Reduced Formula: Cu2.8Te2
abc   :   3.980000   3.980000   6.120000
angles:  90.000000  90.000000  90.000000
Sites (6)
  #  SP           a     b      c
---  --------  ----  ----  -----
  0  Cu        0.75  0.25  0
  1  Cu        0.25  0.75  0
  2  Cu:0.400  0.25  0.25  0.27
  3  Cu:0.400  0.75  0.75  0.73
  4  Te        0.25  0.25  0.715
  5  Te        0.75  0.75  0.285

The output structures are:

Full Formula (Cu3 Te2)
Reduced Formula: Cu3Te2
abc   :   3.980000   3.980000   6.120000
angles:  90.000000  90.000000  90.000000
Sites (5)
  #  SP       a     b      c
---  ----  ----  ----  -----
  0  Cu    0.75  0.25  1
  1  Cu    0.25  0.75  0
  2  Cu    0.25  0.25  0.27
  3  Te    0.25  0.25  0.715
  4  Te    0.75  0.75  0.285

Full Formula (Cu6 Te4)
Reduced Formula: Cu3Te2
abc   :   5.628570   5.628570   7.300329
angles:  67.325145  67.325145  90.000000
Sites (10)
  #  SP         a       b      c
---  ----  ------  ------  -----
  0  Cu    0.25    0       0
  1  Cu    0.75    0.5     1
  2  Cu    0.75    0       0
  3  Cu    0.25    0.5     1
  4  Cu    0.865   0.615   0.27
  5  Cu    0.635   0.885   0.73
  6  Te    0.1425  0.8925  0.715
  7  Te    0.6425  0.3925  0.715
  8  Te    0.3575  0.6075  0.285
  9  Te    0.8575  0.1075  0.285
...

Expected behavior The final structures should be Cu7Te5, not Cu3Te2. Note that using EnumerateStructureTransformation(max_cell_size=5) returns correct stoichiometries but the code above does not.

Desktop (please complete the following information):

OS: OSX 10.13.6
Python: 3.6.2
Pymatgen version: 2018.9.12

Additional context The cu7te5.cif file is:

##############################################################################
#                                                                            #
# Cu-Te            # Cu1.4Te ht1                                   # 2040845 #
#                                                                            #
##############################################################################
#                                                                            #
#                           Pearson's Crystal Data                           #
#      Crystal Structure Database for Inorganic Compounds (on DVD)           #
#                              Release 2017/18                               #
#                  Editors: Pierre Villars and Karin Cenzual                 #
#                                                                            #
#   Copyright (c) ASM International & Material Phases Data System (MPDS),    #
# Switzerland & National Institute for Materials Science (NIMS), Japan, 2017 #
#                   All rights reserved. Version 2017.08                     #
#                                                                            #
#   This copy of Pearson's Crystal Data is licensed to:                      #
#   University of Alberta, Chemistry Department, 1-5 Installations License         #
#                                                                            #
##############################################################################

data_2040845
_audit_creation_date                     2018-09-07
_audit_creation_method
;
Pearson's Crystal Data browser
;
#_database_code_PCD                      2040845

# Entry summary

_chemical_formula_structural             'Cu~1.4~ Te'
_chemical_formula_sum                    'Cu1.40 Te'
_chemical_name_mineral                   rickardite
_chemical_compound_source                synthetic
_chemical_name_structure_type            Cu~2~Sb,tP6,129
_chemical_formula_weight                 216.6

# Bibliographic data

_publ_section_title
'Crystal structure of rickardite, Cu~4-x~Te~2~'                               
_journal_coden_ASTM                      AMMIAY
_journal_name_full                       'Am. Mineral.'
_journal_year                            1949
_journal_volume                          34
_journal_page_first                      441
_journal_page_last                       451
_journal_language                        English
loop_
 _publ_author_name
 _publ_author_address
'Forman S.A.'
;
Toronto University
Toronto
Canada
;
'Peacock M.A.'
;
Toronto University
Toronto
Canada
;

# Standardized crystallographic data

_cell_length_a                           3.98
_cell_length_b                           3.98
_cell_length_c                           6.12
_cell_angle_alpha                        90
_cell_angle_beta                         90
_cell_angle_gamma                        90
_cell_volume                             96.94
_cell_formula_units_Z                    2
_space_group_IT_number                   129
_space_group_name_H-M_alt                'P 4/n m m (origin choice 2)'
loop_
 _space_group_symop_id
 _space_group_symop_operation_xyz
 1 'x, y, z'
 2 '1/2-x, 1/2-y, z'
 3 '1/2-x, y, z'
 4 '-x, -y, -z'
 5 '-x, 1/2+y, -z'
 6 '1/2-y, 1/2-x, z'
 7 '1/2-y, x, z'
 8 '-y, -x, -z'
 9 '-y, 1/2+x, -z'
 10 '1/2+x, -y, -z'
 11 '1/2+x, 1/2+y, -z'
 12 'x, 1/2-y, z'
 13 '1/2+y, -x, -z'
 14 '1/2+y, 1/2+x, -z'
 15 'y, 1/2-x, z'
 16 'y, x, z'
loop_
 _atom_type_symbol
 Cu
 Te
loop_
 _atom_site_label
 _atom_site_type_symbol
 _atom_site_symmetry_multiplicity
 _atom_site_Wyckoff_symbol
 _atom_site_fract_x
 _atom_site_fract_y
 _atom_site_fract_z
 _atom_site_occupancy
 Cu2 Cu 2 c 0.25 0.25 0.27 0.4
 Te Te 2 c 0.25 0.25 0.715 1
 Cu1 Cu 2 a 0.75 0.25 0 1

_exptl_crystal_colour                    'purple red'
_exptl_crystal_density_meas              ?
_exptl_crystal_density_diffrn            7.42
_cell_measurement_temperature            ?
_cell_measurement_radiation              'X-rays, Cu Ka1'
_cell_measurement_wavelength             1.5405
_pd_proc_wavelength                      1.5405
_cell_measurement_reflns_used            ?
_diffrn_ambient_temperature              ?
_diffrn_measurement_device               film
_diffrn_measurement_device_type          ?
_diffrn_radiation_type                   'X-rays, Cu Ka1'
_diffrn_radiation_wavelength             1.5405
_diffrn_reflns_number                    ?
_exptl_absorpt_coefficient_mu            ?
_exptl_absorpt_correction_type           ?
_computing_structure_solution            'starting values from the literature'
_refine_ls_number_parameters             ?
_refine_ls_number_reflns                 ?
_refine_ls_R_factor_gt                   ?
_refine_ls_wR_factor_gt                  ?
_pd_proc_ls_proof_R_factor               ?
_pd_proc_ls_proof_wR_factor              ?
_refine_ls_R_I_factor                    ?

# End of data set 2040845

ardunn commented 6 years ago

For reference, the topic on the pymatgen help group

mkhorton commented 6 years ago

I'm a bit confused by this too. It seems clear the first ordering sets one Cu:0.4 site to 0, and the other Cu:0.4 site to 1, which is not completely unreasonable but obviously the overall composition is wrong.

I see this is @computron's code/Anubhav has already commented on the Google Groups thread, do you have any ideas where the rounding from 0.4->0.5 occupancy is happening before I take a closer look at this? Seems like this must be happening in EnumlibAdaptor.

computron commented 6 years ago

The max_disordered_sites option is supposed to be a way to help you find a good max_cell_size parameter. Rather than a single max_cell_size, it keeps incrementing max_cell_size (starting small and getting higher) until EnumlibAdapter returns a solution. If EnumlibAdapter finds any solution, the iteration will quit. Thus, it tries to find the minimum max_cell_size that produces some solution from EnumlibAdapter, thus saving large computation time from having max_cell_size cranked up too high.

For this structure, you need a cell size of at least 5 in order to get the occupancies to work out. However, EnumlibAdapter already starts returning structures (with the wrong composition, Cu3Te2) at max_cell_size=2. e.g. running the code with EnumerateStructureTransformation(max_cell_size=2) will produce the same results.

Clearly, the assumption that the point at which EnumlibAdapter produces any solution is a good "stopping" point for the max_cell_size parameter is incorrect. Actually, I have no idea why EnumlibAdapter doesn't produce any solution for max_cell_size=1 but does so for max_cell_size=2.

Anyone with more idea of the inner workings of when EnumlibAdapter does and doesn't return a solution could probably help in designing a better solution to this problem ...

computron commented 6 years ago

Note also that the intention of max_disordered_sites was to help make EnumerateStructureTransformation a bit more automatic.

That said, I think this needs a rework. Maybe a separate wrapper for the automatic function. Also the automatic function can:

turn the individual site occupancies into fractions, with some tolerance. e.g. 1/4, 1/2, 1/3 occupied.
take the least common multiple of these fraction denominators. In the case above, the LCM is 12 - meaning you need at least 12 cells to get all these ordered correctly.
set the minimum and maximum cell size to be that LCM. There can be a parameter to make sure the LCM isn't too big (e.g., LCM of 1000) and you don't have too many cells. This parameter could either set a ceiling on the max cell size or (probably more relevant) set an upper ceiling on the number of disordered sites that are being ordered (which indirectly sets an upper ceiling on the max cell size).

shyuep commented 6 years ago

I think the problem is in enumlib itself. I did some basic debugging and the structure supplied to enumlib is not changing between loops. So somehow, enumlib is assuming that a 0.4 occupancy can be satisified by a Cu3Te2.

mkhorton commented 6 years ago

We could apply some sanity checks on the output / enforce a 'strict' mode to filter these incorrect composition structs out.

Agreed the ultimate issue is enumlib. It's too much of a black box for my liking.

Would also be happy to adopt the LCM approach suggested, I've tried something similar previously, possibly in MagOrdering, would have to check. I know we have DiscretizeOccupanciesTransformation now too which would take care of part of that approach.

shyuep commented 6 years ago

We could apply sanity checks, but I think let's confirm that the problem is indeed in enumlib, and not in some weird issue in pymatgen's generation of enumlib input files. If there is indeed a problem with enumlib, we should post an issue on enumlib itself and let Gus Hart know. On our end, we should probably implement a tolerance check that the compositions returned matches the intended ones to within a certain tolerance.

shyuep commented 6 years ago

I found the issue. The problem in this case is that the base (which is multipled by conc to generate composition ranges) is too small a number. That resulted in rounding precision errors which causes an overlap between the species and vacancy concentration ranges when generating the enumlib input file. For example, with base 8 and a supercell size of 2, the input file generated has the range of 2-4/8 for Cu, and 3-5/8 for vacancies. That resulted in a 4/8 4/8 split satisfying the enumeration, which is a Cu3Te2.

The solution is to make sure the base is large enough. I basically just applied a 10x factor to the base. This is an arbitrary factor in any case, so it does not make all that much of a difference.

The issue is closed for now. But someone should implement the aforementioned sanity checks with composition tolerances for the generated structure as well.

PMG dictator #1 has now semi-retired....

mkhorton commented 6 years ago

Nice catch, and duly noted! Will add it to the list ...

shyuep commented 6 years ago

Note that the fixed EnumlibAdaptor caused the max_disordered test to fail. The problem is that the max disordered test was actually wrongly coded. There was a primitive cubic cell with Li 0.2, Na 0.2 and K 0.6 + 1O. If you order this with max_ordered = 5, you obviously must have a cell with 10 atoms. The test was checking that it has only 8 atoms!! The above fix actually solves this problem.

computron commented 5 years ago

A quick postscript: @utf will be looking into coding an alternate enumerator in pymatgen using the bsym (https://github.com/bjmorgan/bsym) library. According to Alex, this library might improve upon common problems with enumlib.

materialsproject / pymatgen

EnumerateStructureTransformation does not return structures with correct stoichiometries #1286