materialsproject / pymatgen

Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.
https://pymatgen.org
Other
1.51k stars 864 forks source link

Bandstructures from database have incorrect reciprocal lattices stored. #1031

Closed mfkoerner closed 1 year ago

mfkoerner commented 6 years ago

System

Summary

When using MPRester('API_key').get_bandstructure_by_material_id('mpid') to get BandStructureSymmLine objects from the database, many of them (~50%) have incorrect reciprocal lattices. Of those incorrect, about half of the incorrect ones are off by a scalar factor of 2pi (as in b1 = 4pi^2 (a2 x a3) / (a1 . a2 x a3) instead of the correct 2pi term). Some of them are off in angle but have the correct volume. Others are also off in volume.

Example code


import pymatgen as pmg
import numpy as np
mpr = pmg.MPRester("Your_API_Key")
mpidsdicts = mpr.query({'has_bandstructure': True}, ['material_id'])
mpids = [i['material_id'] for i in mpidsdicts]

def reciprocal(Lattice):
    """Returns the reciprocal of a lattice
    Lattice: input as [[a1],[a2],[a3]]
    Output: [[b1],[b2],[b3]]"""
    tsprod = np.dot(Lattice[0], np.cross(Lattice[1], Lattice[2]))    #Triple scalar product
    rLattice = 2*np.pi * np.array([np.cross(Lattice[1], Lattice[2]), 
                               np.cross(Lattice[2], Lattice[0]), 
                               np.cross(Lattice[0], Lattice[1])]) / tsprod
    return(rLattice)

N = 5
mpids = np.random.choice(mpids, N)
for mpid in mpids:
    bs = mpr.get_bandstructure_by_material_id(mpid)
    structure = mpr.query(mpid, ['structure'])[0]['structure']
    lattice = structure.lattice.matrix
    rec_lat = mmg.reciprocal(lattice)
    np.set_printoptions(precision=3, suppress= True)
    print('mpid\n', mpid)
    print('volume from structure', np.linalg.det(rec_lat))
    print('volume from bandstruct', np.linalg.det(bs.lattice_rec.matrix))
    print('lattice\n', lattice)         # real lattice from structure data
    print('rec_lattice\n', rec_lat)     # reciprocal lattice from structure data
    print('rec_lattice_bs\n', bs.lattice_rec)    # reciprocal lattice stored in band structure
    realabc = np.sort(np.sum(rec_lat**2, axis=1)**(.5))
    print('structure abc\n', realabc)   # sanity check for structure data
    bandabc = np.sort(np.array(bs.lattice_rec.abc))
    print('band abc\n', bandabc)        # sanity check for band data
    print('quotient', bandabc/realabc)  # quotient of them each (sometimes wrong order)

No error message, just wrong results about half the time

Testing run on error

Ran a census on the database and found the following:

Good: 47.2% Volume good: 18.4% 2pi per vector: 21.8% other: 12.7%

These data are collected from a census on the materials project bandstructures There are a total of 52857 materials considered here The data were taken from the database on July 14 2017

"Good" means that all lattice vectors and angles are correct to within 0.1 inverse angstroms and 0.1 degrees "Volume good" means that the lattice volume is correct to within 0.01 inverse angstroms cubed but the material is not "Good" "2pi per vector" means that the vectors are off by a factor of 2pi but the angles are correct. "Other" is any material that doesn't fit into the three previously mentioned categories.

Materials with different types of errors:

no materials are off by 4pi^2, 8pi^3, or 8pi per lattice vector 1 material is off by 2 (mp-20557) 4 materials were off by 4pi (mp-10137, mp-1576, mp-583266, mp-685055)

5 example materials for each category: good: mp-13981 mp-1008280 mp-11022 mp-763833 mp-770501 volume good: mp-989622 mp-754192 mp-643903 mp-1008601 mp-556356 2pi per vector: mp-626542 mp-4308 mp-31412 mp-773060 mp-11271 other: mp-773764 mp-705574 mp-772309 mp-505584 mp-767321 (note that many of these are off by nearly 2pi per vector but not exactly)

Files (if any)

no files

shyuep commented 6 years ago

@dwinston Pls fix in the Database?

mkhorton commented 6 years ago

FYI @mfkoerner, you have your private API key in the example code. There's also a reciprocal_lattice method on Lattice, so structure.lattice.reciprocal_lattice should give the reciprocal lattice directly -- this seems to agree with your reciprocal() function, but mentioning in case it's useful in future.

Regarding the error itself, I'm just noting mp-764505 as an example to investigate further.

mfkoerner commented 6 years ago

Responding to @mkhorton , I fixed up my report a bit to fix the API issue and include some results I found. Most notably, I have percentages of the type of error and example materials for each style of error.

So, as far as I can tell, the order of the reciprocal vectors in bandstructure.lattice_rec are different than those from my reciprocal() function and I don't know a consistent way to properly match the lattice vectors in bandstructure.lattice_rec to the locations of the covalent band minimum (CBM) and valence band maximums (VMB) in the bandstructure itself.

It would be extremely useful to me if anyone has a way to get the cartesian reciprocal space coordinates of the CBM and VBM correctly while this error persists. The proper way to do this if the error is fixed is BandStructure.get_cbm()['kpoint'].cart_coords, but this is unfortunately impacted by the error.

mkhorton commented 6 years ago

Thanks very much for the extra detail @mfkoerner, this is very useful. Unfortunately, I don't (personally) know a way to get the information you want via the API until we've tracked down the root cause, but we will investigate the issue asap.

dougfabini commented 6 years ago

I'd like to echo this issue. The information about the reciprocal lattice seems to quite frequently be materially different between the reciprocal lattice stored/calculated in the "structure" object and the reciprocal lattice stored/calculated in the "band structure" object of the electronic structure module. Like, @mfkoerner above, I've conducted a number of tests, including calculating the reciprocal lattice on my end from the real lattice stored in the "structure" object, the result of which always properly matches the reciprocal lattice from the "structure" object.

We are doing some analysis of the electronic structure which involves looking at absolute crystal momentum differences between various points in the BZ, so resolving these issues is essential for us. The issues are 2-fold:

  1. Many of the reciprocal lattices stored in the "band structure" objects have lattice vector lengths off by a factor of 2pi, a few other "round" numbers, or (less commonly) arbitrary numbers. This is quickly illustrated in the two plots I've attached below, which show that for many materials, the reciprocal lattice unit cell volume from the "band structure" object is 8pi^3 times too large, while the volume from the "structure" object is correct. As a quick sanity check, something like bcc-Li should have roughly the largest Brillouin zone possible, and indeed nothing has a much larger BZ in the seemingly correct entries from the "structure" object (volume roughly 12 A^-3). -> Examples of materials with BZ volume too big by 8pi^3 include LiAlSi (mp-3161), fcc-Al (mp-134), FeOF (mp-763221) and YSiNi (mp-20557). -> A particularly simple example is that of Si (mp-149) versus LiAlSi (mp-3161), where the latter has reciprocal lattice vectors each too long by a factor of 2pi in the "band structure" object, though the two materials have nearly identical electronic structures and should have nearly identical Brillouin zones. -> Examples of materials with BZ volume too big by 64pi^3 include UB2Ru3 (mp-10137), TmNi2 (mp-1576), MgZn (mp-583266), and CrTe2 (mp-685055). ->This issue results, for instance, in a gross overestimate of carrier effective masses or of the crystal momentum change of indirect optical transitions.
  2. Independently or in addition to issue (1) above, the reciprocal lattice from the "band structure" object is sometimes a different choice of vectors/angles than the one in the "structure" object. This means that we can't simply use the correct reciprocal lattice from the "structure" object to calculate things related to the band structure. Is this just an application of the rules in Setyawan & Curtarolo (https://doi.org/10.1016/j.commatsci.2010.05.010) for standardizing high symmetry paths through the BZ? -> Examples of BZ volume correct, but different choice of vectors / angles: Y(Al2Cr)4 (mp-30177), Bi2AuO5 (mp-618309), Y13Ho19O48 (mp-758033) -> Example of each reciprocal lattice vector off by 2pi AND reciprocal lattice vectors chosen differently (leading to, among other things, a different monoclinic angle) for FeOF (mp-7631230)

There are additionally many examples where the reciprocal lattices from the two database modules ("structure" and "band structure" objects) are rotated in cartesian space relative to one another, though this presents no problem as it is essentially just a choice of coordinates, and only differences in crystal momentum should matter for most analyses.

It seems that were it not for issue (2), a workaround for issue (1) would be easy, by simply querying the "correct" reciprocal lattice from the "structure" object. However, if the lattice vectors are cycled, or for lower symmetry crystal systems, a different choice of angles is made, it's unclear how to correctly map points from the band structure k-point path to the correct absolute crystal momentum. If we are correct about these issues, I also suspect it would be better to fix the issues in the database, rather than make a bespoke workaround on our end.

As a quick illustration of issue (1), plots of the BZ volumes from the two different locations in the database are below. [Note that these plots are only for the ~20k semiconductors (E_g < 2.5 eV) for which the database has a calculated bandstructure.]

reciplatt-issue-1-annotated

reciplatt-issue-2-annotated

Any help you can offer is greatly appreciated.

Thanks, Doug

mkhorton commented 6 years ago

Just to update, we believe we've identified (or partly identified) the issue, but the fix is going to take some time to address. Once this has been resolved, we'll write up a full description here of what happened and what's been changed as a result to ensure consistency in future. Many thanks both for the detailed bug report and for your patience while we address this.

dougfabini commented 6 years ago

@mkhorton Thanks very much. Any information you can provide about the rough timeline for fixing this would be extremely helpful for our project planning purposes.

mkhorton commented 6 years ago

(Caveat: we're still investigating this issue, so the following is subject to change.)

Ok, to provide a little more detail... Essentially, the structure optimization calculation and the band structure calculation are essentially two separate (though related) calculations, and the 'structure' reported by the API is from the former and the band structure from the latter. While it should be the case that both structures should match each other, clearly in some instances this is not the case. This is due to a combination of issues, partly involving faulty logic, and partly involving some stale data. A robust fix needs to address this, and should also include new database validation logic to ensure any similar issues are caught in future. As such, it will take some time to address. We're actually in the process of moving to a new database build system in part to avoid issues such as this.

However, as a stop-gap, it may be possible to extract the correct band structure object from the API. This is slightly convoluted since the API wasn't designed for this use, but the following code should work:

from pymatgen import MPRester, Structure

mpid = 'mp-764505' # mpid of interest
run_type = 'GGA' # material-dependent, can also be 'GGA+U'

mpr = MPRester()

data = mpr.query({'task_id':mpid},['band_structure'], mp_decode=False)
task_id = data[0]['band_structure'][run_type]['task_id']
task_data = mpr.get_task_data(task_id)
band_structure_structure = Structure.from_str(task_data[0]['cif'], fmt='cif')
band_structure_lattice = band_structure_structure.lattice

I believe this should work, and hope it's useful as an interim solution.

hautierg commented 6 years ago

Hi Matthew,

Francesco (here in cc) could help on this if needed.

Geoffroy

On 16 Feb 2018, at 20:38, Matthew Horton notifications@github.com wrote:

(Caveat: we're still investigating this issue, so the following is subject to change.)

Ok, to provide a little more detail... Essentially, the structure optimization calculation and the band structure calculation are essentially two separate (though related) calculations, and the 'structure' reported by the API is from the former and the band structure from the latter. While it should be the case that both structures should match each other, clearly in some instances this is not the case. This is due to a combination of issues, partly involving faulty logic, and partly involving some stale data. A robust fix needs to address this, and should also include new database validation logic to ensure any similar issues are caught in future. As such, it will take some time to address. We're actually in the process of moving to a new database build system in part to avoid issues such as this.

However, as a stop-gap, it may be possible to extract the correct band structure object from the API. This is slightly convoluted since the API wasn't designed for use, but the following code should work:

from pymatgen import MPRester, Structure

mpid = 'mp-764505' # mpid of interest run_type = 'GGA' # material-dependent

mpr = MPRester()

data = mpr.query({'task_id':mpid},['band_structure'], mp_decode=False) task_id = data[0]['band_structure'][run_type]['task_id'] task_data = mpr.get_task_data(task_id) band_structure_structure = Structure.from_str(task_data[0]['cif'], fmt='cif') band_structure_lattice = band_structure_structure.lattice I believe this should work, and hope it's useful as an interim solution.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/materialsproject/pymatgen/issues/1031#issuecomment-366337635, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3l9_lGHjq6nohYlA_OU3MKDoWITQVoks5tVdlDgaJpZM4R4VsI.

dougfabini commented 6 years ago

Thanks, Matthew, this is very helpful. Just to give you a bit of additional information in case it helps on your end, the workaround certainly fixes the issue for many compounds. However, some are off by small but meaningful amounts, very small amounts (perhaps just different relaxation parameters?), or by a factor of 2 or 4. I haven't dug into the last issue much yet, my first thought was perhaps these were magnetically ordered supercells for the electronic structure calculations, but that would make the BZ smaller rather than larger...do you have any insight on this? The entries with this issue number only in the hundreds, but it would be nice not to have exclude them on this basis from our analysis.

Here are two quick figures, again for ~19,000 semiconductors with E_g < 2.5 eV.

bz-volume-ratio

reciplatt-workaround-2

steicher commented 6 years ago

Hi again!

@dhfabini, @mfkoerner and I have made a good deal of progress with the workaround offered by @mkhorton, but discovered this month (07/18) that it no longer seems to work. We have four computers with separately installed versions of python + pymatgen and the line:

task_data = mpr.get_task_data(task_id)

no longer seems to work on any of them. For example, when called on Si (mp-149), I get:

pymatgen.ext.matproj.MPRestError: 'NoneType' object is not iterable. Content: b'{"valid_response": false, "error": "\'NoneType\' object is not iterable", "version": {"db": "3.0.0", "pymatgen": "2018.6.11", "rest": "2.0"}, "created_at": "2018-07-06T15:59:31.330706", "traceback": "Traceback (most recent call last):\\n File \\"/var/www/python/matgen_prod/materials_django/rest/rest.py\\", line 94, in wrapped\\n d = func(*args, **kwargs)\\n File \\"/var/www/python/matgen_prod/materials_django/tasks/rest.py\\", line 47, in get_property\\n entries = qe.get_entries(crit, False, supported_task_properties)\\n File \\"/var/www/miniconda3/envs/mpprod3/lib/python3.6/site-packages/matgendb/query_engine.py\\", line 305, in get_entries\\n symbols = [\\"{} {}\\".format(func, label) for label in labels]\\nTypeError: \'NoneType\' object is not iterable\\n"}'

Is there an easy way around this? Thanks again!

janosh commented 1 year ago

Closing as stale.