materialsproject / api

New API client for the Materials Project
https://materialsproject.github.io/api/
Other
107 stars 39 forks source link

New return format are sometimes incorrect and is much harder to convert to DataFrame #674

Open shyuep opened 1 year ago

shyuep commented 1 year ago

Example:

from pymatgen.ext.matproj import MPRester
with MPRester("<api key>") as mpr:
    data = mpr.summary.search(chemsys=["*-O"], fields=["material_id", "formula_pretty", "energy_per_atom", "band_gap", "k_vrh"])
# What is returned is a list. Let's just see what the first item in the list looks out. 
import pprint
pprint.pprint(data[0])

gives

MPDataDoc<SummaryDoc>(
formula_pretty='In2O3',
material_id=MPID(mp-1245202),
energy_per_atom=-5.6541087496,
band_gap=0.6920999999999999,
fields_not_requested=['builder_meta', 'nsites', 'elements', 'nelements', 'composition', 'composition_reduced', 'formula_anonymous', 'chemsys', 'volume', 'density', 'density_atomic', 'symmetry', 'property_name', 'deprecated', 'deprecation_reasons', 'last_updated', 'origins', 'warnings', 'structure', 'task_ids', 'uncorrected_energy_per_atom', 'formation_energy_per_atom', 'energy_above_hull', 'is_stable', 'equilibrium_reaction_energy_per_atom', 'decomposes_to', 'xas', 'grain_boundaries', 'cbm', 'vbm', 'efermi', 'is_gap_direct', 'is_metal', 'es_source_calc_id', 'bandstructure', 'dos', 'dos_energy_up', 'dos_energy_down', 'is_magnetic', 'ordering', 'total_magnetization', 'total_magnetization_normalized_vol', 'total_magnetization_normalized_formula_units', 'num_magnetic_sites', 'num_unique_magnetic_sites', 'types_of_magnetic_species', 'k_voigt', 'k_reuss', 'k_vrh', 'g_voigt', 'g_reuss', 'g_vrh', 'universal_anisotropy', 'homogeneous_poisson', 'e_total', 'e_ionic', 'e_electronic', 'n', 'e_ij_max', 'weighted_surface_energy_EV_PER_ANG2', 'weighted_surface_energy', 'weighted_work_function', 'surface_anisotropy', 'shape_factor', 'has_reconstructed', 'possible_species', 'has_props', 'theoretical']
)

The returned summary says that k_vrh was not requested when it was. Using the expensive all_fields=True also does not return the k_vrh.

Further, the new return format is a list of SummaryDocs. These do not support a dict-like API and so, what used to be an easy way to convert MPRester queries to a pandas DataFrame now requires a conversion. This is not ideal and non-obvious.

# Old MPRester
df = pd.DataFrame(data)
# New MPRester
df = pd.DataFrame([d.dict() for d in data])
munrojm commented 1 year ago

Thanks for the heads up on the first bug. I'll take a closer look at the problem with returning k_vrh.

We have had a few people bring up the dataframe conversion issue as well. I intend to get a fix for this very soon.

shyuep commented 1 year ago

I should add that things like weighted_surface_energy is also not working. This is an urgent problem. Right not, this makes the new API unusable for anything but the simplest queries using chemsys or formula and you can't even get most of the properties.

munrojm commented 1 year ago

This is actually due to a data building issue where materials that don't have the certain fields available to them do not properly have None populated in the MongoDB collection. This makes fields like k_vrh and weighted_surface_energy not appear as requested for materials like mp-1245202, even if they are. I will note this does not limit the availability of data to users.

This is solved in the new build to be put out in the next few weeks, but I also have the ability to patch release the existing one if you feel this is a big enough issue.

shyuep commented 1 year ago

@munrojm I am not quite sure what you mean. In2O3 itself does not have a k_vrh, but other oxides do. My query is for all oxides. As far as I can see, none of the k_vrh are returned.E.g., if I change the query to be formula=["Li2O"], there is also no k_vrh (and I know for sure there exists at least one MP material (e.g., the standard ground state) with a k_vrh). So data availability is definitely limited? For instance, I (a pretty sophisticated user) cannot figure out how to get the elastic constants unless I first get the task_ids and then follow that with a /elasticity search.

It is rather urgent because the summary is the easiest way to get most of the properties, rather than having to navigate through all the other endpoints like surface_properties. "summary" should be the default endpoint except for sophisticated users requiring much more complex stuff.

In any case, I think the API return should not be dependent on the builder code? (At least not to the extent that a bug in builder causes a major failure in the API code).

munrojm commented 1 year ago

@shyuep Ah, okay you are completely correct here. I only looked at the single MPID when I tested, not your other oxide query. While what I said is true, there is a bigger issue here where not all of the elastic data is properly being incorporated into the summary collection by the builder. This is why k_vrh doesn't show up at all in your binary oxide query. I am guessing this is due to a mismatch between older task_id keys in elasticity, and updated material_id keys in summary.

This is definitely very urgent, and I will get make sure the builder and data is update by EOD tomorrow. Should be a pretty straightforward fix.

munrojm commented 1 year ago

@shyuep, summary data should now be fixed.

shyuep commented 1 year ago

Great. Thanks for your prompt attention on this. I will test it later.

munrojm commented 1 year ago

Have had to roll back to hotfix something. Will be up again soon in case you see the incorrect data.

janosh commented 1 year ago

@shyuep If you instantiate the new MPRester with use_document_model=False it returns dicts which can be easily converted to dataframe.