hackingmaterials / matminer

Data mining for materials science
https://hackingmaterials.github.io/matminer/

F. Ricci et al. Electronic Transport Properties available through load_dataset()? #606

Closed (janosh closed 3 years ago)

janosh commented 3 years ago

Is the MPContrib Electronic Transport dataset available via matminer?

This

from matminer.datasets import get_available_datasets

get_available_datasets()

prints

['boltztrap_mp',
 'brgoch_superhard_training',
 'castelli_perovskites',
 'citrine_thermal_conductivity',
 'dielectric_constant',
 'double_perovskites_gap',
 'double_perovskites_gap_lumo',
 'elastic_tensor_2015',
 'expt_formation_enthalpy',
 'expt_gap',
 'flla',
 'glass_binary',
 'glass_binary_v2',
 'glass_ternary_hipt',
 'glass_ternary_landolt',
 'heusler_magnetic',
 'jarvis_dft_2d',
 'jarvis_dft_3d',
 'jarvis_ml_dft_training',
 'm2ax',
 'matbench_dielectric',
 'matbench_expt_gap',
 'matbench_expt_is_metal',
 'matbench_glass',
 'matbench_jdft2d',
 'matbench_log_gvrh',
 'matbench_log_kvrh',
 'matbench_mp_e_form',
 'matbench_mp_gap',
 'matbench_mp_is_metal',
 'matbench_perovskites',
 'matbench_phonons',
 'matbench_steels',
 'mp_all_20181018',
 'mp_nostruct_20181018',
 'phonon_dielectric_mp',
 'piezoelectric_tensor',
 'steel_strength',
 'wolverton_oxides']

So I'm guessing not? If so, curious to know why.

Also, I'd like to suggest adding a short code block to each MPContrib detail page showing how to download it. E.g.

Use matminer (pip install matminer) to download this dataset programmatically:

from matminer.datasets import load_dataset

df = load_dataset("matbench_phonons")

ardunn commented 3 years ago

Hey @janosh

Currently the full data is not available through matminer, though if @tschaume wants to make a matminer-loadable static .json.gz of it available, I'd be glad to add it to matminer.

An abbreviated version of it, boltztrap_mp, is available in matminer (see https://hackingmaterials.lbl.gov/matminer/dataset_summary.html). The following columns are available:

[image: columns of the boltztrap_mp dataset]

janosh commented 3 years ago

@ardunn Thanks for the quick reply! Do you have any information on how the 8,924 entries were selected from the 44,333 listed in the full dataset at https://contribs.materialsproject.org/projects/carrier_transport?

ardunn commented 3 years ago

I believe the ~9k entries were from a previous run

tschaume commented 3 years ago

@janosh @ardunn I have several versions of a candidate .json.gz file we could use to link the full dataset up to matminer. I'll make them available at a persistent link in MPContribs and report back here by Monday (hopefully).

tschaume commented 3 years ago

@janosh @ardunn There's a JSON file for download now at https://contribs.materialsproject.org/projects/carrier_transport.json.gz (12.5MB). It reflects the format of the contributions as they go into the MPContribs API and does not include the temperature- and doping-level dependent tables. Happy to iterate if it isn't a suitable format to link up to matminer. FYI @fraricci
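
For anyone who wants to inspect the download locally, here is a minimal sketch for reading a gzip-compressed JSON file with the Python standard library. The helper names `load_json_gz` and `save_json_gz` and the local filename in the usage note are hypothetical, and nothing is assumed about the layout of the decoded object.

```python
import gzip
import json
from pathlib import Path


def load_json_gz(path):
    """Read a gzip-compressed JSON file and return the decoded object."""
    with gzip.open(Path(path), mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


def save_json_gz(obj, path):
    """Write obj as gzip-compressed JSON (handy for round-trip checks)."""
    with gzip.open(Path(path), mode="wt", encoding="utf-8") as handle:
        json.dump(obj, handle)
```

After downloading the file, something like `data = load_json_gz("carrier_transport.json.gz")` should return the decoded contents, assuming the file was saved under that name.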

janosh commented 3 years ago

Thanks a lot @tschaume! 👍

I'm guessing that for addition to matminer it should be in a format ready for data mining. So the target columns should probably be floats rather than dtype object (i.e. strings).
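
A minimal sketch of that kind of cast in plain Python, assuming the records are dicts whose target values arrive as strings. The helper names `to_float` and `cast_columns` and the column name `"target"` are made up for illustration; they are not part of matminer or MPContribs.

```python
import math


def to_float(value):
    """Convert a string-like value to float, returning NaN if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def cast_columns(records, columns):
    """Return copies of the records with the named columns converted to floats."""
    return [
        {key: to_float(val) if key in columns else val for key, val in row.items()}
        for row in records
    ]


# Hypothetical rows: "target" holds numbers stored as strings.
rows = [{"id": "a", "target": "1e20"}, {"id": "b", "target": "n/a"}]
clean = cast_columns(rows, {"target"})
```

Unparseable entries become NaN instead of raising, so they can be inspected or dropped later. On a pandas DataFrame, `pandas.to_numeric(column, errors="coerce")` plays a similar role.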

Here's a version of the dataset as we would use it with models like CGCNN: https://github.com/janosh/matbench/commit/df3831319599b9aa3768dd5f97fdac5ab94bdc37.

janosh commented 3 years ago

What's the meaning of .v in these columns?

Sᵉ.p.v [µV/K]
Sᵉ.n.v [µV/K]
σᵉ.p.v [1/Ω/m/s]
σᵉ.n.v [1/Ω/m/s]
PFᵉ.p.v [µW/cm/K²/s]
PFᵉ.n.v [µW/cm/K²/s]
κₑᵉ.p.v [W/K/m/s]
κₑᵉ.n.v [W/K/m/s]

janosh commented 3 years ago

Ah. From here:

Value (v), temperature (T), and doping level (c) at the maximum of the average eigenvalue of the Seebeck coefficient

ardunn commented 3 years ago

Thanks @janosh and @tschaume. I will add these to the metadata at the same time that I add Ryan Kingsbury's updated expt_gaps and _formation_enthalpy datasets. The columns will be cast to the correct dtypes before uploading as well.

ardunn commented 3 years ago

@janosh @tschaume I wound up using the carrier_transport_with_strucs.json.gz that @janosh referenced earlier. Unfortunately the file currently hosted on MPContribs has a pesky data column which is not super easy to use, so the raw .json.gz has been uploaded to Figshare (https://figshare.com/articles/dataset/ricci_boltztrap_mp_tabular/14701110) in the meantime.

Notes for @janosh

The *_strucs.json.gz needed some minor adjustments.

Notable additions to metadata beyond what was in MPContribs:

Notes for @tschaume

If there are any major problems with hosting this data temporarily on Figshare, lmk and it will be removed immediately. Obviously the best scenario is if the matminer-compatible .json.gz is hosted on MPContribs. If there is no major problem keeping this file on Figshare in the interim, it will remain there until MPContribs has a serviceable link to the matminer-compatible .json.gz. Let me know if/when that is done and I will update the matminer link.

janosh commented 3 years ago

all carrier concentrations at optimal values of S, kappa, PF, and conductivity were mis-parsed (e.g., 1e20 --> 120.0).

@ardunn Oops! I wasn't using those columns but very good thing you noticed. Thanks for making the data easily available through matminer! 😅
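
The thread does not say how the mis-parse happened. As a purely hypothetical illustration of how a value like 1e20 can end up as 120.0, the sketch below compares Python's built-in float parsing with a made-up parser that strips every non-digit character first. Neither function comes from matminer or MPContribs.

```python
import re


def parse_scientific(text):
    """Parse a number written in scientific notation using float()."""
    return float(text)


def parse_digits_only(text):
    """Hypothetical buggy parser: drops everything except digits and dots."""
    return float(re.sub(r"[^0-9.]", "", text))


good = parse_scientific("1e20")   # exponent is honoured
bad = parse_digits_only("1e20")   # the "e" is removed, leaving "120"
```

A quick check such as comparing the parsed value against `float(original_string)` for a few rows can catch this kind of error before a dataset is uploaded.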