Dataset Card for MatText
Dataset Description
Dataset Summary
The dataset contains crystal structures in various text representations and labels for some subsets.
Supported Tasks and Leaderboards
The task for the pretraining dataset is self-supervised language modeling; for the fine-tuning datasets, it is supervised property prediction.
Languages
This is not a natural language dataset.
Dataset Structure
Data Instances
The instances represent materials. They are crystal structures of 3D-connected solid materials.
Data Fields
local_env (string): The Local Env text representation of a material
slices (string): The SLICES representation of a material
cif_p1 (string): The CIF representation of a material in P$_1$ symmetry
composition (string): The composition of a material in Hill notation
crystal_text_llm (string): The text representation of a material proposed in Gruver et al.
atom_sequences_plusplus (string): A space-separated enumeration of element symbols and the lattice parameters
labels (float): For the gvrh datasets, the targets are the base 10 logarithm of the DFT Voigt-Reuss-Hill average shear moduli in GPa. For the kvrh datasets, the targets are the base 10 logarithm of the DFT Voigt-Reuss-Hill average bulk moduli in GPa. For the perovskite dataset, the labels are the heat of formation of the entire cell, in eV, as calculated by RPBE GGA-DFT. For the pretraining datasets, there are no labels.
mbid (string): A unique identifier of a material
cif_symmetrized (string): The CIF representation of a material in higher symmetry
atom_sequences (string): A space-separated enumeration of element symbols
zmatrix (string): A z-matrix (internal coordinates) representation of the material
Data Splits
For benchmarking, we follow the five-fold cross-validation proposed by MatBench. The folds are uploaded as splits to HuggingFace.
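As a rough illustration of a five-fold protocol, the sketch below builds five disjoint folds over a toy list of records and scores a mean-value baseline on each held-out fold. The record values, the `k_fold_indices` helper, and the round-robin fold assignment are assumptions made for this example; the actual fold membership is defined by the splits uploaded to HuggingFace, not by this code.

```python
from statistics import mean


def k_fold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists for a simple round-robin k-fold split."""
    folds = [list(range(start, n_samples, n_folds)) for start in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, test


# Toy records using two field names from this card; the values are made up.
records = [{"mbid": f"toy-{i}", "labels": 1.0 + 0.1 * i} for i in range(10)]

fold_maes = []
for train_idx, test_idx in k_fold_indices(len(records)):
    # Baseline: predict the mean training label for every held-out record.
    baseline = mean(records[i]["labels"] for i in train_idx)
    errors = [abs(records[i]["labels"] - baseline) for i in test_idx]
    fold_maes.append(mean(errors))

print(f"mean MAE over {len(fold_maes)} folds: {mean(fold_maes):.3f}")
```

For the gvrh and kvrh datasets the labels are base 10 logarithms, so an error computed this way is in log10(GPa) units rather than GPa.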
Dataset Creation
Curation Rationale
The dataset was created to enable the training and benchmarking of text-based modeling of materials properties, because how different text representations perform on materials modeling tasks is not yet well understood.
Source Data
Initial Data Collection and Normalization
The pretraining dataset is a subset of the materials deposited in the NOMAD archive. We queried only 3D-connected structures (i.e., excluding 2D materials, which often require special treatment) and, for consistency, limited our query to materials for which the bandgap has been computed using the PBE functional and the VASP code.
The benchmarking datasets are derived from MatBench. We limited ourselves to the smaller subsets for regression tasks, for which crystal structures are provided. Some instances are dropped because text representations could not be derived.
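The dropping step can be pictured as a simple filter over records. The sketch below is only an illustration under assumptions: it treats a representation that could not be derived as `None` or an empty string, and the `has_all_representations` helper and the toy records are hypothetical, not part of the MatText framework.

```python
# Text fields named in this card.
REPRESENTATION_FIELDS = (
    "local_env",
    "slices",
    "cif_p1",
    "cif_symmetrized",
    "composition",
    "crystal_text_llm",
    "atom_sequences",
    "atom_sequences_plusplus",
    "zmatrix",
)


def has_all_representations(record, fields=REPRESENTATION_FIELDS):
    """Return True if every listed field holds a non-empty string."""
    return all(
        isinstance(record.get(name), str) and record[name].strip()
        for name in fields
    )


# Toy records: one complete, one where SLICES is assumed to have failed.
complete = {name: "placeholder" for name in REPRESENTATION_FIELDS}
missing_slices = dict(complete, slices=None)

kept = [r for r in (complete, missing_slices) if has_all_representations(r)]
print(f"kept {len(kept)} of 2 records")
```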
Who are the source language producers?
n/a
Annotations
Annotation process
The only annotations are text representations that we derived using our MatText framework.
Who are the annotators?
n/a
Personal and Sensitive Information
n/a
Considerations for Using the Data
Social Impact of Dataset
There are many potential consequences of our work, none of which we feel are societal impacts that must be specifically highlighted here.
Discussion of Biases
There might be biases because certain parts of the materials space are oversampled.
Other Known Limitations
To our knowledge, there are no duplicates. Although we took care to avoid errors, some may remain, for example, because of problems with the crystal structures in the raw data.
Additional Information
Dataset Curators
The dataset was curated by Nawaf Alampara, Santiago Miret, and Kevin Maik Jablonka.
Licensing Information
The dataset is provided with an MIT license.
Citation Information
[More Information Needed]
Contributions
Thanks to n0w0f for adding this dataset.