Dataset Card for MatText

Dataset Description

Homepage: https://github.com/lamalab-org/MatText
Repository: https://github.com/lamalab-org/MatText
Paper: To be published
Leaderboard: To be published
Point of Contact: Nawaf Alampara

Dataset Summary

The dataset contains crystal structures encoded in several text representations. The fine-tuning subsets additionally carry property labels; the pretraining subsets are unlabeled.

Supported Tasks and Leaderboards

The task for the pretraining dataset is self-supervised language modeling. The task for the fine-tuning datasets is supervised property prediction.

Languages

This is not a natural language dataset.

Dataset Structure

Data Instances

The instances represent materials. They are crystal structures of 3D-connected solid materials.

Data Fields

local_env (string): The Local Env text representation of a material
slices (string): The SLICES representation of a material
cif_p1 (string): The CIF representation of a material in $P1$ symmetry
composition (string): The composition of a material in Hill notation
crystal_text_llm (string): The text representation of a material proposed in Gruver et al.
atom_sequences_plusplus (string): A space-separated enumeration of element symbols and the lattice parameters
labels (float): For the gvrh datasets, the targets are the base 10 logarithm of the DFT Voigt-Reuss-Hill average shear moduli in GPa. For the kvrh datasets, the targets are the base 10 logarithm of the DFT Voigt-Reuss-Hill average bulk moduli in GPa. For the perovskite dataset, the labels are the heat of formation of the entire cell, in eV, as calculated by RPBE GGA-DFT. The pretraining datasets have no labels.
mbid (string): A unique identifier of a material
cif_symmetrized (string): The CIF representation of a material in higher symmetry
atom_sequences (string): A space-separated enumeration of element symbols
zmatrix (string): A z-matrix (internal coordinates) representation of the material
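
Because the gvrh and kvrh labels are stored as base 10 logarithms, predictions have to be exponentiated before they can be read as moduli in GPa. The following is a minimal sketch of that conversion, together with a simple parse of the atom_sequences field. The helper names and the example record are hypothetical and the values are made up for illustration; they are not taken from the dataset.

```python
from collections import Counter


def log10_label_to_gpa(label: float) -> float:
    """Invert the base-10 logarithm applied to the gvrh/kvrh targets."""
    return 10.0 ** label


def count_elements(atom_sequence: str) -> Counter:
    """Count element symbols in a space-separated atom_sequences string."""
    return Counter(atom_sequence.split())


# Hypothetical record: field names follow the card, values are invented.
record = {"atom_sequences": "Na Cl Na Cl", "labels": 1.5}

modulus_gpa = log10_label_to_gpa(record["labels"])
element_counts = count_elements(record["atom_sequences"])
print(f"modulus: {modulus_gpa:.2f} GPa, elements: {dict(element_counts)}")
```

The perovskite labels are not log-transformed, so this conversion applies only to the gvrh and kvrh subsets.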

Data Splits

For benchmarking, we follow the five-fold cross-validation protocol proposed by MatBench. The folds are uploaded as splits to Hugging Face.
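
As an illustration of how scores from a five-fold protocol might be aggregated, the sketch below evaluates a trivial baseline on each fold and reports the mean and standard deviation of the mean absolute error. It assumes each fold is available as a (train, test) pair of (text, label) examples; the toy data, the `mean_baseline` model, and the function names are hypothetical and do not reflect the actual split names or the MatText tooling.

```python
from statistics import mean, stdev


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def cross_validate(folds, fit_predict):
    """Score fit_predict on every (train, test) fold; return mean and std of MAE."""
    scores = []
    for train, test in folds:
        test_texts = [text for text, _ in test]
        test_labels = [label for _, label in test]
        predictions = fit_predict(train, test_texts)
        scores.append(mean_absolute_error(test_labels, predictions))
    return mean(scores), stdev(scores)


def mean_baseline(train, test_texts):
    """Hypothetical baseline: always predict the mean training label."""
    train_mean = mean(label for _, label in train)
    return [train_mean for _ in test_texts]


# Toy data standing in for one benchmark subset: ten (text, label) examples.
examples = [(f"structure-{i}", float(i)) for i in range(10)]

# Build five folds by holding out two consecutive examples at a time.
folds = []
for k in range(5):
    test = examples[2 * k : 2 * k + 2]
    train = examples[: 2 * k] + examples[2 * k + 2 :]
    folds.append((train, test))

avg_mae, std_mae = cross_validate(folds, mean_baseline)
print(f"MAE over 5 folds: {avg_mae:.3f} +/- {std_mae:.3f}")
```

In practice the folds would come from the uploaded splits rather than being constructed locally, and `fit_predict` would wrap a fine-tuned language model.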

Dataset Creation

Curation Rationale

How different text representations perform on materials modeling tasks is not yet well understood. The dataset was created to enable the training and benchmarking of text-based models of materials properties across these representations.

Source Data

Initial Data Collection and Normalization

The pretraining dataset is a subset of the materials deposited in the NOMAD archive. We queried only 3D-connected structures (i.e., excluding 2D materials, which often require special treatment) and, for consistency, limited our query to materials for which the bandgap has been computed using the PBE functional and the VASP code.

The benchmarking datasets are derived from MatBench. We limited ourselves to the smaller subsets for regression tasks, for which crystal structures are provided. Some instances are dropped because text representations could not be derived.

Who are the source language producers?

n/a

Annotations

Annotation process

The only annotations are text representations that we derived using our MatText framework.

Who are the annotators?

n/a

Personal and Sensitive Information

n/a

Considerations for Using the Data

Social Impact of Dataset

There are many potential consequences of our work, none of which we feel are societal impacts that must be specifically highlighted here.

Discussion of Biases

The dataset may be biased because certain parts of materials space are oversampled.

Other Known Limitations

To our knowledge, there are no duplicates. Although we took care to avoid errors, some may remain, for example due to problems with the crystal structures in the raw data.

Additional Information

Dataset Curators

The dataset was curated by Nawaf Alampara, Santiago Miret, and Kevin Maik Jablonka.

Licensing Information

The dataset is provided under the MIT license.

Citation Information

[More Information Needed]

Contributions

Thanks to n0w0f for adding this dataset.
