lamalab-org / MatText

Text-based modeling of materials.
https://lamalab-org.github.io/MatText/
MIT License

datacard for HF #82

Closed kjappelbaum closed 3 months ago

kjappelbaum commented 3 months ago

Dataset Card for MatText

Dataset Description

Dataset Summary

The dataset contains crystal structures in various text representations and labels for some subsets.

Supported Tasks and Leaderboards

The task for the pretraining dataset is self-supervised language modeling. For the fine-tuning dataset, the task is supervised property prediction.

Languages

This is not a natural language dataset.

Dataset Structure

Data Instances

Each instance represents a material, given as the crystal structure of a 3D-connected solid.

Data Fields

Data Splits

For benchmarking, we follow the five-fold cross-validation scheme proposed by MatBench. The folds are uploaded as splits to Hugging Face.
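As an illustration of how five folds can be consumed, here is a minimal sketch in plain Python. It assumes the folds are disjoint and that the training set for a fold is the union of the other four; the split names (`fold_0` to `fold_4`) and the record layout are hypothetical and are not taken from the uploaded dataset.

```python
from typing import Dict, List, Tuple

Record = dict


def cross_validation_pairs(
    folds: Dict[str, List[Record]],
) -> List[Tuple[str, List[Record], List[Record]]]:
    """Return (held_out_name, train_records, test_records) for every fold.

    Assumption: folds are disjoint, so the training set for a given fold
    is simply the concatenation of all other folds.
    """
    pairs = []
    for held_out in sorted(folds):
        test = folds[held_out]
        train = [
            record
            for name in sorted(folds)
            if name != held_out
            for record in folds[name]
        ]
        pairs.append((held_out, train, test))
    return pairs


# Toy data: five hypothetical folds with two records each.
toy_folds = {f"fold_{i}": [{"id": 2 * i}, {"id": 2 * i + 1}] for i in range(5)}

pairs = cross_validation_pairs(toy_folds)
for name, train, test in pairs:
    print(name, len(train), len(test))
```

Reporting the mean and spread of the metric over the five held-out folds then gives the cross-validated score.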

Dataset Creation

Curation Rationale

How different text representations perform on materials modeling tasks is not yet well understood. The dataset was created to enable the training and benchmarking of text-based models of materials properties.

Source Data

Initial Data Collection and Normalization

The pretraining dataset is a subset of the materials deposited in the NOMAD archive. We queried only 3D-connected structures (i.e., excluding 2D materials, which often require special treatment) and, for consistency, limited our query to materials for which the bandgap has been computed using the PBE functional and the VASP code.
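The selection criteria above can be expressed as a simple predicate. The sketch below is only illustrative: the field names (`dimensionality`, `xc_functional`, `program_name`, `band_gap`) are hypothetical and do not reflect the NOMAD schema or the query that was actually run.

```python
from typing import List, Optional, TypedDict


class Entry(TypedDict):
    # Hypothetical metadata fields, used only for illustration.
    id: str
    dimensionality: str
    xc_functional: str
    program_name: str
    band_gap: Optional[float]


def keep_entry(entry: Entry) -> bool:
    """True for 3D-connected entries with a PBE bandgap computed in VASP."""
    return (
        entry["dimensionality"] == "3D"
        and entry["xc_functional"] == "PBE"
        and entry["program_name"] == "VASP"
        # A bandgap of 0.0 (a metal) is a valid value, so test for None only.
        and entry["band_gap"] is not None
    )


entries: List[Entry] = [
    {"id": "a", "dimensionality": "3D", "xc_functional": "PBE",
     "program_name": "VASP", "band_gap": 1.2},
    {"id": "b", "dimensionality": "2D", "xc_functional": "PBE",
     "program_name": "VASP", "band_gap": 0.5},
    {"id": "c", "dimensionality": "3D", "xc_functional": "HSE06",
     "program_name": "VASP", "band_gap": 2.1},
    {"id": "d", "dimensionality": "3D", "xc_functional": "PBE",
     "program_name": "VASP", "band_gap": None},
    {"id": "e", "dimensionality": "3D", "xc_functional": "PBE",
     "program_name": "VASP", "band_gap": 0.0},
]

selected = [entry for entry in entries if keep_entry(entry)]
print([entry["id"] for entry in selected])
```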

The benchmarking datasets are derived from MatBench. We limited ourselves to the smaller subsets for regression tasks, for which crystal structures are provided. Some instances are dropped because text representations could not be derived.
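Dropping instances whose text representation cannot be derived can be sketched as follows. The converter `toy_to_text` is a stand-in written for this example and is not part of the MatText API; the real conversion and its failure modes may differ.

```python
from typing import Callable, List, Sequence, Tuple


def derive_or_drop(
    structures: Sequence[str],
    to_text: Callable[[str], str],
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Convert each structure and record the indices that fail.

    Returns (kept, dropped): kept holds (index, representation) pairs,
    dropped holds the indices for which the converter raised.
    """
    kept: List[Tuple[int, str]] = []
    dropped: List[int] = []
    for index, structure in enumerate(structures):
        try:
            kept.append((index, to_text(structure)))
        except Exception:
            # Broad catch on purpose: any conversion failure drops the instance.
            dropped.append(index)
    return kept, dropped


def toy_to_text(structure: str) -> str:
    """Hypothetical converter that fails on empty input."""
    if not structure:
        raise ValueError("cannot derive a representation from an empty structure")
    return f"repr:{structure}"


kept, dropped = derive_or_drop(["NaCl", "", "Si"], toy_to_text)
print(kept)
print(dropped)
```

Keeping the dropped indices makes it possible to report how many instances were lost per subset.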

Who are the source language producers?

n/a

Annotations

Annotation process

The only annotations are text representations that we derived using our MatText framework.

Who are the annotators?

n/a

Personal and Sensitive Information

n/a

Considerations for Using the Data

Social Impact of Dataset

There are many potential consequences of our work, none of which we feel are societal impacts that must be specifically highlighted here.

Discussion of Biases

The dataset may be biased because certain parts of materials space are oversampled.

Other Known Limitations

To our knowledge, there are no duplicates. While we took care to avoid errors, some may remain, for example, due to problems with the crystal structures in the raw data.

Additional Information

Dataset Curators

The dataset was curated by Nawaf Alampara, Santiago Miret, and Kevin Maik Jablonka.

Licensing Information

The dataset is provided with an MIT license.

Citation Information

[More Information Needed]

Contributions

Thanks to n0w0f for adding this dataset.