xtal2txt

xtal2txt is a package for defining, converting, encoding, and decoding text representations of crystal structures. It is an important part of our MatText framework.

💪 Getting Started

🚀 Installation

The most recent release can be installed from PyPI with:

$ pip install xtal2txt

The most recent code and data can be installed directly from GitHub with:

$ pip install git+https://github.com/lamalab-org/xtal2txt.git
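
To confirm that the installation succeeded, a minimal sketch using only the Python standard library is shown here. The helper name installed_version is hypothetical and is not part of xtal2txt.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if xtal2txt is installed, otherwise None.
print(installed_version("xtal2txt"))
```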

Text Representation with xtal2txt

The TextRep class in xtal2txt.core converts crystal structures into several text representations. Below is an example of its usage:

from xtal2txt.core import TextRep
from pymatgen.core import Structure

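# Load the structure from a CIF file.
# pymatgen infers the format from the file extension; the second positional
# argument of Structure.from_file is `primitive`, not a format string.
cif_path = "InCuS2_p1.cif"
structure = Structure.from_file(cif_path)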

# Initialize TextRep Class
text_rep = TextRep.from_input(structure)

requested_reps = [
    "cif_p1",
    "slices",
    "atom_sequences",
    "atom_sequences_plusplus",
    "crystal_text_llm",
    "zmatrix",
]

# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
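
The result can then be inspected one representation at a time. The sketch below assumes that get_requested_text_reps returns a dictionary mapping each requested name to its text, which this README does not state explicitly. The summarize_reps helper and the stand-in dictionary are hypothetical and only serve as illustration.

```python
def summarize_reps(reps, preview_chars=40):
    """Return one summary line per representation: name, length, short preview."""
    lines = []
    for name, text in reps.items():
        # Treat a missing representation as an empty string.
        text = "" if text is None else str(text)
        # Collapse whitespace so multi-line text (such as CIF) fits on one line.
        preview = " ".join(text.split())[:preview_chars]
        lines.append(f"{name}: {len(text)} chars | {preview}")
    return lines


# Stand-in for the dictionary assumed to come back from
# text_rep.get_requested_text_reps(requested_reps); the values are placeholders.
example_reps = {
    "slices": "placeholder slices text",
    "atom_sequences": "placeholder atom sequence text",
}

for line in summarize_reps(example_reps):
    print(line)
```

In real use, pass requested_text_reps instead of example_reps.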

Using xtal2txt Tokenizers

By default, the tokenizer is initialized with [CLS] and [SEP] tokens. The following example uses SliceTokenizer:

from xtal2txt.tokenizer import SliceTokenizer

tokenizer = SliceTokenizer(
    model_max_length=512,
    truncation=True,
    padding="max_length",
    max_length=512,
)
print(tokenizer.cls_token)  # prints [CLS]

You can access the [CLS] token through the cls_token attribute of the tokenizer. During decoding, set the skip_special_tokens parameter to True to omit these special tokens from the output.

Decoding while skipping special tokens:

tokenizer.decode(token_ids, skip_special_tokens=True)
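
To illustrate the effect of skip_special_tokens, here is a toy, pure-Python sketch. It is not the xtal2txt or Hugging Face implementation; the vocabulary, the toy_decode function, and the token ids are all hypothetical.

```python
# Hypothetical toy vocabulary; real tokenizers load theirs from a vocabulary file.
ID_TO_TOKEN = {0: "[CLS]", 1: "[SEP]", 2: "[PAD]", 3: "Cu", 4: "In", 5: "S"}
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def toy_decode(token_ids, skip_special_tokens=False):
    """Map ids back to tokens, optionally dropping special tokens."""
    tokens = [ID_TO_TOKEN[i] for i in token_ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)


token_ids = [0, 3, 4, 5, 5, 1, 2, 2]
print(toy_decode(token_ids))                            # [CLS] Cu In S S [SEP] [PAD] [PAD]
print(toy_decode(token_ids, skip_special_tokens=True))  # Cu In S S
```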

Initializing tokenizers with custom special tokens

If the [CLS] token is not required, you can initialize the tokenizer with an empty special_tokens dictionary.

Initialization without [CLS] and [SEP] tokens:

tokenizer = SliceTokenizer(
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
)

All Xtal2txtTokenizer instances inherit from PreTrainedTokenizer and accept arguments compatible with Hugging Face tokenizers.

Tokenizers with special number tokenization

The special_num_token argument (False by default) can be set to True to tokenize numbers in a special way, as designed and implemented by RegressionTransformer.

tokenizer = SliceTokenizer(
    special_num_token=True,
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
)
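
The idea behind this kind of number tokenization is that each digit becomes its own token, tagged with its decimal position. The sketch below illustrates that idea in pure Python under stated assumptions: the function name number_to_position_tokens is hypothetical, it handles only unsigned decimal strings, and the exact token strings that xtal2txt emits may differ.

```python
def number_to_position_tokens(number_text):
    """Split an unsigned decimal string into per-digit tokens tagged with their power of ten."""
    integer_part, _, fractional_part = number_text.partition(".")
    if not integer_part.isdigit() or (fractional_part and not fractional_part.isdigit()):
        raise ValueError(f"expected an unsigned decimal string, got {number_text!r}")

    tokens = []
    # Integer digits: the leftmost digit carries the highest exponent.
    for offset, digit in enumerate(integer_part):
        exponent = len(integer_part) - 1 - offset
        tokens.append(f"_{digit}_{exponent}_")
    if fractional_part:
        tokens.append("_._")
        # Fractional digits: exponents run -1, -2, ...
        for offset, digit in enumerate(fractional_part, start=1):
            tokens.append(f"_{digit}_-{offset}_")
    return tokens


print(number_to_position_tokens("12.5"))  # ['_1_1_', '_2_0_', '_._', '_5_-1_']
```

Encoding the position explicitly lets a model distinguish the 5 in 0.5 from the 5 in 500 without having to infer magnitude from context.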

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

👋 Attribution

⚖️ License

The code in this package is licensed under the MIT License. See the Notice for imported LGPL code.

💰 Funding

This project has been supported by the Carl Zeiss Foundation as well as Intel and Merck.
