Hi @pschwllr,
I'm currently writing my master's thesis on molecules and transformers and came here looking for a SMILES tokenizer to use with Hugging Face transformers. I'm well versed in neither molecular biology nor Hugging Face, so proceed with some caution, but maybe this is still useful. If it is, please let me know, especially if you have ideas on how to improve it.
This code snippet provides a tokenizer that can be used with Hugging Face transformers. It uses a simple WordLevel model, which you could easily replace with BPE etc. (see the sketch after the snippet).
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast
# splits a SMILES string into atoms, bonds (=, #, -, etc.), ring closures, and branches
SMI_REGEX_PATTERN = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""
BOS_TOKEN = "^"
EOS_TOKEN = "&"
PAD_TOKEN = " "
UNK_TOKEN = "?"
MODEL_MAX_LENGTH = 120
smi = "CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1"
smiles_tokenizer = Tokenizer(WordLevel(unk_token=UNK_TOKEN))
smiles_tokenizer.pre_tokenizer = Split(
    # "isolated" turns every regex match into its own token
    pattern=Regex(SMI_REGEX_PATTERN), behavior="isolated", invert=False
)
smiles_trainer = WordLevelTrainer(
    special_tokens=[BOS_TOKEN, EOS_TOKEN, PAD_TOKEN, UNK_TOKEN]
)
# train_from_iterator expects an iterable of texts; a bare string would be
# iterated character by character
smiles_tokenizer.train_from_iterator([smi], trainer=smiles_trainer)
smiles_tokenizer.post_processor = TemplateProcessing(
    single=BOS_TOKEN + " $A " + EOS_TOKEN,
    special_tokens=[
        (BOS_TOKEN, smiles_tokenizer.token_to_id(BOS_TOKEN)),
        (EOS_TOKEN, smiles_tokenizer.token_to_id(EOS_TOKEN)),
    ],
)
tokenizer_pretrained = PreTrainedTokenizerFast(
    tokenizer_object=smiles_tokenizer,
    model_max_length=MODEL_MAX_LENGTH,
    padding_side="right",
    truncation_side="left",
    bos_token=BOS_TOKEN,
    eos_token=EOS_TOKEN,
    pad_token=PAD_TOKEN,
    unk_token=UNK_TOKEN,
)
print(tokenizer_pretrained.encode(smi)) # [0, 5, 5, 6, 5, ..., 4, 8, 1]
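For example, swapping the WordLevel model for BPE only touches the model and the trainer. A rough sketch, reusing the names defined above:

from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# same pre-tokenizer and special tokens, just a different model/trainer pair
bpe_tokenizer = Tokenizer(BPE(unk_token=UNK_TOKEN))
bpe_tokenizer.pre_tokenizer = Split(
    pattern=Regex(SMI_REGEX_PATTERN), behavior="isolated", invert=False
)
bpe_trainer = BpeTrainer(special_tokens=[BOS_TOKEN, EOS_TOKEN, PAD_TOKEN, UNK_TOKEN])
bpe_tokenizer.train_from_iterator([smi], trainer=bpe_trainer)

The wrapped PreTrainedTokenizerFast can also be persisted with tokenizer_pretrained.save_pretrained(...) and reloaded via AutoTokenizer.from_pretrained(...), so it should plug into any transformers model.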
Feature request
We would like to implement a general RegexTokenizer that takes a regex as input and tokenizes strings according to it.
Motivation
In chemistry, for example, there are line notations like SMILES (http://opensmiles.org/opensmiles.html), which can be used to represent molecules and reactions as strings.
In previous work, such as the MolecularTransformer (https://pubs.acs.org/doi/full/10.1021/acscentsci.9b00576, built with OpenNMT) or RXNMapper (https://www.science.org/doi/10.1126/sciadv.abe4166, with huggingface/transformers), we used a regex to split SMILES by atoms/bonds.
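The splitting itself needs nothing beyond the standard library; a minimal sketch of how such a regex (here the same pattern as in the snippet above) breaks a SMILES string into atom/bond tokens:

import re

SMI_REGEX_PATTERN = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""

smi = "CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1"
print(re.findall(SMI_REGEX_PATTERN, smi))
# ['C', 'C', '(', 'C', ')', '(', 'C', ')', 'c', '1', 'c', 'c', 'c', '2', 'o', ...]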
But every time we want to change the transformer model, we have to rewrite and redefine the tokenizer to make it work with the new model. Is there a more efficient and general way to do this? We could imagine that other fields (e.g. proteins) would also benefit from a RegexTokenizer.
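To make the idea concrete, here is a rough sketch of what such a RegexTokenizer could look like when composed from existing tokenizers building blocks (the function name and interface are just a proposal, nothing like this exists yet):

from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

def regex_tokenizer(pattern: str, unk_token: str = "[UNK]") -> Tokenizer:
    """Build an untrained word-level tokenizer that splits on a user-supplied regex."""
    tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
    tokenizer.pre_tokenizer = Split(
        pattern=Regex(pattern), behavior="isolated", invert=False
    )
    return tokenizer

# e.g. for SMILES:
# smiles_tok = regex_tokenizer(SMI_REGEX_PATTERN)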
Your contribution
Happy to help with the PR. The regex for SMILES (chemistry) is ready; we just don't know where best to start.