Hi @pschwllr,
I'm currently writing my master's thesis on molecules and transformers and came here looking for a SMILES tokenizer to use with Hugging Face transformers. I'm well versed in neither molecular biology nor Hugging Face, so proceed with some caution, but maybe this is still useful. If it is, please let me know, especially if you have ideas on how to improve it.
This code snippet provides a tokenizer that can be used with Hugging Face transformers. It uses a simple WordLevel model, which you could easily replace with BPE etc. (see the sketch after the snippet).
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast
# splits a SMILES string into atoms, bonds (=, #, -, etc.), ring closures, and branches
SMI_REGEX_PATTERN = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""
BOS_TOKEN = "^"
EOS_TOKEN = "&"
PAD_TOKEN = " "
UNK_TOKEN = "?"
MODEL_MAX_LENGTH = 120
smi = "CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1"
smiles_tokenizer = Tokenizer(WordLevel(unk_token=UNK_TOKEN))
smiles_tokenizer.pre_tokenizer = Split(
    # "isolated" turns every regex match into its own token
    pattern=Regex(SMI_REGEX_PATTERN), behavior="isolated", invert=False
)
smiles_trainer = WordLevelTrainer(
    special_tokens=[BOS_TOKEN, EOS_TOKEN, PAD_TOKEN, UNK_TOKEN]
)
# train_from_iterator expects an iterable of texts; a bare string would be
# iterated character by character
smiles_tokenizer.train_from_iterator([smi], trainer=smiles_trainer)
smiles_tokenizer.post_processor = TemplateProcessing(
    single=BOS_TOKEN + " $A " + EOS_TOKEN,
    special_tokens=[
        (BOS_TOKEN, smiles_tokenizer.token_to_id(BOS_TOKEN)),
        (EOS_TOKEN, smiles_tokenizer.token_to_id(EOS_TOKEN)),
    ],
)
tokenizer_pretrained = PreTrainedTokenizerFast(
    tokenizer_object=smiles_tokenizer,
    model_max_length=MODEL_MAX_LENGTH,
    padding_side="right",
    truncation_side="left",
    bos_token=BOS_TOKEN,
    eos_token=EOS_TOKEN,
    pad_token=PAD_TOKEN,
    unk_token=UNK_TOKEN,
)
print(tokenizer_pretrained.encode(smi)) # [0, 5, 5, 6, 5, ..., 4, 8, 1]
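For example, swapping the WordLevel model for BPE only touches the model and the trainer. A rough sketch, reusing the names defined above:

from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# same pre-tokenizer and special tokens, just a different model/trainer pair
bpe_tokenizer = Tokenizer(BPE(unk_token=UNK_TOKEN))
bpe_tokenizer.pre_tokenizer = Split(
    pattern=Regex(SMI_REGEX_PATTERN), behavior="isolated", invert=False
)
bpe_trainer = BpeTrainer(special_tokens=[BOS_TOKEN, EOS_TOKEN, PAD_TOKEN, UNK_TOKEN])
bpe_tokenizer.train_from_iterator([smi], trainer=bpe_trainer)

The wrapped PreTrainedTokenizerFast can also be persisted with tokenizer_pretrained.save_pretrained(...) and reloaded via AutoTokenizer.from_pretrained(...), so it should plug into any transformers model.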
Feature request
We would like to implement a general RegexTokenizer that takes a regex as input and tokenizes strings according to it.
Motivation
In chemistry, for example, there are line notations like SMILES (http://opensmiles.org/opensmiles.html), which can be used to represent molecules and reactions as strings.
In previous work, such as the MolecularTransformer (https://pubs.acs.org/doi/full/10.1021/acscentsci.9b00576, built with OpenNMT) or RXNMapper (https://www.science.org/doi/10.1126/sciadv.abe4166, with huggingface/transformers), we used a regex to split SMILES by atoms/bonds.
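The splitting itself needs nothing beyond the standard library; a minimal sketch of how such a regex (here the same pattern as in the snippet above) breaks a SMILES string into atom/bond tokens:

import re

SMI_REGEX_PATTERN = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""

smi = "CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1"
print(re.findall(SMI_REGEX_PATTERN, smi))
# ['C', 'C', '(', 'C', ')', '(', 'C', ')', 'c', '1', 'c', 'c', 'c', '2', 'o', ...]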
But every time we want to change the transformer model, we have to rewrite and redefine the tokenizer to make it work with the new model. Is there a more efficient and general way to do this? We could imagine that other fields (e.g. proteins) would also benefit from a RegexTokenizer.
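To make the idea concrete, here is a rough sketch of what such a RegexTokenizer could look like when composed from existing tokenizers building blocks (the function name and interface are just a proposal, nothing like this exists yet):

from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

def regex_tokenizer(pattern: str, unk_token: str = "[UNK]") -> Tokenizer:
    """Build an untrained word-level tokenizer that splits on a user-supplied regex."""
    tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
    tokenizer.pre_tokenizer = Split(
        pattern=Regex(pattern), behavior="isolated", invert=False
    )
    return tokenizer

# e.g. for SMILES:
# smiles_tok = regex_tokenizer(SMI_REGEX_PATTERN)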
Your contribution
Happy to help with the PR. The regex for SMILES (chemistry) is ready; we just don't know where best to start.