seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied on chemical SMILES data for drug design, chemical modelling, etc.
MIT License

Masking multiple tokens at a time #2

Closed by jannisborn 4 years ago

jannisborn commented 4 years ago

Hi, is it possible to mask multiple tokens at a time?

E.g. fill_mask('CCCO<mask>C') works fine. But when I call fill_mask('CC<mask>CO<mask>C'), I get:

~/miniconda3/envs/paccmann/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    553                 values, predictions = topk.values.numpy(), topk.indices.numpy()
    554             else:
--> 555                 masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
    556                 logits = outputs[i, masked_index, :]
    557                 probs = logits.softmax(dim=0)

ValueError: only one element tensors can be converted to Python scalars

Am I doing something wrong, or is this feature not supported, @seyonechithrananda? Many thanks for a reply!

seyonechithrananda commented 4 years ago

Hi @jannisborn, thanks for raising this issue! The fill-mask pipeline that HuggingFace provides for RoBERTa currently predicts only one masked token per sequence, which is why the call fails as soon as the input contains a second <mask>. With word-level inputs, a masked token usually spans more than one character, but in SMILES input each individual character behaves like a separate word, so we can only predict one masked character at a time. We're hoping to include support for multiple masked tokens as well as multi-character masked tokens soon!

seyonechithrananda commented 4 years ago

I'd also recommend asking about masking multiple tokens in HuggingFace's transformers repository (https://github.com/huggingface/transformers). This research is built on their infrastructure, and we mainly used MLM as a training procedure to try to learn the chemical syntax for different tasks. Let me know how it goes!

jannisborn commented 4 years ago

Many thanks for the rapid reply @seyonechithrananda. I agree it would be ideal to implement this in the huggingface backend. I opened an issue there.

As a user, it's easy to implement an auto-regressive approach, and for the moment that is what I will do. Thanks for the help.
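
For readers landing on this thread, here is a minimal sketch of what such an auto-regressive workaround could look like. It is not code from this repository or from the transformers library. The function name fill_masks_autoregressively and the predict_token callable are hypothetical: predict_token stands in for any function that takes a string containing exactly one <mask> and returns the predicted token as a string (for example, a thin wrapper around the fill_mask pipeline that returns its top-scoring token). One assumption to note: at each step the later masks are simply dropped from the right-hand context, so the model only ever sees a single mask.

```python
from typing import Callable

MASK = "<mask>"  # mask string used in this thread; adjust to your tokenizer


def fill_masks_autoregressively(
    text: str,
    predict_token: Callable[[str], str],  # hypothetical single-mask predictor
    mask: str = MASK,
) -> str:
    """Fill every mask in `text` from left to right, one mask per call."""
    parts = text.split(mask)
    filled = parts[0]
    for i in range(1, len(parts)):
        # Right-hand context with the remaining masks removed (an assumption;
        # truncating the input at the next mask would be another option).
        right_context = "".join(parts[i:])
        query = filled + mask + right_context  # contains exactly one mask
        filled += predict_token(query) + parts[i]
    return filled


# Toy stand-in predictor that always answers "N", only to show the control flow.
queries = []


def toy_predictor(query: str) -> str:
    queries.append(query)
    return "N"


result = fill_masks_autoregressively("CC<mask>CO<mask>C", toy_predictor)
print(result)
print(queries)
```

Because each prediction is inserted before the next mask is queried, later predictions are conditioned on earlier ones. This is greedy decoding, so it will not necessarily find the jointly most likely combination of tokens.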