Closed jannisborn closed 4 years ago
Hi @jannisborn, thanks for raising this issue! The MLM-based approach HuggingFace uses for RoBERTa predicts only a single masked token per sequence. With natural-language input a masked token usually spans more than one character, but since each individual character of a SMILES string acts like a separate word, we can only predict one masked token at a time. We're hoping to add support for multiple masked tokens, as well as multi-character masked tokens, soon!
I'd also recommend asking about masking multiple tokens in HuggingFace's transformers repository (https://github.com/huggingface/transformers). The backend of this work runs on their infrastructure; we mainly used MLM as a training procedure to learn chemical syntax for different tasks. Let me know how it goes!
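For what it's worth, the one-mask limit appears to sit in the pipeline's post-processing rather than in the model itself: a masked-LM forward pass scores every position, so you can read off a prediction at each `<mask>` by hand. A minimal sketch of that readout step, using plain lists in place of the logit tensors a real forward pass would return (the helper name `predictions_at_masks` is ours, not part of transformers):

```python
def predictions_at_masks(input_ids, logits, mask_token_id):
    """Return {position: argmax token id} for every masked position.

    input_ids: token ids of the input sequence
    logits: one list of vocabulary scores per position, as a masked-LM
            forward pass would produce (plain lists here, not tensors)
    mask_token_id: id of the <mask> token in the vocabulary
    """
    return {
        pos: max(range(len(scores)), key=scores.__getitem__)
        for pos, (tok, scores) in enumerate(zip(input_ids, logits))
        if tok == mask_token_id
    }
```

With a real checkpoint, the scores would come from a forward pass of a masked-LM model (e.g. via `AutoModelForMaskedLM.from_pretrained(...)`), and the returned ids would be decoded back to strings through the tokenizer.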
Many thanks for the rapid reply @seyonechithrananda. I agree it would be ideal to implement this in the huggingface backend. I opened an issue there.
An auto-regressive approach (filling one mask at a time and feeding the partially filled sequence back in) is easy to implement as a user, so that is what I will do for the moment. Thanks for the help.
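The auto-regressive workaround mentioned above can be sketched as follows. `fill_fn` is a stand-in (our name, not an existing API) for any single-mask predictor, e.g. the top prediction a fill-mask pipeline returns for the first `<mask>`:

```python
def fill_masks_autoregressively(sequence, fill_fn, mask_token="<mask>"):
    """Fill the masks one at a time, left to right, feeding each
    partially filled sequence back into the single-mask predictor."""
    while mask_token in sequence:
        prediction = fill_fn(sequence)  # predicted string for the first mask
        # Replace only the first occurrence, then repeat.
        sequence = sequence.replace(mask_token, prediction, 1)
    return sequence
```

For example, with a dummy predictor that always returns `'C'`, `fill_masks_autoregressively('CC<mask>CO<mask>C', lambda s: 'C')` yields `'CCCCOCC'`. Note that later masks are filled conditioned on earlier predictions, so the result can differ from jointly predicting all masks at once.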
Hi, is it possible to mask multiple tokens at a time?
E.g.
fill_mask('CCCO<mask>C')
works fine. But writing fill_mask('CC<mask>CO<mask>C')
I obtain an error. Am I doing something wrong, or is this feature not supported, @seyonechithrananda? Many thanks for a reply!