songlab-cal / tape

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.
https://www.biorxiv.org/content/10.1101/676825v1
BSD 3-Clause "New" or "Revised" License

attention masks tokenizer #126

Open Ch-rode opened 2 years ago

Ch-rode commented 2 years ago

Hello! I'm trying to implement bert-base, but it's not clear to me how the attention masks are generated with the TAPETokenizer. This is my code:

model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')

def preprocessing_for_tape(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode(
            sent,  # Preprocess sentence
            #add_special_tokens=True,        # Add `[CLS]` and `[SEP]`
            #max_length=MAX_LEN,                  # Max length to truncate/pad
            #pad_to_max_length=True,         # Pad sentence to max length
            #return_tensors='pt',           # Return PyTorch tensor
            #return_attention_mask=True,    # Return attention mask
            #truncation=True
            )

        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks

sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
token_ids

tensor([[ 2, 11,  7, 23, 25,  9,  8, 21,  7, 15, 13, 11, 16, 11,  5, 13, 15, 15,
         17, 11,  7, 25, 13, 11, 22, 11, 22, 15, 25,  5,  5, 11,  5, 15, 13, 23,
         20,  3]])

But my output (for example) contains only token ids: there is no attention mask and no way to set max_length or padding. How does this work? Thanks
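In the meantime I am padding the sequences myself and building the mask by hand, roughly like below. I am not sure this is the intended way: I am assuming the tokenizer exposes convert_token_to_id for the '<pad>' token and that the model's forward accepts the mask as input_mask.

import numpy as np
import torch
from tape import ProteinBertModel, TAPETokenizer

tokenizer = TAPETokenizer(vocab='iupac')
model = ProteinBertModel.from_pretrained('bert-base')

def pad_and_mask(sequences):
    """Encode sequences, pad to the batch max length, and build a 0/1 attention mask."""
    encoded = [tokenizer.encode(s) for s in sequences]       # each is an array of token ids
    max_len = max(len(ids) for ids in encoded)
    pad_id = tokenizer.convert_token_to_id('<pad>')          # assuming this helper exists
    input_ids = np.full((len(encoded), max_len), pad_id, dtype=np.int64)
    attention_masks = np.zeros((len(encoded), max_len), dtype=np.int64)
    for i, ids in enumerate(encoded):
        input_ids[i, :len(ids)] = ids
        attention_masks[i, :len(ids)] = 1                    # 1 = real token, 0 = padding
    return torch.from_numpy(input_ids), torch.from_numpy(attention_masks)

input_ids, input_mask = pad_and_mask([
    'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ',
    'MKTAYIAKQR',
])
output = model(input_ids, input_mask=input_mask)             # assuming forward takes input_mask
sequence_output, pooled_output = output[0], output[1]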

rmrao commented 2 years ago

Hi! Do you specifically want to re-implement bert-base, or just a transformer? I have code to train a version of ESM-1b here. This code scales better and will also result in better performance.

In that repo, the data processing is done in these lines. The masking code is then implemented in this class.

I have a bunch of utilities implemented in github.com/rmrao/evo, if it's helpful.

If you specifically want the masking code from TAPE, it's implemented here.
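Schematically it follows the standard BERT recipe; here is a rough sketch (not the actual TAPE code, and details may differ). mask_id and vocab_size are passed in just to keep the example self-contained.

import numpy as np

def bert_style_mask(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Rough sketch of BERT-style masking for masked language modeling."""
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -1)                  # -1 = position ignored by the LM loss
    for i in range(1, len(token_ids) - 1):                # skip the start/stop tokens
        if np.random.rand() >= mask_prob:
            continue                                      # ~85% of positions stay untouched
        labels[i] = token_ids[i]                          # predict the original token here
        roll = np.random.rand()
        if roll < 0.8:
            token_ids[i] = mask_id                        # 80%: replace with the mask token
        elif roll < 0.9:
            token_ids[i] = np.random.randint(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return token_ids, labels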

Hope this helps!

Ch-rode commented 2 years ago

Hello! Thanks for the information. I would like to re-implement bert-base for a sequence classification task.
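Concretely, my plan is to put a small classification head on top of the pretrained encoder, something like the sketch below. I am assuming the encoder returns (sequence_output, pooled_output) and that its forward accepts the attention mask as input_mask.

import torch.nn as nn
from tape import ProteinBertModel

class ProteinBertClassifier(nn.Module):
    """Hypothetical sequence-level classifier on top of the pretrained TAPE encoder."""

    def __init__(self, num_labels, dropout=0.1):
        super().__init__()
        self.bert = ProteinBertModel.from_pretrained('bert-base')
        hidden_size = self.bert.config.hidden_size        # 768 for bert-base
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, input_mask=None):
        output = self.bert(input_ids, input_mask=input_mask)
        pooled_output = output[1]                         # assuming index 1 is the pooled output
        return self.classifier(self.dropout(pooled_output))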