louaaron / Score-Entropy-Discrete-Diffusion

[ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834)
https://aaronlou.com/blog/2024/discrete-diffusion/
MIT License

Excellent results on protein sequences #12

Open dacarlin opened 3 weeks ago

dacarlin commented 3 weeks ago

I was super curious to try this out on my favorite discrete data: protein sequences! I created a simple dataset class following the existing code:

https://github.com/dacarlin/protein-sedd/blob/main/data.py#L119

I then ran some training experiments on an A100 with this test dataset. It's pretty cool: even leaving the GPT-2 tokenizer in place, within just a couple thousand steps the model is already producing new proteins that ESMFold predicts fold into the correct structure. In contrast, proteins produced by a GPT-like model at this stage of training are not well predicted.
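For anyone wanting to reproduce this kind of foldability check, a minimal sketch using the fair-esm package's ESMFold API looks roughly like the following. This is illustrative only, not the exact evaluation script I ran; the example sequence is an arbitrary placeholder, and the scoring you use (mean pLDDT, TM-score to a reference, etc.) may differ.

# Illustrative foldability check using ESMFold via the fair-esm package
# (pip install "fair-esm[esmfold]"); not the exact evaluation pipeline used here.
import esm
import torch

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()  # drop .cuda() to run on CPU (slow)

generated_sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder sequence
with torch.no_grad():
    pdb_string = model.infer_pdb(generated_sequence)

with open("generated.pdb", "w") as f:
    f.write(pdb_string)

# Per-residue pLDDT is written to the B-factor column of the PDB, so a mean
# pLDDT (or a TM-score against a reference structure) can be computed from it.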

However, unlike the 50,257-token vocabulary used by GPT-2, the "token space" for proteins is just the 20 amino acids plus a few special tokens (sequence start/end, padding, etc.). For example, here's the vocab for my protein tokenizer:

{
    "<s>": 0,
    "<pad>": 1,
    "</s>": 2,
    "<unk>": 3,
    "<mask>": 4,
    "A": 5,
    "C": 6,
    "D": 7,
    "E": 8,
    "F": 9,
    "G": 10,
    "H": 11,
    "I": 12,
    "K": 13,
    "L": 14,
    "M": 15,
    "N": 16,
    "P": 17,
    "Q": 18,
    "R": 19,
    "S": 20,
    "T": 21,
    "V": 22,
    "W": 23,
    "Y": 24
}

I'd be super curious to swap out the tokenizer and adjust the input and output embedding layers to handle this small vocabulary. Basically, I'd modify the model to be "character level", or equivalently use single amino-acid characters as the only tokens.
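As a concrete starting point, here is a minimal character-level tokenizer along those lines. The class name and interface are my own placeholders rather than anything in the SEDD repo; it just mirrors the vocabulary above.

# Minimal character-level protein tokenizer mirroring the vocabulary above.
# The class name and interface are placeholders, not part of the SEDD repo.
SPECIALS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")


class ProteinTokenizer:
    def __init__(self):
        self.vocab = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS)}
        self.inv_vocab = {i: tok for tok, i in self.vocab.items()}
        self.unk_id = self.vocab["<unk>"]

    def __len__(self):
        return len(self.vocab)  # 25

    def encode(self, sequence: str) -> list[int]:
        # One token per residue; unknown characters map to <unk>.
        return [self.vocab.get(ch, self.unk_id) for ch in sequence]

    def decode(self, ids: list[int]) -> str:
        # Drop special tokens when reconstructing the sequence.
        toks = [self.inv_vocab.get(i, "<unk>") for i in ids]
        return "".join(t for t in toks if t not in SPECIALS)


# e.g. ProteinTokenizer().encode("MKT") -> [15, 13, 21]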

Do you have any tips or pointers about what we'd need to do to achieve that? I tried providing a custom vocab to the existing GPT2TokenizerFast tokenizer class, but this resulted in empty samples after training (perhaps I failed to update another instance of the tokenizer elsewhere?). I also tried adjusting the number of tokens in the Hydra configs to 25 to match the new vocabulary, but this resulted in out-of-range (indexing) errors in the forward pass.
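One thing that might narrow this down is a quick sanity check on a batch before the forward pass. This is a generic sketch: the function name is mine, and the extra index reserved for the absorbing graph's mask state is my assumption about how the model sizes its vocabulary, not something I've confirmed in the code.

# Generic sanity check for the two failure modes above: stale GPT-2 token ids
# reaching the model, and an embedding table that was not resized to the new
# vocabulary. The +1 for the absorbing graph's mask state is an assumption.
import torch


def check_vocab_consistency(batch_ids: torch.Tensor,
                            embedding: torch.nn.Embedding,
                            expected_vocab: int = 25,
                            absorbing: bool = True) -> None:
    model_vocab = expected_vocab + (1 if absorbing else 0)
    max_id = int(batch_ids.max())
    assert max_id < model_vocab, (
        f"token id {max_id} >= {model_vocab}: an old tokenizer (e.g. GPT-2) "
        "is probably still producing ids somewhere in the pipeline"
    )
    assert embedding.num_embeddings >= model_vocab, (
        f"embedding table has {embedding.num_embeddings} rows, "
        f"but at least {model_vocab} are needed"
    )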

I figure you may have already thought about this, especially since you mention protein sequences in your lecture, which is a great on-ramp to this work. Could you point me towards the steps I'd need to take to adapt the model to the protein sequence vocabulary? Based on what I've seen so far, I think this method is very likely to outperform autoregressive models for proteins, and I have some experiments in mind to show that result. Briefly, I'd like to train on this HuggingFace dataset as a start and show results on a few evals I have developed.

Huge thank you, Aaron, for your work in this space and for providing this super modular and scalable repo with the code implementation! I'm super interested to hear your thoughts on what minor changes we'd need to adapt this work to protein sequences.

dacarlin commented 2 weeks ago

In case anyone is interested, I was able to solve this problem and try the model on protein sequences. I wrote up my implementation and preliminary results here: https://alexcarlin.bearblog.dev/score-entropy-discrete-diffusion-models-for-protein-design and posted my changes in a fork here: https://github.com/dacarlin/protein-sedd.