nadavbra / protein_bert


Use ProteinBERT with Own Dataset #77

Closed boncul closed 7 months ago

boncul commented 8 months ago

At first, I was going to train a BERT model from scratch on the animal transcription factor proteins I have, but I was not successful. While doing research, I came across your ProteinBERT study and article. I had previously trained a Gensim Word2Vec model on a dataset of plant transcription factor proteins and used its embeddings as follows:

# ... assume a Gensim Word2Vec model has already been trained; `wv` is its KeyedVectors ...

from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential

vocab_size, embedding_size = wv.vectors.shape

model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embedding_size,
                    weights=[wv.vectors],      # initialize with the Word2Vec vectors
                    input_length=max_length,   # max_length = padded sequence length (in words)
                    mask_zero=False,
                    trainable=True))

I want to get the embeddings of protein sequences by running your pre-trained ProteinBERT model with my own data (not from scratch, of course) and use them in a structure similar to the code above. Previously, I split my protein sequences into k-mers (e.g. 'MSSRRSSRS', 'RQSGSSRIS' --> [['MSS', 'RRS', 'SRS'], ['RQS', 'GSS', 'RIS']]) and then did the training and classification on those k-mers; a small example of this split is shown below. This is my plan for ProteinBERT, but I will follow whatever structure and approach you think is most appropriate.
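To make the split concrete, here is a minimal sketch of what I mean (non-overlapping k-mers with k = 3; any incomplete trailing k-mer is dropped):

def split_into_kmers(seq, k=3):
    """Split a sequence into consecutive, non-overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

print([split_into_kmers(s) for s in ['MSSRRSSRS', 'RQSGSSRIS']])
# [['MSS', 'RRS', 'SRS'], ['RQS', 'GSS', 'RIS']]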

Thank you very much in advance.

Kind regards.

nadavbra commented 8 months ago

@boncul I'm not sure I understand what you're trying to achieve exactly. Can you describe in a few sentences your high level goals for this project? What kind of model are you trying to train (i.e. what's the input and what's the desired output)?

boncul commented 8 months ago

As you asked, my goal is to obtain the vocabulary and the vectors (i.e. numerical tensor values) that result from running ProteinBERT on my own dataset. I will use these values to encode my sequences numerically and thus, hopefully, develop a more successful Bidirectional GRU model.

I have sequences and labels in my data, e.g. from a file like:

family    sequence
bHLH      MSSRRSSRS...
C3H       RQSGSSRIS...

I want to divide these protein sequences into words (k-mers) and obtain a vector for each word with BERT (for example, 300 dimensions), just as a model like BERT turns the words of a sentence into tensors. The split sequences would look like:

[['MSS', 'RRS', 'SRS', ...], ['RQS', 'GSS', 'RIS', ...]]

After dividing the sequences into words, I want to get the embedding of each word by training ProteinBERT on these sequences. I also want to build a vocabulary from them, so that each word maps to an integer ID (and, through the embedding matrix, to its tensor). For example:

[[124, 1255, 650, ...], [5, 77, 4587, ...]]
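To make this step concrete, here is a small sketch of what I mean by building a k-mer vocabulary and mapping each sequence to integer IDs (the vectors themselves would come from ProteinBERT; the IDs below are just a toy illustration):

kmer_seqs = [['MSS', 'RRS', 'SRS'], ['RQS', 'GSS', 'RIS']]

# Build a vocabulary: every distinct k-mer gets an integer ID (0 reserved for padding).
unique_kmers = sorted({km for seq in kmer_seqs for km in seq})
vocab = {km: idx for idx, km in enumerate(unique_kmers, start=1)}

# Encode each sequence of k-mers as a sequence of IDs.
encoded = [[vocab[km] for km in seq] for seq in kmer_seqs]
print(encoded)   # [[2, 5, 6], [4, 1, 3]] with this toy vocabulary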

I will use these obtained sequences in a Bidirectional GRU model, which I present as an example below:

# vocab_size: the number of words in the vocabulary I generate with ProteinBERT
# embedding_size: the ProteinBERT vector (tensor) size (e.g. 300)

from tensorflow.keras.layers import Embedding, Bidirectional, GRU, Flatten, Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embedding_size,
                    weights=[embedding_matrix],   # placeholder for the ProteinBERT tensors for all data,
                                                  # shape (vocab_size, embedding_size)
                    input_length=max_length,
                    mask_zero=False,
                    trainable=True))
model.add(Bidirectional(GRU(256)))    # or LSTM, or CNN, etc.
model.add(Flatten())
model.add(Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# train / val are the integer-encoded sequences, y_train / y_val the one-hot labels
history = model.fit(
    train, y_train,
    epochs=100, batch_size=256,
    validation_data=(val, y_val)
)

I understand that ProteinBERT treats protein sequences as a whole and that each amino acid can have different tensor values at different positions. However, I am working on a new idea and want to try dividing the proteins into words (k-mers), as in the example above, in order to capture the relationships between amino acids and groups of amino acids (for example, the groups of three in the sequences above). I am sorry that I could not explain myself clearly at first and kept you busy. Thank you for your interest and support, and please correct me if I say anything technically incorrect or unreasonable about ProteinBERT. The following article we prepared earlier may give you an idea of what I want to do: https://dx.doi.org/10.1007/s11760-022-02419-5

nadavbra commented 8 months ago

@boncul You can use ProteinBERT to get local and global vector representations for each amino-acid sequence, including fixed-length kmers. However, by using it on fixed-length kmers you will be missing one of the main benefits of language models, which is their capacity to work with full-length sequences of varying lengths. If you include the kmers as part of the full protein sequences, the same kmer can have different embeddings depending on the context of the rest of the protein sequence (which I think is a nice feature). Therefore, if it makes sense for your application, I'd consider providing the full protein sequences to get the embeddings for the full-length sequences, and only then slice them for the positions you're interested in (which could be fixed-length kmers).
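For reference, here is a minimal sketch of that approach, based on the embedding-extraction example in the ProteinBERT README. The `+ 2` for the added start/end tokens, the position offset used when slicing, and the mean-pooling of each k-mer window are assumptions to verify against your installed version and your application:

from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

seqs = ['MSSRRSSRS', 'RQSGSSRIS']            # full protein sequences (toy examples)
seq_len = max(len(s) for s in seqs) + 2      # + 2 for the added start/end tokens (assumption)

# Load the pretrained model and wrap it so the hidden-layer representations are exposed.
pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))

encoded_x = input_encoder.encode_X(seqs, seq_len)
local_representations, global_representations = model.predict(encoded_x, batch_size=2)
# local_representations:  (n_seqs, seq_len, dim)  -- one contextual vector per position
# global_representations: (n_seqs, global_dim)    -- one vector per whole sequence

# Slice the per-position vectors into k-mer embeddings by mean-pooling each k-mer window,
# assuming position 0 holds the start token so residue i sits at index i + 1.
k = 3
kmer_embeddings = []
for i, seq in enumerate(seqs):
    per_residue = local_representations[i, 1:len(seq) + 1]
    kmer_embeddings.append([per_residue[j:j + k].mean(axis=0)
                            for j in range(0, len(seq) - k + 1, k)])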