Closed BenjyNStrauss closed 8 months ago
Could you clarify what you are trying to achieve? Are you trying to re-train the entire model on a different type of sequence?
The tokens of the original ProteinBERT only account for the primary structure. What I've done is slightly modify the tokens to: (1) allow the model to identify certain other types of amino acids (Pyrrolysine, Selenomethionine, Ornithine) and to distinguish non-amino-acid ligands from amino acids, and (2) add a number after each amino acid letter so the model can base its predictions on secondary structure (local/2D structure) in addition to primary structure.
The sequences I fed in consist of tokens of the form [primary-char][secondary-char], each followed by a comma and a space.
I don't think these modifications triggered the error I was seeing above, but I included everything in case I'm wrong. It seemed to me that each token is basically like all the others, so I left the single-character tokens in the tokenizer in case someone wants to just use primary sequences. A sample input sequence might be:
M9, V9, L9, S9, E1, G1, E1, W1, Q1, L1, V1, L1, H1, V1, W1, A1, K1, V1, E2, A2, D2, V1, A1, G1, H1, G1, Q1, D1, I1, L1, I1, R1, L1, F1, K1, S1, H9, P2, E2, T2, L2, E2, K2, F9, D8, R8, F8, K8, H8, L9, K9, T7, E1, A1, E1, M1, K1, A1, S9, E1, D1, L1, K1, K1, A1, G1, V1, T1, V1, L1, T1, A1, L1, G1, A1, I1, L1, K8, K8, K8, G8, H9, H9, E1, A1, E1, L1, K1, P1, L1, A1, Q1, S3, H3, A3, T3, K3, H7, K9, I9, P9, I1, K1, Y1, L1, E1, F1, I1, S1, E1, A1, I1, I1, H1, V1, L1, H1, S1, R1, H9, P8, G8, N8, F9, G9, A1, D1, A1, Q1, G1, A1, M1, N1, K1, A1, L1, E1, L1, F1, R1, K1, D1, I1, A1, A1, K1, Y1, K1, E1, L1, G8, Y9, Q9, G9
Examples: L1 = Leucine, part of an alpha helix; G9 = Glycine, random coil.
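To make the format concrete, here is a rough sketch of how the combined tokens can be split back apart (purely illustrative; the function and the short example string are made up and not part of ProteinBERT):

```python
def parse_combined_sequence(seq_str):
    """Split a comma-separated string like 'M9, V9, L1' into
    (residue, secondary-structure code) pairs."""
    pairs = []
    for token in seq_str.split(','):
        token = token.strip()
        if len(token) != 2:
            raise ValueError(f"Unexpected token: {token!r}")
        pairs.append((token[0], token[1]))
    return pairs

print(parse_combined_sequence("M9, V9, L1, G9"))
# [('M', '9'), ('V', '9'), ('L', '1'), ('G', '9')]
```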
Does this make sense? If not, I could try to explain another way. Thanks so much for your continued help.
Does the code produce errors when you train it on the original dataset (unmodified amino acids), or do errors only start once you introduce the new types of tokens?
Much to my surprise, it did run properly with the original tokenizer. (Given the nature of the error, I don't understand how that makes a difference.)
It was on the line:
self.model.fit(X, Y, sample_weight = sample_weights, batch_size = episode.batch_size, callbacks = self.fit_callbacks)
Edit: "https://github.com/tensorflow/tensorflow/issues/23698" (link doesn't work, have to copy+paste) says that it can be fixed by changing the vocabulary size of the model – but I'm not sure where it is set.
Your project is beyond the scope of what ProteinBERT was originally designed for. I'd try to seek help from the tensorflow community.
Is asking where you set the model vocabulary size beyond the scope?
Edit: Actually, it was an issue of "int8" vs "int16".
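For anyone who hits the same thing: if token IDs are stored in a dtype too small to hold them, the large IDs silently wrap around to negative values, which then show up as out-of-range embedding indices. A quick illustration with made-up IDs:

```python
import numpy as np

token_ids = np.array([200, 300])   # IDs valid for an enlarged vocabulary
print(token_ids.astype(np.int8))   # wraps around: [-56  44]
print(token_ids.astype(np.int16))  # preserved:    [200 300]
```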
In trying to train ProteinBERT, I've received the following error:
I looked it up and apparently it's a known error that sometimes occurs with TensorFlow. However, I am unsure how to apply the fix from Stack Overflow [https://stackoverflow.com/questions/65514944/tensorflow-embeddings-invalidargumenterror-indices18-16-11905-is-not-in-0] in this situation.
The full text of the error is attached below: error.txt
I made some modifications to ProteinBERT, which are the following (I do not believe either of these modifications to be the source of the error, but I am including them to be safe):
(1) I modified the following line from iteritems() to items() to prevent an error.
(2) I re-did the tokenizer to account for secondary structure, like so (had to change the extension to ".txt" to upload it): tokenization.txt
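For context on (1): dict.iteritems() exists only in Python 2; Python 3 removed it, and dict.items() is the direct replacement. A minimal illustration (the dictionary is made up):

```python
token_to_id = {'M': 1, 'V': 2}

# Python 2 only -- raises AttributeError on Python 3:
# for token, idx in token_to_id.iteritems(): ...

# Python 3 replacement, i.e. the change described in (1):
for token, idx in token_to_id.items():
    print(token, idx)
```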