nadavbra / protein_bert


Graph execution error #83

Closed BenjyNStrauss closed 8 months ago

BenjyNStrauss commented 8 months ago

In trying to train ProteinBert, I've received the following error:

indices[31,1] = -127 is not in [0, 290)
     [[{{node model/embedding-seq-input/embedding_lookup}}]] [Op:__inference_train_function_30022]

I looked it up, and apparently it's a known error that sometimes occurs with TensorFlow. However, I am unsure how to apply the fix from Stack Overflow [https://stackoverflow.com/questions/65514944/tensorflow-embeddings-invalidargumenterror-indices18-16-11905-is-not-in-0] in this situation.

The full text of the error is attached below: error.txt

I made two modifications to proteinbert, listed below. (I do not believe either of them is the source of the error, but I am including them to be safe.)

(1) I modified the following line from iteritems() to items() to prevent an error:

log('Epoch sequence length distribution (for seq_len = %d): %s' % (self.seq_len, \
                    ', '.join('%s: %s' % item for item in pd.Series(seq_lengths).describe().iteritems())))
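
For reference, the line after that change (the same code, with Series.iteritems(), which newer pandas versions deprecated and then removed, swapped for .items()) would presumably read:

    log('Epoch sequence length distribution (for seq_len = %d): %s' % (self.seq_len, \
                        ', '.join('%s: %s' % item for item in pd.Series(seq_lengths).describe().items())))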

(2) I re-did the tokenizer to account for secondary structure, like so (I had to change the extension to ".txt" to upload it): tokenization.txt

nadavbra commented 8 months ago

Could you clarify what you are trying to achieve? Are you trying to re-train the entire model on a different type of sequences?

BenjyNStrauss commented 8 months ago

The tokens of the original ProteinBERT only account for the primary structure. What I've done is slightly modify the tokens to: (1) allow the model to identify certain other types of amino acids (Pyrrolysine, Selenomethionine, Ornithine) and to distinguish non-amino-acid ligands from amino acids, and (2) add a number after each amino acid letter, encoding its secondary (local/2D) structure, so the model can base its predictions on secondary structure in addition to primary structure.

The sequences I fed in are in the form [primary-char][secondary-char], followed by a comma and a space.

I don't think these modifications triggered the error I was seeing above, but I included everything in case I'm wrong. It seemed to me that each token is basically like all the others, so I left the single-character tokens in the tokenizer in case someone wants to just use primary sequences. A sample input sequence might be: M9, V9, L9, S9, E1, G1, E1, W1, Q1, L1, V1, L1, H1, V1, W1, A1, K1, V1, E2, A2, D2, V1, A1, G1, H1, G1, Q1, D1, I1, L1, I1, R1, L1, F1, K1, S1, H9, P2, E2, T2, L2, E2, K2, F9, D8, R8, F8, K8, H8, L9, K9, T7, E1, A1, E1, M1, K1, A1, S9, E1, D1, L1, K1, K1, A1, G1, V1, T1, V1, L1, T1, A1, L1, G1, A1, I1, L1, K8, K8, K8, G8, H9, H9, E1, A1, E1, L1, K1, P1, L1, A1, Q1, S3, H3, A3, T3, K3, H7, K9, I9, P9, I1, K1, Y1, L1, E1, F1, I1, S1, E1, A1, I1, I1, H1, V1, L1, H1, S1, R1, H9, P8, G8, N8, F9, G9, A1, D1, A1, Q1, G1, A1, M1, N1, K1, A1, L1, E1, L1, F1, R1, K1, D1, I1, A1, A1, K1, Y1, K1, E1, L1, G8, Y9, Q9, G9

Example: L1 = Leucine, part of an alpha helix; G9 = Glycine, random coil.
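
For illustration, a minimal sketch of how composite tokens like these could be enumerated. This is not the actual tokenization.txt; the extra symbols for Pyrrolysine, Selenomethionine, Ornithine and non-amino-acid ligands, and the exact digit range, are placeholder assumptions:

    from itertools import product

    STANDARD_AMINO_ACIDS = list('ACDEFGHIKLMNPQRSTVWY')
    # Placeholder symbols (assumptions) for Pyrrolysine, Selenomethionine, Ornithine
    # and a generic non-amino-acid ligand marker:
    EXTRA_SYMBOLS = ['O', 'U', 'Z', 'X']
    # Secondary-structure digits, e.g. 1 = alpha helix, 9 = random coil (per the example above):
    SECONDARY_STRUCTURE_DIGITS = [str(d) for d in range(1, 10)]

    # Composite tokens such as 'L1' or 'G9', plus the original single-character tokens
    # so that plain primary sequences still tokenize.
    primary_tokens = STANDARD_AMINO_ACIDS + EXTRA_SYMBOLS
    composite_tokens = [aa + ss for aa, ss in product(primary_tokens, SECONDARY_STRUCTURE_DIGITS)]
    all_tokens = primary_tokens + composite_tokens

    token_to_id = {token: i for i, token in enumerate(all_tokens)}
    print(len(token_to_id))  # 24 + 24 * 9 = 240 tokens in this sketch, i.e. well above 127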

Does this make sense? If not, I could try to explain another way. Thanks so much for your continued help.

nadavbra commented 8 months ago

Does the code produce errors when you train it on the original dataset (unmodified amino acids), or do errors only start once you introduce the new types of tokens?

BenjyNStrauss commented 8 months ago

Much to my surprise, it did run properly with the original tokenizer. (Given the nature of the error, I don't understand how it makes a difference)

It was on the line: self.model.fit(X, Y, sample_weight = sample_weights, batch_size = episode.batch_size, callbacks = self.fit_callbacks)

Edit: "https://github.com/tensorflow/tensorflow/issues/23698" (link doesn't work, have to copy+paste) says that it can be fixed by changing the vocabulary size of the model – but I'm not sure where it is set.

nadavbra commented 8 months ago

Your project is beyond the scope of what ProteinBERT was originally designed for. I'd try to seek help from the tensorflow community.

BenjyNStrauss commented 8 months ago

Is asking where you set the model vocabulary size beyond the scope?

Edit: Actually, it was an issue of "int8" vs "int16": the extended vocabulary has more than 127 tokens, so token IDs stored as int8 wrap around into negative values (hence the -127 in the error), which makes the embedding lookup fail.
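
A quick illustration of that wrap-around in plain NumPy (not ProteinBERT code), just to show the effect:

    import numpy as np

    ids = np.array([5, 129, 250])
    print(ids.astype(np.int8))    # [   5 -127   -6]  <- IDs above 127 wrap to negative values
    print(ids.astype(np.int16))   # [  5 129 250]     <- all IDs stay in [0, 290)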