Gaurav7888 opened this issue 1 year ago
Hello! It appears that when the model makes predictions on external input, the tokenization differs from the tokenization applied to the original training data. As a result, the model cannot predict the correct output when the input does not come from the dataset. Even when the exact same text appears in the dataset, passing it in as external input still yields a wrong prediction.
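A quick illustration of what I suspect is happening: two Keras Tokenizers fit on different corpora assign different integer ids to the same word. The corpora below are made-up stand-ins for the example:

```python
from keras.preprocessing.text import Tokenizer

# Hypothetical stand-ins: a small "training" corpus vs. the external input
train_texts = ["le chat mange", "le chien dort", "bonjour le monde"]
new_texts = ["Bonjour", "mon cheri"]

train_tok = Tokenizer()
train_tok.fit_on_texts(train_texts)

new_tok = Tokenizer()
new_tok.fit_on_texts(new_texts)

# The same word ends up with a different id in each vocabulary, so sequences
# produced by a freshly fit tokenizer are meaningless to the trained model
print(train_tok.word_index["bonjour"])  # e.g. 6 here
print(new_tok.word_index["bonjour"])    # e.g. 1 here
```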
Below is the code I used for external prediction; the preprocessing is the same as for the input dataset tmp_x. I still got a wrong prediction, despite the fact that "Bonjour" clearly appears multiple times in the training data.
text = ["Bonjour", "mon cheri"]
text[0]
preprocess_x, tk_x = tokenize(text)
preprocess_x[0]
tmp_text = pad(preprocess_x, preproc_french_sentences.shape[1])
tmp_text = tmp_text.reshape((-1, preproc_french_sentences.shape[-2]))
logits_to_text(loaded_model.predict(tmp_text[[1]])[0], english_tokenizer)
Yes, for that we can use a pretrained tokenizer from Hugging Face (e.g., one built for an LLM), which should give better results. I tried to build my own using the dataset.
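For example, a pretrained tokenizer's vocabulary is fixed, so the same text always maps to the same ids. A minimal sketch with the `transformers` library (note the resulting ids belong to that tokenizer's vocabulary, not the Keras model's, so the model would need to be trained on them):

```python
# Minimal sketch, assuming the `transformers` package is installed
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = tok(["Bonjour", "mon cheri"], padding=True)

# Ids come from the pretrained vocabulary, so they are stable across runs and inputs
print(enc["input_ids"])
```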
Hello, I hope your day is going great. This is my first contribution, so apologies for any mistakes.
You can try redefining `tokenize` and `pad` as:
```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def tokenize(sentences):
    # Fit a tokenizer on the given sentences and convert them to integer sequences
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    return tokenizer.texts_to_sequences(sentences), tokenizer

def pad(sequences, length):
    # Pad every sequence to the same fixed length (zeros appended at the end)
    return pad_sequences(sequences, maxlen=length, padding='post')
```
If this goes well, you can also try predicting on the whole padded batch:

```python
predictions = loaded_model.predict(tmp_text)
print("Predictions:", predictions)
```
I hope it serves you well
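One more note: the redefined `tokenize` above still fits a brand-new Tokenizer on whatever text it receives, so external input will still get ids that differ from the ones the model was trained on. A minimal sketch of reusing the tokenizer fitted during training instead (assuming it was kept around, here under the hypothetical name `french_tokenizer`):

```python
# Hypothetical: `french_tokenizer` is the Tokenizer fit on the training corpus;
# `pad`, `preproc_french_sentences`, and `logits_to_text` come from the notebook.
sequences = french_tokenizer.texts_to_sequences(["Bonjour", "mon cheri"])  # lowercased internally by default

tmp_text = pad(sequences, preproc_french_sentences.shape[1])
prediction = loaded_model.predict(tmp_text)
print(logits_to_text(prediction[0], english_tokenizer))
```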
https://colab.research.google.com/drive/14KegLD0ymq4vTRzCjUvP77w9l-IGCsnj?usp=sharing
@mlevans @tejasvicsr1