monum / 311-translation

MIT License

French to English translation task notebook #12

Open Gaurav7888 opened 1 year ago

Gaurav7888 commented 1 year ago

https://colab.research.google.com/drive/14KegLD0ymq4vTRzCjUvP77w9l-IGCsnj?usp=sharing

@mlevans @tejasvicsr1

ShubhamBhut commented 1 year ago

Hello! It appears that when the model is used to make predictions on external input, the tokenization may differ from what was applied to the original input data. As a result, the model cannot predict the correct output for input that does not come from the dataset. Even when the external input is text that also appears in the dataset, the prediction is still wrong.

Below is the code I used for the external prediction. The preprocessing is the same as for the input dataset tmp_x. The prediction is wrong even though "bonjour" clearly appears multiple times in the training data.

text = ["Bonjour", "mon cheri"]
text[0]
preprocess_x, tk_x = tokenize(text)
preprocess_x[0]
tmp_text = pad(preprocess_x, preproc_french_sentences.shape[1])
tmp_text = tmp_text.reshape((-1, preproc_french_sentences.shape[-2])) 

logits_to_text(loaded_model.predict(tmp_text[[1]])[0], english_tokenizer)
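
A possible explanation, sketched with a pure-Python stand-in rather than the notebook's code: if `tokenize` fits a new tokenizer on every call (as the definition posted further down this thread does), the external text gets its own word indices, which need not match the ones the model was trained on. `TinyTokenizer` and the toy sentences are hypothetical; the class only imitates frequency-ordered word indexing and is not the Keras implementation.

```python
from collections import Counter


class TinyTokenizer:
    """Hypothetical stand-in for a word-level tokenizer (not the Keras class)."""

    def __init__(self):
        self.word_index = {}

    def fit_on_texts(self, texts):
        counts = Counter()
        for text in texts:
            counts.update(text.lower().split())
        # most_common() sorts by count, descending; ties keep first-seen order.
        ranked = counts.most_common()
        self.word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

    def texts_to_sequences(self, texts):
        # Words missing from the vocabulary are silently dropped.
        return [
            [self.word_index[w] for w in text.lower().split() if w in self.word_index]
            for text in texts
        ]


# Toy "training" corpus (assumption, not the real dataset).
train_sentences = ["merci mon ami", "bonjour mon ami", "merci"]
train_tokenizer = TinyTokenizer()
train_tokenizer.fit_on_texts(train_sentences)

external = ["Bonjour", "mon cheri"]

# What happens when the external text is tokenized with a freshly fitted tokenizer.
fresh_tokenizer = TinyTokenizer()
fresh_tokenizer.fit_on_texts(external)
refit_ids = fresh_tokenizer.texts_to_sequences(external)

# What happens when the tokenizer fitted on the training data is reused.
reused_ids = train_tokenizer.texts_to_sequences(external)

print(train_tokenizer.word_index)  # {'merci': 1, 'mon': 2, 'ami': 3, 'bonjour': 4}
print(refit_ids)                   # [[1], [2, 3]]
print(reused_ids)                  # [[4], [2]]
```

In this toy case the freshly fitted tokenizer maps "Bonjour" to id 1, which meant "merci" during training, so a model would decode the wrong word even though "bonjour" is in the training data.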

Gaurav7888 commented 1 year ago

Yes. For that we could use a tokenizer that ships with an LLM from Hugging Face, which should give better results. I tried to build my own from the dataset.

Tirth678 commented 2 months ago

Hello, hope your day is going great. This is my first contribution, so apologies for any mistakes.

  1. You can try redefining 'tokenize' and 'pad' as

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    def tokenize(sentences):
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(sentences)
        return tokenizer.texts_to_sequences(sentences), tokenizer

    def pad(sequences, length):
        return pad_sequences(sequences, maxlen=length, padding='post')

If this goes well, you can also try running the prediction on your reshaped input as

    predictions = loaded_model.predict(tmp_text)
    print("Predictions:", predictions)
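
One caveat, offered as a sketch under assumptions: the 'tokenize' above fits a new tokenizer on whatever it is given, so for external input it may help to encode with the vocabulary learned from the training data instead. The helper `encode_for_inference` and the toy `word_index` below are hypothetical. With Keras, the fitted tokenizer exposes its vocabulary as `tokenizer.word_index`, but this pure-Python version only approximates what `texts_to_sequences` and `pad_sequences` do (whitespace splitting, no punctuation filtering).

```python
def encode_for_inference(texts, word_index, length):
    """Map words to training-time ids and post-pad each row with zeros.

    Hypothetical helper: unknown words are dropped, and rows longer than
    `length` are cut at the end (Keras pad_sequences truncates from the
    start by default, so this is a simplification).
    """
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[:length]
        rows.append(ids + [0] * (length - len(ids)))
    return rows


# Toy vocabulary standing in for the one learned from the French training set.
word_index = {"merci": 1, "mon": 2, "ami": 3, "bonjour": 4}

encoded = encode_for_inference(["Bonjour", "mon cheri"], word_index, length=5)
print(encoded)  # [[4, 0, 0, 0, 0], [2, 0, 0, 0, 0]]
```

Here "cheri" is not in the toy vocabulary, so it is dropped. Any word the training data never contained will be invisible to the model unless the tokenizer was fitted with an out-of-vocabulary token.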

I hope it serves you well