This repository collects my experimental mini projects: explorations of multimodal generative AI, quantum computing lab projects built with Qiskit, and decentralized exchanges built as part of my blockchain learning.
The Caption Generator project is designed to generate descriptive captions for images using deep learning models. This involves several steps, including image preprocessing, feature extraction using convolutional neural networks (CNNs), and sequence prediction using recurrent neural networks (RNNs) or transformers.
The notebook starts by importing necessary libraries, such as TensorFlow/Keras for building and training models, NumPy for numerical operations, and Matplotlib for visualizations.
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense, RepeatVector, add
from tensorflow.keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt
import pickle
import os
The notebook includes code to load the image dataset and preprocess it. This involves resizing images, normalizing pixel values, and converting them into arrays suitable for model input.
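def preprocess_image(img_path):
    # InceptionV3 expects 299x299 RGB input
    img = image.load_img(img_path, target_size=(299, 299))
    x = image.img_to_array(img)
    # Add a batch dimension: (299, 299, 3) -> (1, 299, 299, 3)
    x = np.expand_dims(x, axis=0)
    # Scale pixel values to the range InceptionV3 was trained on
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return x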
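For reference, `preprocess_input` for InceptionV3 rescales pixel values from [0, 255] to [-1, 1]. The sketch below reproduces that arithmetic in plain Python on a few sample values. `scale_pixel` is a hypothetical helper used only for illustration; it is not part of Keras or of the notebook.

```python
def scale_pixel(value):
    """Map a pixel intensity in [0, 255] to [-1, 1] using x / 127.5 - 1."""
    return value / 127.5 - 1.0

# Darkest, middle, and brightest intensities
samples = [0, 127.5, 255]
scaled = [scale_pixel(v) for v in samples]
print(scaled)  # [-1.0, 0.0, 1.0]
```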
A pre-trained InceptionV3 model is used to extract features from images. The final classification layer is dropped, so the model outputs the 2048-dimensional vector from the last pooling layer instead of ImageNet class probabilities.
inception_model = InceptionV3(weights='imagenet')
# Use the second-to-last layer (global average pooling) as the new output
model_new = Model(inception_model.input, inception_model.layers[-2].output)
def encode_image(img_path):
    x = preprocess_image(img_path)
    fea_vec = model_new.predict(x)
    # Flatten (1, 2048) -> (2048,)
    fea_vec = np.reshape(fea_vec, fea_vec.shape[1])
    return fea_vec
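Running InceptionV3 over every image is slow, so it usually pays to encode the dataset once and reuse the result. The sketch below shows one way to do that with `pickle`, which the notebook already imports. `encode_directory`, its parameters, and the `.jpg` filter are assumptions made for illustration; the encoder is passed in as `encode_fn` and is expected to behave like `encode_image`.

```python
import os
import pickle

def encode_directory(image_dir, encode_fn, cache_path):
    """Encode every .jpg in image_dir once and cache the result on disk.

    encode_fn is assumed to take an image path and return a feature vector.
    """
    # Reuse the cached features if an earlier run already produced them
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    features = {}
    for name in sorted(os.listdir(image_dir)):
        if name.lower().endswith(".jpg"):
            # Key each vector by the file name without its extension
            image_id = os.path.splitext(name)[0]
            features[image_id] = encode_fn(os.path.join(image_dir, name))

    with open(cache_path, "wb") as f:
        pickle.dump(features, f)
    return features
```

A call might look like `features = encode_directory('images', encode_image, 'features.pkl')`, where both paths are placeholders. Delete the cache file whenever the image set or the encoder changes, since the function does not detect stale caches.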
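The project uses an RNN (an LSTM here) to generate captions. The model projects the image features through a dense layer and repeats them across time steps, embeds the partial caption and passes it through an LSTM, adds the two streams, and feeds the sum through a second LSTM and a softmax layer that predicts the next word in the sequence.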
def build_model(vocab_size, max_length):
    # Image branch: 2048-d feature vector -> 256-d, repeated for every time step
    inputs1 = tf.keras.Input(shape=(2048,))
    fe1 = Dense(256, activation='relu')(inputs1)
    fe2 = RepeatVector(max_length)(fe1)
    # Text branch: padded word indices -> embeddings -> LSTM outputs per step
    inputs2 = tf.keras.Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = LSTM(256, return_sequences=True)(se1)
    # Merge both branches and predict a distribution over the vocabulary
    decoder1 = add([fe2, se2])
    decoder2 = LSTM(256)(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
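`build_model` takes `vocab_size` and `max_length`, which this README does not define. The sketch below shows one plausible way to derive them from a list of captions, roughly mirroring how the Keras `Tokenizer` assigns indices (most frequent word first, starting at 1). The sample captions and the `word_index` construction are assumptions for illustration; the notebook itself may compute these values differently.

```python
from collections import Counter

# Hypothetical captions, already wrapped in start and end tokens
captions = [
    "startseq a dog runs on grass endseq",
    "startseq a child plays endseq",
]

# Index words by descending frequency, starting at 1. Index 0 stays free
# because the Embedding layer uses mask_zero=True and treats 0 as padding.
counts = Counter(word for caption in captions for word in caption.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

vocab_size = len(word_index) + 1  # +1 for the padding index 0
max_length = max(len(caption.split()) for caption in captions)

print(vocab_size, max_length)  # 10 7
```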
The notebook includes steps to train the model using the dataset. This involves feeding the image features and corresponding captions to the model and optimizing the weights.
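# Example training call. features, sequences, and targets are assumed to be
# NumPy arrays built beforehand: image feature vectors, padded caption prefixes,
# and one-hot next-word targets, one row per training pair.
model = build_model(vocab_size, max_length)
model.fit([features, sequences], targets, epochs=20, verbose=2)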
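The training call relies on `sequences` and `targets` that are not built anywhere in this README. A minimal sketch of the usual approach is shown below: each caption is expanded into one training pair per word, where the input is the caption so far and the target is the next word. `make_training_pairs` and the example word indices are hypothetical, and plain lists are used in place of `pad_sequences` and `to_categorical` so the logic is easy to follow.

```python
def make_training_pairs(caption_ids, max_length, vocab_size):
    """Expand one encoded caption into (padded prefix, one-hot next word) pairs."""
    inputs, targets = [], []
    for i in range(1, len(caption_ids)):
        prefix, next_word = caption_ids[:i], caption_ids[i]
        # Left-pad with zeros, matching the default behaviour of pad_sequences
        padded = [0] * (max_length - len(prefix)) + prefix
        one_hot = [0] * vocab_size
        one_hot[next_word] = 1
        inputs.append(padded)
        targets.append(one_hot)
    return inputs, targets

# Hypothetical encoding: startseq=1, a=2, dog=4, endseq=3
inputs, targets = make_training_pairs([1, 2, 4, 3], max_length=5, vocab_size=6)
for x, y in zip(inputs, targets):
    print(x, "->", y.index(1))
# [0, 0, 0, 0, 1] -> 2
# [0, 0, 0, 1, 2] -> 4
# [0, 0, 1, 2, 4] -> 3
```

In a full pipeline, the image's feature vector would be repeated once per pair so that `features`, `sequences`, and `targets` all have the same number of rows.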
After training, the model can generate captions for new images. The generation process involves using the model to predict the next word in the sequence until a stop token is generated or the maximum length is reached.
def generate_caption(model, photo, tokenizer, max_length):
    # The model expects a batch dimension: (2048,) -> (1, 2048)
    photo = np.reshape(photo, (1, -1))
    in_text = 'startseq'
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = model.predict([photo, sequence], verbose=0)
        yhat = int(np.argmax(yhat))
        # index_word has no entry for the padding index 0, so use .get()
        word = tokenizer.index_word.get(yhat)
        # Stop on an unknown index or on the end token
        if word is None or word == 'endseq':
            break
        in_text += ' ' + word
    # Drop the leading 'startseq' token
    final_caption = ' '.join(in_text.split()[1:])
    return final_caption
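The two stopping conditions of this greedy loop can be tried out without a trained model. The sketch below replaces the model with a scripted stand-in that replays a fixed caption. `greedy_decode` and `scripted` are hypothetical names used only for this illustration.

```python
def greedy_decode(next_word_fn, max_length):
    """Append the predicted next word until 'endseq' or max_length is reached."""
    words = ["startseq"]
    for _ in range(max_length):
        word = next_word_fn(words)
        if word is None or word == "endseq":
            break
        words.append(word)
    # Drop the leading 'startseq' token
    return " ".join(words[1:])

# Stand-in for model.predict + argmax: replays a fixed caption word by word
script = ["a", "dog", "runs", "endseq"]

def scripted(words):
    step = len(words) - 1
    return script[step] if step < len(script) else None

print(greedy_decode(scripted, max_length=10))  # a dog runs
print(greedy_decode(scripted, max_length=2))   # a dog
```

The first call stops at the end token; the second is cut off by `max_length` before the end token appears.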
The following examples show how to caption a single image and how to evaluate the model on a held-out set:
# Load and preprocess the image
image_path = 'path_to_image.jpg'
photo = encode_image(image_path)
# Generate caption
caption = generate_caption(model, photo, tokenizer, max_length)
print("Generated Caption:", caption)
# Evaluate the model on a validation dataset
from nltk.translate.bleu_score import corpus_bleu

def evaluate_model(model, photos, descriptions, tokenizer, max_length):
    actual, predicted = list(), list()
    for key, desc_list in descriptions.items():
        yhat = generate_caption(model, photos[key], tokenizer, max_length)
        # Each image has several reference captions and one predicted caption
        actual.append([d.split() for d in desc_list])
        predicted.append(yhat.split())
    # Compute the corpus-level BLEU score over all images
    bleu = corpus_bleu(actual, predicted)
    return bleu

bleu_score = evaluate_model(model, test_features, test_descriptions, tokenizer, max_length)
print("BLEU Score:", bleu_score)
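To make the BLEU number easier to interpret, the sketch below computes clipped (modified) unigram precision, the 1-gram building block of the metric. It is a simplified illustration, not a replacement for `corpus_bleu`, which by default also combines 2-gram to 4-gram precisions and applies a brevity penalty. `modified_unigram_precision` and the sample sentences are made up for this example.

```python
from collections import Counter

def modified_unigram_precision(references, candidate):
    """Fraction of candidate words found in the references, with clipped counts."""
    cand_counts = Counter(candidate)
    # For each word, the largest count seen in any single reference
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    # A candidate word is credited at most as often as it appears in a reference
    clipped = sum(min(count, max_ref[word]) for word, count in cand_counts.items())
    return clipped / len(candidate)

references = ["a dog runs on the grass".split(), "the dog is running".split()]
candidate = "the the dog".split()
precision = modified_unigram_precision(references, candidate)
print(precision)  # 2 of 3 candidate words are credited
```

Clipping is what keeps a caption that repeats a common word from scoring well: the second "the" earns no credit because no reference contains it twice.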
The Caption Generator project combines image processing, feature extraction, and sequence prediction to generate descriptive captions for images. The notebook walks through loading data, preprocessing, building and training the model, and generating captions. By following the code and test cases, users can understand the workflow and customize the model for their specific needs.