keras-team / keras

Deep Learning for humans
http://keras.io/

Image Captioning example in keras Approach? #2295

Closed bhaveshoswal closed 3 years ago

bhaveshoswal commented 8 years ago

I have used the Keras example code for image captioning. I use the pretrained VGG model to extract image features (4096-D). For the text part, I index the unique words and post-pad with zeros up to the max caption length (which equals the length of the longest sentence in the data). For the next words, I create a numpy array of shape (number of examples, vocabulary size), where vocabulary size is the number of unique words in the data. The next-words array is a 0/1 matrix: 1 means the word is present in the sentence, 0 means it is absent.

For prediction, what exactly should be given as the partial caption and as the next words if we consider, say, "cat sat on mat"?

In the training data I append "START" and "END" tokens to the start and end of the captions.
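For example, with "cat sat on mat" my understanding is that each training sample should pair a partial caption with a single next word, roughly like this sketch (this is just how I think it should work, not exactly what my code below does, so correct me if this is wrong):

caption = "START cat sat on mat END".split()
partial_captions = []
next_words = []
for i in range(1, len(caption)):
    partial_captions.append(caption[:i])   # e.g. ['START', 'cat', 'sat']
    next_words.append(caption[i])          # the single word that follows, e.g. 'on'
# each partial caption would then be index-encoded and zero padded to max_caption_len,
# and each next word one-hot encoded over the vocabulary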

Here is my code; correct me if I am doing anything wrong:

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD, Adadelta, Adagrad
from keras.utils import np_utils, generic_utils
from keras.callbacks import EarlyStopping
from keras.layers.advanced_activations import PReLU, LeakyReLU
from keras.layers import Embedding,GRU,TimeDistributedDense,RepeatVector,Merge
from keras.preprocessing.text import one_hot
from keras.preprocessing import sequence
import cv2
import numpy as np

max_caption_len = 21
vocab_size = 43
def VGG_16(weights_path=None):
    model = Sequential()
    model.add(ZeroPadding2D((1,1),input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))

    if weights_path:
        model.load_weights(weights_path)

    #Remove the last two layers to get the 4096D activations
    model.layers.pop()
    model.layers.pop()

    return model
print "VGG loading"
image_model = VGG_16('vgg16_weights.h5')
image_model.trainable = False
print "VGG loaded"
# let's load the weights from a save file.
# image_model.load_weights('weight_file.h5')

# next, let's define a RNN model that encodes sequences of words
# into sequences of 128-dimensional word vectors.
print "Text model loading"
language_model = Sequential()
language_model.add(Embedding(vocab_size, 256, input_length=max_caption_len))
language_model.add(GRU(output_dim=128, return_sequences=True))
language_model.add(TimeDistributedDense(128))
print "Text model loaded"
# let's repeat the image vector to turn it into a sequence.
print "Repeat model loading"
image_model.add(RepeatVector(max_caption_len))
print "Repeat model loaded"
# the output of both models will be tensors of shape (samples, max_caption_len, 128).
# let's concatenate these 2 vector sequences.
print "Merging"
model = Sequential()
model.add(Merge([image_model, language_model], mode='concat', concat_axis=-1))
# let's encode this vector sequence into a single vector
model.add(GRU(256, return_sequences=False))
# which will be used to compute a probability
# distribution over what the next word in the caption should be!
model.add(Dense(vocab_size))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
print "Merged"
# "images" is a numpy float array of shape (nb_samples, nb_channels=3, width, height).
# "captions" is a numpy integer array of shape (nb_samples, max_caption_len)
# containing word index sequences representing partial captions.
# "next_words" is a numpy float array of shape (nb_samples, vocab_size)
# containing a categorical encoding (0s and 1s) of the next word in the corresponding
# partial caption.
print "Data preprocessig"
Texts = ["START A girl is stretched out in shallow water END",
        "START The two people stand by a body of water and in front of bushes in fall END",
        "START A blonde horse and a blonde girl in a black sweatshirt are staring at a fire in a barrel END",
        "START Children sit and watch the fish moving in the pond END",
        "START A fisherman fishes at the bank of a foggy river END"]

Images = ["667626_18933d713e.jpg",
         "3637013_c675de7705.jpg",
         "10815824_2997e03d76.jpg",
         "12830823_87d2654e31.jpg",
         "17273391_55cfc7d3d4.jpg"]
images = []
for image in Images:
    img = cv2.imread(image)
    # resize to 224x224 and put channels first to match the (3, 224, 224) input shape
    img = cv2.resize(img, (224, 224)).transpose(2, 0, 1)
    images.append(img)
images = np.asarray(images)

words = [txt.split() for txt in Texts]
unique = []
for word in words:
    unique.extend(word)
unique = list(set(unique))
word_index = {}
index_word = {}
for i,word in enumerate(unique):
    word_index[word] = i
    index_word[i] = word

partial_captions = []
for text in Texts:
    one = [word_index[txt] for txt in text.split()]
    partial_captions.append(one)

partial_captions = sequence.pad_sequences(partial_captions, maxlen=max_caption_len,padding='post')
next_words = np.zeros((5,vocab_size))
for i,text in enumerate(Texts):
    text = text.split()
    x = [word_index[txt] for txt in text]
    x = np.asarray(x)
    next_words[i,x] = 1

print "Data preprocessing done"
model.fit([images, partial_captions], next_words, batch_size=1, nb_epoch=5)

For this task I am taking only 5 examples to understand how this model works, and I also want to know whether my approach is right or wrong. Afterwards I am going to use the Flickr8k dataset for the same.

deepnarainsingh commented 8 years ago

@bhaveshoswal which dataset are you using?

deepnarainsingh commented 8 years ago

@bhaveshoswal @shashankg7 When I run this code I get the error "All input arrays and the target array must have the same number of samples." Can you please help with this?

bhaveshoswal commented 8 years ago

You have to take any five images and five captions for them, as I have done in the Texts and Images variables; then the code will run. Also resize all the images to (224, 224).
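In other words, all three arrays must have the same number of samples. With my 5 examples, the expected shapes would be roughly (a sketch):

print images.shape            # (5, 3, 224, 224)
print partial_captions.shape  # (5, 21)
print next_words.shape        # (5, 43)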

deepnarainsingh commented 8 years ago

@bhaveshoswal Hi, thanks, I was able to resolve it. Did you try your code on the Flickr data? There are 5 captions per image there; how are you handling that? Have you completed training on the Flickr data?

bhaveshoswal commented 8 years ago

I tried it on 5 images and their 5 captions from the Flickr data.

elliottd commented 8 years ago

@bhaveshoswal @deepnarainsingh if you are still trying to implement this type of model in Keras (without the ability to finetune the ConvNet), I have a working implementation in this repository.

wiraindrak commented 7 years ago

@bhaveshoswal thanks for sharing. How do you handle the real Flickr data? Did you use a data generator? Please explain, I need that. Regards.

bhaveshoswal commented 7 years ago

@kucingit3m I am not using a data generator. Just resize the images to (224, 224) and embed the captions; since there are five captions per image, repeat the image vector five times in the training data. Hope that helps you.
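Roughly like this sketch (the shapes and arrays here are dummy placeholders just to show the repetition, not the real Flickr data):

import numpy as np

num_images, captions_per_image, max_caption_len = 2, 5, 21
images = np.random.rand(num_images, 3, 224, 224)                                        # preprocessed images, channels first
captions = np.random.randint(0, 43, (num_images, captions_per_image, max_caption_len))  # index-encoded captions

image_inputs, caption_inputs = [], []
for i in range(num_images):
    for j in range(captions_per_image):
        image_inputs.append(images[i])        # the same image repeated once per caption
        caption_inputs.append(captions[i, j])
image_inputs = np.asarray(image_inputs)       # shape (10, 3, 224, 224)
caption_inputs = np.asarray(caption_inputs)   # shape (10, 21)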

wiraindrak commented 7 years ago

@bhaveshoswal thanks, that helps me. But why does my loss keep increasing every epoch? Did you ever experience that?

indra215 commented 7 years ago

@bhaveshoswal thanks for the explanation. I have given the input just as you explained above, but my loss kept increasing every epoch. Did you face such an issue while training?

Also, could you explain how to provide the test data? At test time there is only an image for which we need to generate a caption, so how do we supply the data for testing?

Thanks in advance.

bhaveshoswal commented 7 years ago

@elliottd thanks for your implementation, but it is totally different from what I am trying to do.

bhaveshoswal commented 7 years ago

@kucingit3m that's the same for me.

bhaveshoswal commented 7 years ago

@indra215 you have to give two inputs: first the image of size (224, 224), and second the partial caption, for example [START, A], for the image you give.
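At test time I start from the START token, predict a word, append it to the partial caption, and repeat until END is predicted. Something like this sketch (generate_caption is a hypothetical helper; it assumes the trained merged model, word_index/index_word and max_caption_len from my code above):

import numpy as np
from keras.preprocessing import sequence

def generate_caption(model, image, word_index, index_word, max_caption_len=21):
    caption = [word_index["START"]]                     # start with the START token
    while len(caption) < max_caption_len:
        padded = sequence.pad_sequences([caption], maxlen=max_caption_len, padding='post')
        probs = model.predict([np.asarray([image]), padded])[0]
        next_index = int(np.argmax(probs))              # greedy: pick the most likely next word
        if index_word[next_index] == "END":
            break
        caption.append(next_index)
    return " ".join(index_word[i] for i in caption[1:])  # drop the START token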

indra215 commented 7 years ago

@bhaveshoswal thanks for the reply. I've followed your example code above (on a much larger dataset), where you used the entire caption in partial_captions, and in next_words you gave a 0/1 encoding of the words present in partial_captions. But where does the "partial caption" meaning actually come into the data? I mean, you are giving the entire caption in partial_captions.

Did you successfully train this model and generate captions on some real data? If so, could you please help me with the code for providing the input data to train the model?

Also, how do we provide the data when testing on a new image, where we don't have any partial caption?

Thank you in advance.

junyongyou commented 7 years ago

Hi, it seems the image captioning example is gone. @bhaveshoswal Did you finish your code?

bhomass commented 7 years ago

Where is the image captioning code? I just cloned the Keras repo, and there is nothing in there for captioning.

anuragmishracse commented 7 years ago

@junyongyou @bhomass This might be of help to you: https://github.com/anuragmishracse/caption_generator.

PavlosMelissinos commented 7 years ago

The most recent version of the captioning example can be found here. It's an older commit (0b2c044, 8 March 2017) of the keras repo.

jonilaserson commented 7 years ago

Why was it removed from the set of examples?


PavlosMelissinos commented 7 years ago

Not sure... It works with Keras 2 with minor modifications, so I don't see a practical reason. Maybe the application was just too niche?

oarriaga commented 7 years ago

Hello, if someone is interested, you can find an image captioning implementation in Keras 2.0 here.

jtoy commented 7 years ago

thanks!

SJameer commented 7 years ago

OSError: Unable to open file (Unable to open file: name = 'vgg16_weights.h5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)

I am getting this error after writing the convolution layers. Is there any newer, updated code for this example? Please provide a link for practice.

anuragmishracse commented 7 years ago

@SJameer You can use vgg16 in keras.applications, here: https://keras.io/applications/#vgg16
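For instance, something like this sketch (assuming Keras 2; the weights are downloaded automatically, and the 4096-D features come from the layer named 'fc2'):

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from keras.models import Model
import numpy as np

base = VGG16(weights='imagenet', include_top=True)
# take the 4096-D activations of the second fully connected layer as image features
feature_extractor = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

img = image.load_img('667626_18933d713e.jpg', target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = feature_extractor.predict(x)   # shape (1, 4096)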

iam-shaleen commented 6 years ago

@anuragmishracse how much time does it take to train the language model? My program has been running for the past 2 hrs on a GPU, and there is still no result.