keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

flow_from_dataframe randomly shuffles Tags #195

Closed renedlog closed 5 years ago

renedlog commented 5 years ago

I've made a multi-label classification network and am using class_mode 'categorical' as described in the Keras docs. (Side note: the documentation really needs improvement for the different class_modes. It is not clear how to use them.)

image source

I do save the Tags and indices via:

ModelLabels = (train_generator.class_indices)
ModelLabels = dict((v, k) for k, v in ModelLabels.items())

The strange behavior is that the binary representation of the tags doesn't look like the class_indices. So the tags are shuffled around and cannot be retrieved via train_generator.class_indices (after prediction).

The source of the image is a good example, but to encounter the problem one has to add more images and tags so that, e.g., ['see','desert','mountains'] (note the different order) is present.
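To make the expected behaviour concrete, here is a minimal pure-Python sketch of multi-hot encoding against a fixed class list. It is not the keras_preprocessing implementation: `multi_hot` is a hypothetical helper, and `class_indices` is written out by hand to mirror what a generator would report for classes=['a', 'b', 'c', 'd']. The point is that the column position depends only on the class-to-index mapping, not on the order of the tags within a row.

```python
# Hand-written stand-in for generator.class_indices
# (assumption: classes=['a', 'b', 'c', 'd']).
class_indices = {'a': 0, 'b': 1, 'c': 2, 'd': 3}


def multi_hot(tags, class_indices):
    """Hypothetical helper: encode one row of tags as a multi-hot list."""
    if isinstance(tags, str):  # a bare string counts as a single tag
        tags = [tags]
    row = [0.0] * len(class_indices)
    for tag in tags:
        row[class_indices[tag]] = 1.0
    return row


# The same tags in a different order give the same vector.
first = multi_hot(['a', 'c', 'd'], class_indices)
second = multi_hot(['d', 'a', 'c'], class_indices)
print(first)            # [1.0, 0.0, 1.0, 1.0]
print(first == second)  # True

# Inverting the mapping recovers tag names from column positions.
index_to_tag = {v: k for k, v in class_indices.items()}
decoded = [index_to_tag[i] for i, bit in enumerate(first) if bit == 1.0]
print(decoded)          # ['a', 'c', 'd']
```

If the generator's label rows and its class_indices disagree with this picture, that would point at the generator; if they agree, the mismatch is more likely introduced elsewhere in the pipeline.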

I'll try to provide a short example. (soon)

up to here it looks ok;

# import the necessary packages
import pandas as pd
from PIL import Image
from keras.layers import Dense, Activation, Flatten
from keras.models import Sequential
from keras_preprocessing.image import ImageDataGenerator

img = Image.new('RGB', (1, 1), color='red')
img.save('pil_red.png')
img = Image.new('RGB', (1, 1), color='blue')
img.save('pil_blue.png')
img = Image.new('RGB', (1, 1), color='white')
img.save('pil_white.png')
img = Image.new('RGB', (1, 1), color='green')
img.save('pil_green.png')
img = Image.new('RGB', (1, 1), color='black')
img.save('pil_black.png')

data = pd.DataFrame(
    {'Image': ['pil_red.png', 'pil_blue.png', 'pil_white.png', 'pil_green.png', 'pil_black.png'],
     'Labels': [['a', 'c','d'], ['c', 'a'], 'b', 'c', ['d','a', 'b', 'c']]})

generator = ImageDataGenerator().flow_from_dataframe(
    data, x_col='Image', y_col='Labels',
    batch_size=10,
    target_size=(1, 1),
    shuffle=False,
    color_mode='rgb',
    classes=['a', 'b', 'c','d'],
    class_mode='categorical'
)

out = generator.next()

data['outimage'] = list(out[0])
data['outbinary'] = list(out[1])

print(data)
ModelLabels = (generator.class_indices)
ModelLabels = dict((v, k) for k, v in ModelLabels.items())
print(ModelLabels)
############################################################
# looks ok up to here

will add further analysis (soon)

Here is more code: a small neural network that overfits well (on purpose for this test)... but when validated with the input data it doesn't deliver what I would expect. E.g. pil_red.png should be ['a', 'c','d'] but is 'b'. The same happens with bigger networks (I don't know why).

If I'm using the MultiLabelBinarizer it does exactly what I want (in the above example ['a', 'c','d'] -> probability roughly (1,0,1,1)).

datagen = ImageDataGenerator(rescale=1./255, validation_split=0.5)

generator = datagen.flow_from_dataframe(
    data, x_col='Image', y_col='Labels',
    batch_size=1,
    target_size=(1,1),
    color_mode='rgb',
    classes=['a', 'b', 'c','d'],
    class_mode='categorical',
    subset='training'
)

ModelLabels = (generator.class_indices)
ModelLabels = dict((v, k) for k, v in ModelLabels.items())

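model = Sequential()
model.add(Flatten(input_shape=(1,1,3)))
model.add(Dense(4))
# The original script had two stray expressions here, "Activation('linear')," and
# "Dense(100),", which were never passed to model.add() and so had no effect.
model.add(Activation('sigmoid'))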

from keras.optimizers import Adam

EPOCHS = 100
INIT_LR = 1e-3
opt = Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)

# distribution
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

# train the network
print("[INFO] training network...")
H = model.fit_generator(
    generator=generator,
    steps_per_epoch=EPOCHS,
    epochs=EPOCHS, verbose=1,
    use_multiprocessing=False,
    workers=6)

import cv2
from keras.preprocessing.image import img_to_array
import numpy as np

image = cv2.imread('pil_red.png')

# pre-process the image for classification
image = cv2.resize(image, (1, 1))
image = image.astype("float") / 255.0
image = img_to_array(image)
image = np.expand_dims(image, axis=0)

# classify the input image then find the indexes of the two class
# labels with the *largest* probability
print("[INFO] classifying image...")
proba = model.predict(image)[0]
idxs = np.argsort(proba)[::-1][:2]
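The last line keeps the two highest-scoring classes, so it always returns exactly two tags no matter how many actually apply. With sigmoid outputs, a common alternative is a per-class threshold. The following is only a sketch: the 0.5 threshold, the helper name `tags_above_threshold`, and the probabilities are all made up for illustration and are not taken from the run above.

```python
# Inverted class_indices, written out by hand for classes=['a', 'b', 'c', 'd'].
index_to_tag = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}


def tags_above_threshold(proba, index_to_tag, threshold=0.5):
    """Hypothetical helper: return every tag whose probability exceeds the threshold."""
    return [index_to_tag[i] for i, p in enumerate(proba) if p > threshold]


proba = [0.97, 0.02, 0.91, 0.88]  # made-up sigmoid outputs, for illustration only
predicted = tags_above_threshold(proba, index_to_tag)
print(predicted)  # ['a', 'c', 'd']
```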

rragundez commented 5 years ago

Hi @renedlog, thanks for submitting the issue. I agree with you that the documentation needs a lot of work. If I update it, could you help me by reviewing it from a user's point of view? I think code snippets are needed since the explanation is quite cumbersome. We are also thinking of adding an examples folder with scripts for each use case.

About the problem you mentioned, I cannot reproduce it. The indices, labels and classes seem to be OK.

import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from keras_preprocessing.image import ImageDataGenerator

pixel_val = 1
filenames = []
for i in range(20):
    filename = '/tmp/{}.jpg'.format(i)
    plt.imsave(filename, pixel_val * np.random.uniform(size=(3, 3, 3)))
    filenames.append(filename)

df = pd.DataFrame({'filename': filenames}).sample(frac=1).reset_index(drop=True)
classes = random.sample(['dog', 'cat', ['dog'], ['cat'], ['cat', 'dog'], ['dog', 'cat']] * 10, 20)
df['class'] = classes
generator = ImageDataGenerator().flow_from_dataframe(
    df,
    class_mode='categorical',
)
print(generator.class_indices)
>>> {'cat': 0, 'dog': 1}
print(df.assign(labels=generator.labels))
>>> 
       filename       class  labels
0    /tmp/5.jpg         cat       0
1    /tmp/9.jpg  [dog, cat]  [1, 0]
2    /tmp/4.jpg       [dog]     [1]
3    /tmp/8.jpg         cat       0
4   /tmp/18.jpg         cat       0
5   /tmp/16.jpg       [dog]     [1]
6    /tmp/2.jpg         dog       1
7   /tmp/14.jpg  [dog, cat]  [1, 0]
8   /tmp/11.jpg  [cat, dog]  [0, 1]
9   /tmp/10.jpg  [cat, dog]  [0, 1]
10   /tmp/3.jpg       [dog]     [1]
11   /tmp/0.jpg       [cat]     [0]
12  /tmp/17.jpg         dog       1
13   /tmp/6.jpg         dog       1
14  /tmp/19.jpg         cat       0
15  /tmp/15.jpg  [dog, cat]  [1, 0]
16  /tmp/12.jpg  [dog, cat]  [1, 0]
17   /tmp/1.jpg       [cat]     [0]
18  /tmp/13.jpg  [cat, dog]  [0, 1]
19   /tmp/7.jpg         dog       1

renedlog commented 5 years ago

@rragundez I extended the code a bit. The problem does not seem to be the generator somehow... Really weird: it works with the MultiLabelBinarizer but somehow not with flow_from_dataframe. I've tested two different datasets that both show the same behaviour, but I can't figure out the exact issue so far.

rragundez commented 5 years ago

I don't understand what MultiLabelBinarizer has to do with the issue. I can help you discover the reason, but first I need to see what the issue is. Basically, why did you open this issue?

renedlog commented 5 years ago

The problem is that the labels, even with a perfect fit (over-fit as above), do not correspond to the true labels in the multi-label case. This only occurs when using flow_from_dataframe. With img_to_array() and flow(), as in the following code:

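import os

import cv2
import numpy as np
from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator, img_to_array
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# imagePaths, IMAGE_DIMS, BS, EPOCHS, INIT_LR and foo are assumed to be defined elsewhere (not shown)
data = []
labels = []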

# loop over the input images
for imagePath in imagePaths:
    # load the image, pre-process it, and store it in the data list
    image = cv2.imread(imagePath)
    image = cv2.resize(image, (IMAGE_DIMS[1], IMAGE_DIMS[0]))
    image = img_to_array(image)
    data.append(image)

    # extract set of class labels from the image path and update the
    # labels list
    l = imagePath.split(os.path.sep)[-2].split("_")
    labels.append(l)

# scale the raw pixel intensities to the range [0, 1]
data = np.array(data, dtype="float") / 255.0
labels = np.array(labels)

# binarize the labels using scikit-learn's special multi-label
mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(labels)

# partition the data into training and testing splits using 80% of
# the data for training and the remaining 20% for testing
(trainX, testX, trainY, testY) = train_test_split(data,
    labels, test_size=0.2, random_state=42)

# construct the image generator for data augmentation
aug = ImageDataGenerator(rotation_range=25, width_shift_range=0.1,
    height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
    horizontal_flip=True, fill_mode="nearest")

model = foo.SmallerVGGNet.build(
    width=IMAGE_DIMS[1], height=IMAGE_DIMS[0],
    depth=IMAGE_DIMS[2], classes=len(mlb.classes_),
    finalAct="sigmoid")

opt = Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)

model.compile(loss="binary_crossentropy", optimizer=opt,
    metrics=["accuracy"])

H = model.fit_generator(
    aug.flow(trainX, trainY, batch_size=BS),
    validation_data=(testX, testY),
    steps_per_epoch=len(trainX) // BS,
    epochs=EPOCHS, verbose=1)

I do not have this issue.

rragundez commented 5 years ago

where is the problem in your example above?

rragundez commented 5 years ago

I extended the reproducible example to include the actual output from the batches:

import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from keras_preprocessing.image import ImageDataGenerator

pixel_val = 1
filenames = []
for i in range(20):
    filename = '/tmp/{}.jpg'.format(i)
    plt.imsave(filename, pixel_val * np.random.uniform(size=(3, 3, 3)))
    filenames.append(filename)

df = pd.DataFrame({'filename': filenames}).sample(frac=1).reset_index(drop=True)
classes = random.sample(['dog', 'cat', ['dog'], ['cat'], ['cat', 'dog'], ['dog', 'cat']] * 10, 20)
df['class'] = classes
generator = ImageDataGenerator().flow_from_dataframe(
    df,
    class_mode='categorical',
    shuffle=False,
    batch_size=len(df)
)
indices = next(generator)[1]
df = df.assign(labels_indices=generator.labels)
df['index_0_true_or_false'] = indices[:, 0]
df['index_1_true_or_false'] = indices[:, 1]
print(df.drop(columns='filename'))
>>>
         class labels_indices  index_0_true_or_false  index_1_true_or_false
0        [dog]            [1]                    0.0                    1.0
1   [cat, dog]         [0, 1]                    1.0                    1.0
2        [cat]            [0]                    1.0                    0.0
3        [dog]            [1]                    0.0                    1.0
4        [dog]            [1]                    0.0                    1.0
5        [cat]            [0]                    1.0                    0.0
6          dog              1                    0.0                    1.0
7   [cat, dog]         [0, 1]                    1.0                    1.0
8   [dog, cat]         [1, 0]                    1.0                    1.0
9        [dog]            [1]                    0.0                    1.0
10  [dog, cat]         [1, 0]                    1.0                    1.0
11  [cat, dog]         [0, 1]                    1.0                    1.0
12       [dog]            [1]                    0.0                    1.0
13  [dog, cat]         [1, 0]                    1.0                    1.0
14         cat              0                    1.0                    0.0
15         cat              0                    1.0                    0.0
16  [dog, cat]         [1, 0]                    1.0                    1.0
17       [dog]            [1]                    0.0                    1.0
18       [cat]            [0]                    1.0                    0.0
19  [cat, dog]         [0, 1]                    1.0                    1.0

As you can see, everything is consistent. Could it be that you are using shuffle=True and then trying to compare the output from the iteration with the original DataFrame? That won't work, of course, because the output from flow_from_dataframe is shuffled.
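The effect of shuffling on such a comparison can be illustrated without Keras. This sketch uses plain Python only; `order` stands in for the permutation a shuffling iterator would draw and is an assumption for illustration, not an attribute read from DataFrameIterator.

```python
import random

filenames = ['0.jpg', '1.jpg', '2.jpg', '3.jpg', '4.jpg']
labels = ['cat', 'dog', 'cat', 'dog', 'cat']  # labels[i] belongs to filenames[i]

# Stand-in for the permutation drawn when shuffle=True (seeded for repeatability).
order = list(range(len(filenames)))
random.Random(0).shuffle(order)

# What one full batch would contain, in shuffled order.
batch_labels = [labels[i] for i in order]

# Comparing by position against the original rows is only valid if the
# permutation happens to be the identity.
positional_matches = sum(b == l for b, l in zip(batch_labels, labels))

# Re-aligning through the permutation always matches the original rows.
realigned = [None] * len(labels)
for position, original_row in enumerate(order):
    realigned[original_row] = batch_labels[position]

print(realigned == labels)  # True
```

With shuffle=False the permutation is the identity, which is why the table above lines up row by row.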

renedlog commented 5 years ago

I don't think that's the problem. At the moment I guess it could be a compatibility issue. Keras is using Pillow, and the predict step in the above example uses OpenCV, so that could clearly lead to this strange behaviour... (but I still need to validate it).

Before, with img_to_array, it was CV2 and CV2. With flow_from_dataframe it is now Pillow and CV2. A warning in that case would certainly help.
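The channel-order difference behind this guess can be shown without either library. OpenCV's imread returns pixels in BGR order, whereas Pillow yields RGB. A minimal sketch with hand-written pixel tuples (`as_bgr` is a hypothetical helper that only mimics the layout, it does not call OpenCV):

```python
# Pixel values as Pillow would report them (RGB order).
red_rgb = (255, 0, 0)
blue_rgb = (0, 0, 255)


def as_bgr(pixel):
    """Reverse the channel order, mimicking how cv2.imread lays out a pixel."""
    return pixel[::-1]


# A red pixel read in BGR order is numerically identical to a blue pixel in RGB order.
red_as_seen_by_cv2 = as_bgr(red_rgb)
print(red_as_seen_by_cv2)              # (0, 0, 255)
print(red_as_seen_by_cv2 == blue_rgb)  # True
```

If this is indeed the cause, a network trained on Pillow-loaded images would be shown pil_red.png with the pixel values of pil_blue.png at prediction time. Converting with cv2.cvtColor(image, cv2.COLOR_BGR2RGB) before predict, or loading the test image with Pillow, would make inference consistent with training.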

rragundez commented 5 years ago

I don't think the model is overfitting in your test script. You are creating color images and then assigning shared labels to them; I doubt such a simple network can cope with that and get an accuracy of 99% or so on every label (I also tried it). I will close this issue now. It seems the problem does not reside within the DataFrameIterator class of flow_from_dataframe.