Closed renedlog closed 5 years ago
Hi @renedlog, thanks for submitting the issue. I agree that the documentation needs a lot of work. If I update it, could you help me review it from a user's point of view? I think code snippets are needed, since the current explanation is quite cumbersome. We are also thinking of adding an examples folder with scripts for each use case.
About the problem you mentioned, I cannot reproduce it. The indices, labels and classes seem to be OK.
import random
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from keras_preprocessing.image import ImageDataGenerator
pixel_val = 1
filenames = []
for i in range(20):
    filename = '/tmp/{}.jpg'.format(i)
    plt.imsave(filename, pixel_val * np.random.uniform(size=(3, 3, 3)))
    filenames.append(filename)
df = pd.DataFrame({'filename': filenames}).sample(frac=1).reset_index(drop=True)
classes = random.sample(['dog', 'cat', ['dog'], ['cat'], ['cat', 'dog'], ['dog', 'cat']] * 10, 20)
df['class'] = classes
generator = ImageDataGenerator().flow_from_dataframe(
    df,
    class_mode='categorical',
)
print(generator.class_indices)
>>> {'cat': 0, 'dog': 1}
print(df.assign(labels=generator.labels))
>>>
filename class labels
0 /tmp/5.jpg cat 0
1 /tmp/9.jpg [dog, cat] [1, 0]
2 /tmp/4.jpg [dog] [1]
3 /tmp/8.jpg cat 0
4 /tmp/18.jpg cat 0
5 /tmp/16.jpg [dog] [1]
6 /tmp/2.jpg dog 1
7 /tmp/14.jpg [dog, cat] [1, 0]
8 /tmp/11.jpg [cat, dog] [0, 1]
9 /tmp/10.jpg [cat, dog] [0, 1]
10 /tmp/3.jpg [dog] [1]
11 /tmp/0.jpg [cat] [0]
12 /tmp/17.jpg dog 1
13 /tmp/6.jpg dog 1
14 /tmp/19.jpg cat 0
15 /tmp/15.jpg [dog, cat] [1, 0]
16 /tmp/12.jpg [dog, cat] [1, 0]
17 /tmp/1.jpg [cat] [0]
18 /tmp/13.jpg [cat, dog] [0, 1]
19 /tmp/7.jpg dog 1
@rragundez I extended the code a bit. The problem somehow does not seem to be the generator... Really weird: it works with the MultiLabelBinarizer but somehow not with flow_from_dataframe. I've tested two different datasets that both show the same behaviour, but I can't figure out the exact issue so far.
I don't understand what MultiLabelBinarizer has to do with the issue. I can help you find the cause, but first I need to see what the issue is. Basically, why did you open this issue?
The problem is that the labels, even with a perfect fit (an over-fit, as above), do not correspond to the true labels in the multi-label case. This only occurs when using flow_from_dataframe; it does not occur with, e.g., img_to_array() and flow():
data = []
labels = []
# loop over the input images
for imagePath in imagePaths:
    # load the image, pre-process it, and store it in the data list
    image = cv2.imread(imagePath)
    image = cv2.resize(image, (IMAGE_DIMS[1], IMAGE_DIMS[0]))
    image = img_to_array(image)
    data.append(image)
    # extract set of class labels from the image path and update the
    # labels list
    l = label = imagePath.split(os.path.sep)[-2].split("_")
    labels.append(l)
# scale the raw pixel intensities to the range [0, 1]
data = np.array(data, dtype="float") / 255.0
labels = np.array(labels)
# binarize the labels using scikit-learn's multi-label binarizer
mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(labels)
# partition the data into training and testing splits using 80% of
# the data for training and the remaining 20% for testing
(trainX, testX, trainY, testY) = train_test_split(data,
    labels, test_size=0.2, random_state=42)
# construct the image generator for data augmentation
aug = ImageDataGenerator(rotation_range=25, width_shift_range=0.1,
    height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
    horizontal_flip=True, fill_mode="nearest")
model = foo.SmallerVGGNet.build(
    width=IMAGE_DIMS[1], height=IMAGE_DIMS[0],
    depth=IMAGE_DIMS[2], classes=len(mlb.classes_),
    finalAct="sigmoid")
opt = Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="binary_crossentropy", optimizer=opt,
    metrics=["accuracy"])
H = model.fit_generator(
    aug.flow(trainX, trainY, batch_size=BS),
    validation_data=(testX, testY),
    steps_per_epoch=len(trainX) // BS,
    epochs=EPOCHS, verbose=1)
I do not have this issue.
Where is the problem in your example above?
I extended the reproducible example to include the actual output from the batches:
import random
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from keras_preprocessing.image import ImageDataGenerator
pixel_val = 1
filenames = []
for i in range(20):
    filename = '/tmp/{}.jpg'.format(i)
    plt.imsave(filename, pixel_val * np.random.uniform(size=(3, 3, 3)))
    filenames.append(filename)
df = pd.DataFrame({'filename': filenames}).sample(frac=1).reset_index(drop=True)
classes = random.sample(['dog', 'cat', ['dog'], ['cat'], ['cat', 'dog'], ['dog', 'cat']] * 10, 20)
df['class'] = classes
generator = ImageDataGenerator().flow_from_dataframe(
    df,
    class_mode='categorical',
    shuffle=False,
    batch_size=len(df)
)
indices = next(generator)[1]
df = df.assign(labels_indices=generator.labels)
df['index_0_true_or_false'] = indices[:, 0]
df['index_1_true_or_false'] = indices[:, 1]
print(df.drop(columns='filename'))
>>>
class labels_indices index_0_true_or_false index_1_true_or_false
0 [dog] [1] 0.0 1.0
1 [cat, dog] [0, 1] 1.0 1.0
2 [cat] [0] 1.0 0.0
3 [dog] [1] 0.0 1.0
4 [dog] [1] 0.0 1.0
5 [cat] [0] 1.0 0.0
6 dog 1 0.0 1.0
7 [cat, dog] [0, 1] 1.0 1.0
8 [dog, cat] [1, 0] 1.0 1.0
9 [dog] [1] 0.0 1.0
10 [dog, cat] [1, 0] 1.0 1.0
11 [cat, dog] [0, 1] 1.0 1.0
12 [dog] [1] 0.0 1.0
13 [dog, cat] [1, 0] 1.0 1.0
14 cat 0 1.0 0.0
15 cat 0 1.0 0.0
16 [dog, cat] [1, 0] 1.0 1.0
17 [dog] [1] 0.0 1.0
18 [cat] [0] 1.0 0.0
19 [cat, dog] [0, 1] 1.0 1.0
As you can see, everything is consistent. Could it be that you are using shuffle=True and then trying to compare the output of the iteration back against the original DataFrame? That won't work, of course, because the output from flow_from_dataframe is shuffled.
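To illustrate the point about shuffling, here is a minimal sketch in plain NumPy. It does not call the generator itself; the permutation `order` is only an assumed stand-in for whatever order a shuffling iterator happens to visit the rows in.

```python
import numpy as np

# Labels in DataFrame row order (multi-hot columns: [cat, dog]).
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 1]], dtype=float)

# Stand-in for the order a shuffling iterator might visit the rows in.
rng = np.random.default_rng(0)
order = rng.permutation(len(labels))

# What a shuffled batch would contain: the same rows, in a different order.
batch_labels = labels[order]

# Row i of the batch is labels[order[i]], not labels[i], so a comparison
# against the DataFrame is only valid after undoing the permutation.
restored = np.empty_like(batch_labels)
restored[order] = batch_labels

aligned_after_restore = bool(np.array_equal(restored, labels))
print(aligned_after_restore)
```

With shuffle=False, as in the example above, no such bookkeeping is needed because the batch rows follow the DataFrame order.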
I don't think that's the problem. At the moment I suspect a compatibility issue: Keras loads images with Pillow, while the prediction in the example above uses OpenCV, which could well lead to this strange behaviour (I still need to validate it).
Before, with img_to_array, it was cv2 on both sides; with flow_from_dataframe it is now Pillow and cv2. A warning in that case would certainly help.
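For what it's worth, the Pillow/OpenCV mismatch can be sketched without either library: Pillow-style loaders return channels as (R, G, B), whereas cv2.imread returns (B, G, R). The snippet below only simulates that difference with NumPy arrays; it is an illustration of the suspected cause, not a confirmed diagnosis of this issue.

```python
import numpy as np

# A single "pure red" pixel as a Pillow-style loader returns it: (R, G, B).
rgb = np.zeros((1, 1, 3), dtype=np.uint8)
rgb[0, 0] = (255, 0, 0)

# OpenCV's imread delivers the channels in (B, G, R) order, so the same
# pixel arrives as (0, 0, 255).
bgr = rgb[..., ::-1]

# A model trained on RGB input would read this BGR pixel as pure blue.
looks_blue_to_rgb_model = tuple(int(v) for v in bgr[0, 0]) == (0, 0, 255)

# Reversing the last axis again restores the order used during training.
fixed = bgr[..., ::-1]
print(looks_blue_to_rgb_model, np.array_equal(fixed, rgb))
```

When predicting on images read with OpenCV, cv2.cvtColor(image, cv2.COLOR_BGR2RGB) performs the same channel swap.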
I don't think the model is overfitting in your test script. You are creating color images and then assigning shared labels to them; I doubt that such a simple network can cope with that and reach an accuracy of 99% or so on every label. (I also tried it.)
I will close this issue now. It seems the problem does not reside within the DataFrameIterator class of flow_from_dataframe.
I've made a multi-label classification network, using the 'categorical' class_mode as described in the Keras docs. (Side note: the documentation really needs improvement for the different class_modes. It is not clear how to use them.)
source
I save the tags and indices via:
ModelLabels = (train_generator.class_indices)
ModelLabels = dict((v, k) for k, v in ModelLabels.items())
The strange behavior is that the binary representation of the tags doesn't look like the class_indices. So the tags are shuffled around and cannot be retrieved via train_generator.class_indices (after prediction).
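As a rough sketch of what retrieving the tags after prediction could look like: the `class_indices` values below are made up for illustration, and `decode` with its 0.5 threshold is a hypothetical helper, not part of Keras.

```python
# Hypothetical mapping, in the shape flow_from_dataframe reports it.
class_indices = {'cat': 0, 'dog': 1}

# Invert it so that an output column index maps back to a tag.
index_to_tag = {index: tag for tag, index in class_indices.items()}


def decode(probabilities, threshold=0.5):
    """Return the tags whose predicted probability reaches the threshold."""
    return [index_to_tag[i] for i, p in enumerate(probabilities) if p >= threshold]


print(decode([0.97, 0.02]))  # ['cat']
print(decode([0.91, 0.88]))  # ['cat', 'dog']
```

The decoded tags always come out in column order, regardless of the order in which they were listed in the DataFrame.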
The image from the source is a good example, but to encounter the problem one has to add more images and tags, e.g. where ['see','desert','mountains'] (note the different order) is present.
I'll try to provide a short example. (soon)
up to here it looks ok;
will add further analysis (soon)
Here is more code: a small neural network that overfits well (on purpose, for this test). But when validated with the input data it doesn't deliver what I would expect; e.g. pil_red.png should be ['a', 'c','d'] but is 'b'. The same happens with bigger networks (I don't know why).
If I'm using the MultiLabelBinarizer it does exactly what I want (in the above example ['a', 'c','d'] -> probability roughly (1,0,1,1)).
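For reference, the target encoding expected here can be sketched in a few lines of plain Python. `multi_hot` is a hypothetical helper written for this illustration; it is neither the scikit-learn nor the Keras implementation, and the class list a to d is assumed from the example above.

```python
def multi_hot(tags, classes):
    """Encode a list of tags as a 0/1 vector over the given class list."""
    present = set(tags)
    return [1 if c in present else 0 for c in classes]


# Classes are sorted so that each one gets a stable column index.
classes = sorted({'a', 'b', 'c', 'd'})

print(multi_hot(['a', 'c', 'd'], classes))  # [1, 0, 1, 1]
print(multi_hot(['d', 'a', 'c'], classes))  # [1, 0, 1, 1]
```

The order of the tags within a row does not affect the encoded vector, so ['a', 'c', 'd'] and ['d', 'a', 'c'] should produce the same target.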