keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

ImageDataGenerator.flow_from_dataframe keeps loading when directory has subdirectories #96

Closed fahadakhan96 closed 5 years ago

fahadakhan96 commented 5 years ago

I'm working on the MURA dataset by Stanford. I'm trying to load the dataset using Keras's ImageDataGenerator. The data is in the following hierarchy:

The directory hierarchy

The study1_positive folder contains the images.

ImageDataGenerator.flow_from_directory cannot be used with this folder structure, therefore I tried using the flow_from_dataframe method.

However, when I run the code, it keeps executing and never finishes.

Following is the format of the Pandas DataFrame that I'm passing to the flow_from_dataframe method:

The DataFrame passed to flow_from_dataframe

I've also tried changing the labels to 'abnormal' and 'normal' in place of 1 and 0, respectively.
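
For readers who want to reproduce a DataFrame of this shape, here is a minimal sketch. It assumes that each study folder name ends in "_positive" or "_negative" (only study1_positive is named above; the "_negative" suffix, the helper name build_dataframe, and the demo file names are hypothetical) and that the images are PNG files. It is not the code used in the original post.

```python
import os
import tempfile

import pandas as pd


def build_dataframe(root):
    """Walk `root` and return one row per image with a 0/1 label.

    Assumption: the label is encoded in the parent folder name, e.g.
    'study1_positive' -> 1 and 'study1_negative' -> 0.
    """
    rows = []
    for dirpath, _dirs, files in os.walk(root):
        study = os.path.basename(dirpath)
        if not study.endswith(("_positive", "_negative")):
            continue  # not a study folder, nothing to label here
        label = 1 if study.endswith("_positive") else 0
        for name in sorted(files):
            if name.lower().endswith(".png"):
                rows.append({"path": os.path.join(dirpath, name), "label": label})
    return pd.DataFrame(rows, columns=["path", "label"])


# Tiny throwaway directory tree so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for study in ("study1_positive", "study1_negative"):
        folder = os.path.join(tmp, "patient1", study)
        os.makedirs(folder)
        open(os.path.join(folder, "image1.png"), "w").close()
    demo_df = build_dataframe(tmp)

print(demo_df["label"].tolist())
```

The resulting 'path' and 'label' columns can then be passed as x_col and y_col.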

Below is the code:

train_imggen = ImageDataGenerator(rescale=1./255, rotation_range=30,
                                  horizontal_flip=True)

train_loader = train_imggen.flow_from_dataframe(traindf, './', shuffle=True,
                                                x_col='path', y_col='label',
                                                color_mode='grayscale',
                                                target_size=(320, 320),
                                                class_mode='binary',
                                                batch_size=8)

smurak commented 5 years ago

I guess you have a lot of files in the directory ('./'). Here's how flow_from_dataframe works:

  1. It builds a list of all images in the directory.
  2. It keeps only the filenames from the input dataframe that appear in that list.

Also, as I mentioned in #93, the current flow_from_dataframe does not support relative paths.
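
The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's actual implementation: the helper names list_images and match_filenames and the extension whitelist are hypothetical. It shows why pointing the directory at './' can be slow: step 1 has to walk every file under that directory before any dataframe row is matched.

```python
import os
import tempfile

# Assumed whitelist of image extensions; the real library may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def list_images(directory):
    """Step 1: walk `directory` and collect every image path relative to it."""
    found = set()
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                full = os.path.join(root, name)
                found.add(os.path.relpath(full, directory))
    return found


def match_filenames(filenames, directory):
    """Step 2: keep only the dataframe filenames that step 1 discovered."""
    on_disk = list_images(directory)
    return [f for f in filenames if os.path.normpath(f) in on_disk]


def demo():
    """Build a small nested tree and match two candidate filenames against it."""
    with tempfile.TemporaryDirectory() as tmp:
        nested = os.path.join(tmp, "patient1", "study1_positive")
        os.makedirs(nested)
        open(os.path.join(nested, "image1.png"), "w").close()
        open(os.path.join(tmp, "notes.txt"), "w").close()  # ignored: not an image
        wanted = [
            os.path.join("patient1", "study1_positive", "image1.png"),
            "missing.png",  # not on disk, so it is filtered out
        ]
        return match_filenames(wanted, tmp)


matched = demo()
print(matched)
```

The cost of step 1 grows with the total number of files under the directory, regardless of how many rows the dataframe has.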

Could you check whether the following steps work?

  1. Clone my "fix_found_0_images" branch.
    git clone -b fix_found_0_images_bug https://github.com/smurak/keras-preprocessing.git
  2. Move the "keras_preprocessing" subdirectory to your working directory.
  3. Import it.
    import keras
    from keras_preprocessing import image
    ...
    train_imggen = image.ImageDataGenerator(...)
  4. Drop "MURA-v1.1/" from the "path" column in your dataframe and set directory to "./MURA-v1.1":
    train_loader = train_imggen.flow_from_dataframe(traindf, './MURA-v1.1', ...)
    OR change "path" to absolute paths and set directory to None:
    train_loader = train_imggen.flow_from_dataframe(traindf, None, ...)
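
The two options in step 4 amount to a small change to the "path" column. The following is a sketch with pandas only; the example rows are hypothetical stand-ins for the real dataframe, and the variable names relative_df and absolute_df are introduced just for illustration.

```python
import os

import pandas as pd

# Hypothetical rows mirroring the 'path' and 'label' columns from the question.
traindf = pd.DataFrame({
    "path": [
        "MURA-v1.1/train/XR_ELBOW/patient00011/study1_negative/image1.png",
        "MURA-v1.1/train/XR_ELBOW/patient00012/study1_positive/image1.png",
    ],
    "label": [0, 1],
})

# Option A: drop the leading "MURA-v1.1/" so paths are relative to './MURA-v1.1'.
prefix = "MURA-v1.1/"
relative_df = traindf.copy()
relative_df["path"] = relative_df["path"].apply(
    lambda p: p[len(prefix):] if p.startswith(prefix) else p
)

# Option B: turn every path into an absolute path, for use with directory=None.
absolute_df = traindf.copy()
absolute_df["path"] = absolute_df["path"].apply(os.path.abspath)

print(relative_df["path"][0])
print(absolute_df["path"][0])
```

relative_df would go with directory './MURA-v1.1', and absolute_df with directory None, as in the two calls in step 4.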

fahadakhan96 commented 5 years ago

Thanks, @smurak!

Your fix worked! Didn't need Step 4, though.

Please feel free to close this issue.

kindofausername commented 5 years ago

Thanks, @smurak! Great

It is working now with absolute paths! Step 4 is kind of not working on my system.

yasar-rehman commented 5 years ago

Here is my code using absolute paths: @Vijayabhaskar96 @smurak

train_df = pd.DataFrame(train_img_data)
train_df.columns = ['id', 'label']

test_df = pd.DataFrame(test_img_data)
test_df.columns = ['id', 'label']

print(train_df['id'][0])
print('******************************************************')

datagen = ImageDataGenerator(rescale=1./255)

train_generator = datagen.flow_from_dataframe(train_df, None,
    x_col='id',
    y_col='label',
    has_ext=True,
    batch_size=args.batch_size,
    seed=42,
    shuffle=True,
    class_mode="sparse",
    target_size=(224,224),
    color_mode='rgb',
    interpolation='nearest'
)

After running the above code, I got the following error:

/home/yaurehman2/Documents/Newwork/REPLY_ATTACK_FACE_Mod_corr/train/0.jpg


/home/yaurehman2/anaconda3/envs/virtual-tf2/lib/python3.5/site-packages/keras_preprocessing/image.py:2059: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.df[x_col] = self.df[x_col].astype(str)
Traceback (most recent call last):
  File "tutorial_pd.py", line 394, in <module>
    main(parser_arguments(sys.argv[1:]))
  File "tutorial_pd.py", line 242, in main
    interpolation='nearest'
  File "/home/yaurehman2/anaconda3/envs/virtual-tf2/lib/python3.5/site-packages/keras_preprocessing/image.py", line 1107, in flow_from_dataframe
    interpolation=interpolation)
  File "/home/yaurehman2/anaconda3/envs/virtual-tf2/lib/python3.5/site-packages/keras_preprocessing/image.py", line 2095, in __init__
    df=True)
  File "/home/yaurehman2/anaconda3/envs/virtual-tf2/lib/python3.5/site-packages/keras_preprocessing/image.py", line 1764, in _list_valid_filenames_in_directory
    dirname = os.path.basename(directory)
  File "/home/yaurehman2/anaconda3/envs/virtual-tf2/lib/python3.5/posixpath.py", line 139, in basename
    i = p.rfind(sep) + 1
AttributeError: 'NoneType' object has no attribute 'rfind'

yasar-rehman commented 5 years ago

Update: @Vijayabhaskar96 @smurak

After downloading the update by @smurak, my code now works with absolute paths.

However, I've found one more problem: it cannot deal with duplicate filenames. To balance my data, I duplicated some filenames in the training set, but flow_from_dataframe only reports the actual number of files in the training set, and it trains on that number rather than on the number of rows in the modified dataframe.

As an example, my training data contains two classes: class 1 with 6,000 samples and class 2 with 30,000.

To balance the two classes, I repeated class 1 five times, so my new balanced training set has 60,000 samples.

However, flow_from_dataframe reports that it found only 36,000 images belonging to 2 classes.

My batch size is 32, so I get 36000/32 = 1125 steps per epoch, whereas it should be 60000/32 = 1875.

Here is the output:

Live samples are 6000 , attack samples are 30000
The difference is :5
Balanced data samples: 60000
Found 36000 images belonging to 2 classes.
Epoch 1/1
224/1125 [====>.........................] - ETA: 1:38 - loss: 0.3755 - acc: 0.8602

Vijayabhaskar96 commented 5 years ago

@smurak's fix was temporary; the issue has since been fixed and updated. You should be fine if you install the latest GitHub version instead of the pip version. For the duplicates, set drop_duplicates=False.
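
The numbers reported earlier in the thread can be reproduced with pandas alone. This sketch assumes the generator de-duplicates on the filename column when drop_duplicates is left enabled, which is what the flag name suggests; the file names are made up, and pandas' own DataFrame.drop_duplicates stands in for whatever the generator does internally.

```python
import math

import pandas as pd

# Made-up filenames: 6,000 'live' samples and 30,000 'attack' samples.
live = pd.DataFrame({"id": ["live_%d.jpg" % i for i in range(6000)], "label": 0})
attack = pd.DataFrame({"id": ["attack_%d.jpg" % i for i in range(30000)], "label": 1})

# Oversample the minority class by repeating its rows five times.
balanced = pd.concat([live] * 5 + [attack], ignore_index=True)

# What remains if repeated filenames are collapsed to a single row each.
deduplicated = balanced.drop_duplicates(subset="id")

batch_size = 32
steps_balanced = math.ceil(len(balanced) / batch_size)
steps_dedup = math.ceil(len(deduplicated) / batch_size)

print(len(balanced), steps_balanced)
print(len(deduplicated), steps_dedup)
```

With de-duplication the row count falls back to 36,000 (1125 steps at batch size 32); keeping the duplicates preserves the 60,000 rows (1875 steps).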

yasar-rehman commented 5 years ago

Thank you for your prompt response and guidance, @Vijayabhaskar96

It's working now!

Here is the updated code for the generator:

train_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory=None,
    x_col='id',
    y_col='label',
    has_ext=True,
    batch_size=args.batch_size,
    seed=42,
    shuffle=True,
    class_mode="sparse",
    target_size=(224,224),
    color_mode='rgb',
    interpolation='nearest',
    drop_duplicates=False
)

Here is the output as a result of the above code:


Live samples are 6000 , attack samples are 30000
The difference is :5
Balanced data samples: 60000
Found 60000 images belonging to 2 classes.
Found 36000 images belonging to 2 classes.
Epoch 1/1
298/1875 [===>..........................] - ETA: 2:48 - loss: 0.3638 - acc: 0.8965

rragundez commented 5 years ago

@Dref360 this issue should be closed