keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

Vector columns of dataframe should be output as array (batch_size, features) #231

Closed kmader closed 5 years ago

kmader commented 5 years ago

Using the new dataframe loader feature with 'raw' values as output should run np.stack on the outputs so that the result is a 2D array instead of an array of objects. I fix this by calling

import numpy as np

# post hoc correction: stack the per-row vectors into a single 2D array
df_gen._targets = np.stack(df_gen.labels, 0)

after the generator has been created by the .flow_from_dataframe() call.

Here is what the dataframe looks like (x_col="image_path", y_col="aller_vec"): [screenshot of the dataframe]
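To make the difference concrete, here is a minimal sketch using only pandas and NumPy. The dataframe contents are invented for illustration (only the column names image_path and aller_vec come from the issue), and the sketch assumes the raw targets are taken directly from the column values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataframe in the issue: one vector per row.
df = pd.DataFrame({
    "image_path": ["a.png", "b.png", "c.png"],
    "aller_vec": [np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([0, 1, 1])],
})

# A column of arrays comes back as a 1-D object array, one vector per element.
raw_targets = df["aller_vec"].values

# Stacking along axis 0 turns it into a 2-D numeric array (n_rows, features).
stacked_targets = np.stack(raw_targets, 0)

print(raw_targets.shape, raw_targets.dtype)
print(stacked_targets.shape, stacked_targets.dtype)
```

The first print shows a one-dimensional object array, the second a regular two-dimensional array that a model can consume as a batch of targets.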

rragundez commented 5 years ago

Hi @kmader thanks for submitting an issue.

I think you are using it wrong. raw just returns whatever you put in the dataframe without any changes, basically df[y_col] (hence the name raw). In your example I see that you concatenate columns to create aller_vec, so you are essentially repeating information you already have in the columns. You just need raw and to pass a list of the columns to y_col. Have a look at the docstring.
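A rough sketch of why a list of columns avoids the problem, assuming (as described above) that raw amounts to df[y_col]. The label column names here are hypothetical, and the snippet uses only pandas rather than the generator itself.

```python
import pandas as pd

# Hypothetical dataframe with one numeric column per label (names are invented).
df = pd.DataFrame({
    "image_path": ["a.png", "b.png"],
    "dust": [1, 0],
    "pollen": [0, 1],
    "mold": [0, 1],
})

# Selecting several numeric columns yields a regular 2-D array,
# so no stacking step is needed afterwards.
y_col = ["dust", "pollen", "mold"]
targets = df[y_col].values

print(targets.shape)
```

Under that assumption, passing the same list as y_col together with class_mode="raw" to flow_from_dataframe would give targets of shape (batch_size, len(y_col)).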

I will close this issue for now.

P.S. I also do not recommend separating the labels into a one-hot encoded vector yourself. Instead, you can keep them in a list and use the categorical mode, which now supports multi-label outputs. Here is an example: https://rragundez.io/keras-multi-label.html
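For readers unfamiliar with multi-label encoding, the following sketch shows the kind of multi-hot matrix that per-image label lists map to. It is plain NumPy written for illustration, not the library's implementation, and the label names are hypothetical.

```python
import numpy as np

# Hypothetical per-image label lists, kept as lists instead of hand-made vectors.
labels = [["dust", "pollen"], ["mold"], []]

# Build a stable class index from every label that appears.
classes = sorted({name for row in labels for name in row})
index = {name: i for i, name in enumerate(classes)}

# One row per image, one column per class; 1.0 marks a positive class.
multi_hot = np.zeros((len(labels), len(classes)), dtype="float32")
for row, names in enumerate(labels):
    for name in names:
        multi_hot[row, index[name]] = 1.0

print(classes)
print(multi_hot)
```

Keeping the labels as lists leaves the class ordering and the encoding in one place instead of baking them into the dataframe.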

kmader commented 5 years ago

@rragundez I see how I could have done it differently for the case above, but for more general cases PR #232 allows you to have higher-dimensional outputs (without breaking anything else). Another common use case might be a sequence of tokens for image captioning problems.

rragundez commented 5 years ago

"without breaking anything else": I can have an "exotic" object in the dataframe, then call next() with the raw method to retrieve the batch, do something with that "exotic" object, and then use train_on_batch. As a dummy example take:

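import numpy as np

# ragged "exotic" content; recent NumPy versions require dtype=object here
a = np.array([1, np.array([1, 2])], dtype=object)
np.stack(a, axis=0)  # raises ValueError: the elements do not share a shape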

This will break. As I mentioned, raw means not doing anything to what the user decides to put in the dataframe.

The object can be, for example, a list containing positive classes and arrays of bounding boxes for each image. It can be some GPS class or geometric class or ....

kmader commented 5 years ago

OK, that makes sense. I think anything but arrays will clash when the batching occurs, but I can appreciate the naming convention.
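The trade-off discussed in this thread can be sketched as a helper that stacks only when every element is a numeric array of the same shape and otherwise returns the values untouched. stack_if_uniform is a hypothetical name used for illustration; it is not part of keras-preprocessing and is not claimed to be what PR #232 does.

```python
import numpy as np


def stack_if_uniform(values):
    """Hypothetical helper: stack same-shape numeric elements, else return values as-is."""
    try:
        arrays = [np.asarray(v) for v in values]
    except ValueError:
        # An element that is itself ragged cannot be converted; leave everything alone.
        return values
    shapes = {a.shape for a in arrays}
    if len(shapes) == 1 and all(a.dtype != object for a in arrays):
        return np.stack(arrays, axis=0)
    return values


# Uniform vectors stored in an object array get stacked into a 2-D array.
uniform = np.empty(2, dtype=object)
uniform[0] = np.array([1, 2])
uniform[1] = np.array([3, 4])
stacked = stack_if_uniform(uniform)

# The ragged example from the thread is returned unchanged.
ragged = np.array([1, np.array([1, 2])], dtype=object)
untouched = stack_if_uniform(ragged)

print(stacked.shape)
print(untouched is ragged)
```

Whether such implicit behavior belongs in a mode named raw is exactly the naming concern raised above; the sketch only shows that stacking can be made conditional.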