modAL-python / modAL

A modular active learning framework for Python
https://modAL-python.github.io/
MIT License

support Pandas dataframe as training data #20

Closed: fighting41love closed this issue 4 years ago

fighting41love commented 6 years ago

Thanks for sharing the great code! LightGBM is a popular package that accepts both NumPy arrays and pandas DataFrames as train/test data. It would be great if modAL also supported DataFrames as train/pool/test data. Thanks!

cosmic-cortex commented 6 years ago

Thanks! Adding pandas support is a great idea! I'll take a look at it soon; hopefully this can be included in the next release.

cosmic-cortex commented 6 years ago

I have started to implement the pandas.DataFrame support and I came across a major issue. The crux of the problem is that numpy arrays use row indexing, while pandas DataFrames use column indexing by default. That is, if you have a dataset X, then for instance X[0] gives the first row if it is a numpy array, but gives the column with index 0 for pandas DataFrames.
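The difference can be reproduced in a few lines. This is only an illustrative sketch: the variable names and the toy data are made up for the example, and the DataFrame is built with default integer column labels so that `X[0]` is valid for both types.

```python
import numpy as np
import pandas as pd

X_array = np.array([[1, 2], [3, 4], [5, 6]])
X_frame = pd.DataFrame(X_array)  # default integer column labels 0 and 1

first_row = X_array[0]       # NumPy: positional row access, the row [1, 2]
first_col = X_frame[0]       # pandas: label-based column access, the column 1, 3, 5
frame_row = X_frame.iloc[0]  # pandas: positional row access, the row [1, 2]
```

With string column labels, `X_frame[0]` would raise a `KeyError` instead of returning a column, so the same expression cannot be relied on for either rows or columns across both types.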

This causes a major incompatibility problem in the query strategy functions. When query strategies select the instance to query, they return the index and the instance as well. Currently, I have found no way to access the given instance in a type-agnostic way. One possible way to circumvent the problem is to remove the query instance from the return values of a query strategy. Since this would be a huge change, I am hesitant to do this.

fighting41love commented 6 years ago

scikit-learn is a good example of how to accept pandas DataFrames: it converts the DataFrame to a NumPy array internally. See https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62. Hope this is helpful. Thanks!

jpzhangvincent commented 5 years ago

This would be a very useful feature for improving the workflow and integration with other packages. Is there a branch where we can help with this feature?

cosmic-cortex commented 5 years ago

Currently, there are no feature branches specifically for this, but feel free to create one in a fork from the dev branch! I am happy to help, since I also think it is an important problem; I just haven't solved it yet. As I outlined in my previous comment, the main issue for me is that pandas DataFrames are indexed by column first, while numpy arrays are indexed by row first. One possible way to solve this is to convert to a numpy array immediately, but this kind of defeats the purpose for me.
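The comment does not spell out why immediate conversion defeats the purpose. One plausible reading is that the column labels and per-column dtypes are lost as soon as the DataFrame becomes an array. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"age": [31, 45], "city": ["Oslo", "Lima"]})

arr = df.to_numpy()  # mixed columns collapse into a single object-dtype array

has_column_labels = hasattr(arr, "columns")  # False: labels are gone
array_dtype = arr.dtype                      # object, not per-column dtypes
```

After the conversion, anything downstream that relies on column names (for example a transformer that selects columns by label) can no longer be applied to the pool.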

BoyanH commented 4 years ago

> I have started to implement the pandas.DataFrame support and I came across a major issue. The crux of the problem is that numpy arrays use row indexing, while pandas DataFrames use column indexing by default. That is, if you have a dataset X, then for instance X[0] gives the first row if it is a numpy array, but gives the column with index 0 for pandas DataFrames.
>
> This causes a major incompatibility problem in the query strategy functions. When query strategies select the instance to query, they return the index and the instance as well. Currently, I have found no way to access the given instance in a type-agnostic way. One possible way to circumvent the problem is to remove the query instance from the return values of a query strategy. Since this would be a huge change, I am hesitant to do this.

One could handle pandas data frames separately, e.g.

    if isinstance(X, pd.DataFrame):
        return X.iloc[query_indices]

    return X[query_indices]

To avoid repeating this check in every query strategy, the strategies could return indices only, as you suggested. The row lookup can then be added only in the query() method implementations.
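A minimal sketch of that split, not modAL's actual code: `select_rows`, `query_indices_only`, and `query` are hypothetical names introduced here, and the "strategy" is a toy that simply picks the highest scores.

```python
import numpy as np
import pandas as pd


def select_rows(X, indices):
    """Return the rows of X at the given positions (hypothetical helper)."""
    if isinstance(X, (pd.DataFrame, pd.Series)):
        return X.iloc[indices]  # positional row access for pandas objects
    return X[indices]           # NumPy arrays already index by row


def query_indices_only(scores, n_instances=1):
    """Toy strategy: positions of the n highest scores, best first."""
    return np.argsort(scores)[-n_instances:][::-1]


def query(X_pool, scores, n_instances=1):
    """Stand-in for a learner's query(): the only place rows are looked up."""
    query_idx = query_indices_only(scores, n_instances)
    return query_idx, select_rows(X_pool, query_idx)


X_pool = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]}, index=["x", "y", "z"])
scores = np.array([0.1, 0.9, 0.5])

idx_frame, rows_frame = query(X_pool, scores, n_instances=2)
idx_array, rows_array = query(X_pool.to_numpy(), scores, n_instances=1)
```

The strategy itself never touches `X_pool`, so it needs no knowledge of the container type; only the single lookup in `query` does.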

Once this is done, the only remaining changes needed to support pandas data frames are in the places that work with instance representations, e.g. when calculating similarities between instances. If #104 is resolved (I am working on it), one could replace the estimator with an sklearn pipeline of transformation + estimator, where the transformation converts the data frame to a matrix. Something similar to:

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier

    from modAL.models import ActiveLearner
    from modAL.batch import uncertainty_batch_sampling

    # X_training is a pandas DataFrame, y_training the matching labels
    learner = ActiveLearner(
        # ActiveLearner fits the provided estimator to the provided training data
        estimator=Pipeline(steps=[
            ('transform', OneHotEncoder()),  # results in a matrix
            ('classify', RandomForestClassifier())
        ]),
        query_strategy=uncertainty_batch_sampling,
        X_training=X_training,
        y_training=y_training,
        # not implemented: should force query strategies to work on the
        # transformed (one-hot encoded) data
        on_transformed=True
    )
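The pipeline part of this idea can be tried with scikit-learn alone. The sketch below leaves out modAL and the proposed `on_transformed` flag entirely; the toy data, `handle_unknown="ignore"`, and the small forest are assumptions made only to keep the example short and reproducible.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X_training = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "L", "M", "S"],
})
y_training = [0, 1, 0, 1]

pipeline = Pipeline(steps=[
    ("transform", OneHotEncoder(handle_unknown="ignore")),  # DataFrame -> matrix
    ("classify", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X_training, y_training)  # the DataFrame is passed in directly

# The fitted transform step gives the matrix representation that a
# similarity-based query strategy would need to operate on.
X_transformed = pipeline.named_steps["transform"].transform(X_training)
predictions = pipeline.predict(X_training)
```

Here `X_transformed` has one column per category (three colors plus three sizes), which is the kind of numeric representation that distance or similarity computations require.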