scikit-learn-contrib / sklearn-pandas

Pandas integration with sklearn

Pandas In, Pandas Out? `.inverse_transform()` method #41

Open naught101 opened 8 years ago

naught101 commented 8 years ago

It would be really nice to have the ability to put pandas dataframes into sklearn pipelines, and to have equivalent pandas dataframes returned afterwards. I think that this module would be the place for that - probably all that would be required is a .inverse_transform method on the DataFrameMapper.

Would something like this be wanted in this module? I can make a pull request, if so.

Before I do, why is all the code in __init__.py? Seems like it'll get hard to maintain after a while...

dukebody commented 8 years ago

Hi Naught101!

You can already put pandas dataframes into sklearn pipelines. Just create a pipeline where the first step is the DataFrameMapper.

Regarding the proposal "to have equivalent dataframes returned afterwards": do you mean making the pipeline return a pandas DataFrame? Sklearn pipelines usually return numpy arrays containing either class probabilities (predict_proba), class predictions, or regression values. How could you inverse transform that with the initial DataFrameMapper? The output and the input have different shapes and different applicable transforms.

I believe you can do the indexing thing you proposed at https://github.com/scikit-learn/scikit-learn/issues/5523#issuecomment-150123228 by just wrapping the numpy array output in a DataFrame, passing as its index the index of the original DataFrame you fed into the pipeline. Am I wrong?
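A minimal sketch of that wrapping step, assuming the pipeline returns one row per input row. The column name "prediction" and the doubled values standing in for the pipeline output are placeholders, not anything from sklearn-pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=["x", "y", "z"])

# Stand-in for the numpy array a fitted pipeline would return for `df`.
output = np.asarray(df["a"]) * 2

# Re-attach the original index so rows can be matched back to the input.
result = pd.DataFrame(output, index=df.index, columns=["prediction"])
```

This only works when the output rows are in the same order as the input rows, which is the usual case for predict and transform.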

Regarding the reason why all the code is in __init__.py, I guess it is because it was a very small module at first and it didn't make a lot of sense to scatter the code across multiple files, although clearly we would need to go that way for clarity if the codebase grows.

One issue we have is that the original maintainer of the package (paulgb) is no longer working on it at all, and the second maintainer (Cal Paterson) has been quite unresponsive in the last few months as well. So it's becoming hard to get new code into this repo, and harder to get it into a release. :(

naught101 commented 8 years ago

Aha.. I wasn't thinking clearly, but now I can: DataFrameMappers can also be useful for generating the y value passed to a fit method. The inverse_transform would then be useful to get back a suitable dataframe. But yes, this would be a different DataFrameMapper to the one used for X.

I guess that would all happen outside the pipeline though..
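A rough sketch of what "outside the pipeline" could look like today, using a plain sklearn LabelEncoder for y rather than a DataFrameMapper. The data, the column names, and the choice of DecisionTreeClassifier are illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame(
    {"height": [1.0, 2.0, 3.0, 4.0], "label": ["low", "low", "high", "high"]},
    index=["a", "b", "c", "d"],
)

# Encode the target outside the pipeline and keep the fitted encoder.
y_encoder = LabelEncoder()
y = y_encoder.fit_transform(df["label"].to_numpy())

model = DecisionTreeClassifier(random_state=0).fit(df[["height"]], y)

# Predictions come back as integer codes; invert them with the same
# encoder and re-attach the original index.
pred_codes = model.predict(df[["height"]])
pred = pd.Series(
    y_encoder.inverse_transform(pred_codes), index=df.index, name="label"
)
```

The encoder for y is a separate object from whatever transforms X, which matches the point above that this would be a different mapper.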

Has anyone working on the code asked @paulgb for push access?


dukebody commented 8 years ago

@calpaterson got write access to this repo, but he's not answering my mails. :S

naught101 commented 8 years ago

Hrm. Is there any reason you couldn't expand the current behaviour to also map the y dataframe? e.g. the call would be mapper = DataFrameMapper(X_features = [(blah...)], y_features = [(blergh)]), and then .fit(), .transform() and .predict() all call whichever transforms are relevant on X and/or y.

dukebody commented 8 years ago

Sounds reasonable. Could you come up with some examples where this y transformation would be useful?

dukebody commented 8 years ago

@naught101 I have write access to this repo now, so we can work this out if you come up with useful use cases. :)

dukebody commented 8 years ago

@naught101 you might want something similar to what is discussed in https://github.com/paulgb/sklearn-pandas/issues/13 ?

naught101 commented 8 years ago

Yeah, I suspect that #13 is a prerequisite for this issue..

ethanluoyc commented 8 years ago

Say the transformed dataframe has exactly the same shape as the dataframe before the transformation. Can we pass in the columns to regenerate the predicted results in DataFrame format?

dukebody commented 8 years ago

@ethanluoyc Could you provide a code example of how that feature would work? Not the implementation, but how one would use it.

ethanluoyc commented 8 years ago

I am doing something on basketball, so I will just give an example from that. Say I have this dataframe:

[screenshot 2015-11-08 22 02 01]

After the conversion I will get something like this:

[screenshot 2015-11-08 22 03 52]

This basically substitutes based on the position of the keyword (which is the name) that I have in a text string, for example,

"Jumpball: (Zydrunas Ilgauskas)\PN vs. (Kendrick Perkins)\PN ((Mo Williams)\PN gains possession)"

So the two dataframes actually have the same shape. I don't know whether I can do such an inverse transformation.

I checked out #13 and I think the approach can work. However, while going through the sklearn documentation I stumbled on their docs for the attribute active_features_, so I decided to look into this in more detail once I figure out what the active_features_ attribute does.

dukebody commented 8 years ago

I believe we can do the inverse transformation if:
* We track which array columns correspond to each dataframe column.
* Every transformer used has an inverse_transform(X) method.

It shouldn't be too hard to do. Any takers? :)

Yevgnen commented 7 years ago

Can sklearn-pandas inverse_transform the transformed data right now?

dukebody commented 7 years ago

No, it can't right now.

dukebody commented 6 years ago

The last attempt to do this was https://github.com/pandas-dev/sklearn-pandas/pull/56 but it stalled waiting for input from another dev. Perhaps we can pick it up again?

devforfu commented 6 years ago

Am I right that this feature should be something like:

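import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'colA': list('ynyyn'), 'colB': list('abcab')})
mapper = DataFrameMapper([
    ('colA', [LabelEncoder()]),
    ('colB', [LabelEncoder()]),
])
transformed = mapper.fit_transform(df)
# Proposed method; DataFrameMapper does not provide it yet.
restored = mapper.inverse_transform(transformed)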

Where transformed will be something like:

np.array([[1, 0],
          [0, 1],
          [1, 2],
          [1, 0],
          [0, 1]])

And, restored is the original dataframe:

colA colB
   y    a
   n    b
   y    c
   y    a
   n    b

So, basically, the DataFrameMapper will be able to "roll back" the result into the original dataframe like sklearn transformers do?

dukebody commented 6 years ago

@devforfu yes, this is what I understand.

To do so we need to keep track of which columns correspond to which features in the transformed output, and then run the transformer inverse on each block.
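A hand-rolled sketch of that bookkeeping, written against plain sklearn transformers rather than DataFrameMapper internals. The `steps` list and `spans` dict are illustrative names, and this is not the implementation from any of the pull requests:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

df = pd.DataFrame({"colA": list("ynyyn"), "colB": list("abcab")})
steps = [("colA", LabelEncoder()), ("colB", LabelBinarizer())]

# Forward pass: transform each column and remember which output
# columns (start, stop) it produced and whether the output was 1-D.
blocks, spans, start = [], {}, 0
for name, transformer in steps:
    out = np.asarray(transformer.fit_transform(df[name].to_numpy()))
    was_1d = out.ndim == 1
    out = out.reshape(len(df), -1)  # 1-D outputs become a single column
    spans[name] = (start, start + out.shape[1], was_1d)
    start += out.shape[1]
    blocks.append(out)
transformed = np.hstack(blocks)

# Inverse pass: slice each block back out and run the transformer's inverse.
restored = {}
for name, transformer in steps:
    lo, hi, was_1d = spans[name]
    block = transformed[:, lo:hi]
    if was_1d:
        block = block.ravel()  # e.g. LabelEncoder expects a 1-D array
    restored[name] = transformer.inverse_transform(block)
restored = pd.DataFrame(restored, index=df.index)
```

Here colA occupies one output column and colB occupies three, so without the recorded spans there would be no way to tell which columns to hand to which transformer.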

erikjandevries commented 6 years ago

Hi all, I've worked on a fork to create a solution for this problem. It passes the test

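import pandas as pd
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn_pandas import DataFrameMapper

def test_inverse_transform_multicolumn():
    df = pd.DataFrame({'colA': list('ynyyn'), 'colB': list('abcab'), 'colC': list('sttts')})
    mapper = DataFrameMapper([
        ('colA', LabelEncoder()),
        ('colB', LabelBinarizer()),  # expands into one column per class
        ('colC', LabelEncoder()),
    ])

    transformed = mapper.fit_transform(df)
    # inverse_transform is the method added in the fork
    restored = mapper.inverse_transform(transformed)

    assert isinstance(restored, pd.DataFrame)
    assert restored.equals(df)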

which includes a LabelBinarizer that generates multiple columns. So far I'm assuming the mapper takes a pandas data frame and outputs a numpy array; I'm not yet dealing with self.input_df.

I'd like to improve this solution. I've now included an extra self.transformed_cols_ to keep track of mapped columns, but that should ideally be integrated with self.transformed_names_. However, I haven't yet checked the implications of modifying the latter, so that's why I've simply added the parameter for now.

What would be the next steps? I've no idea if somebody else is already working on this, but I'm assuming I'll update my solution, commit it to my fork and then click on 'pull request' in my forked repository on GitHub? Do I need to keep anything else in mind?

devforfu commented 6 years ago

@erikjandevries I guess you only need to run tox to see if all tests pass. Probably, add a couple more tests to see if your implementation correctly handles other cases, e.g. several transformers, like:

mapper = DataFrameMapper([
    ('colA', [CategoricalImputer(), LabelEncoder()]),
    ('colB', [Imputer(), StandardScaler()]),
    # other transformers
])

Or maybe any other edge cases.

Then, if everything is fine, you could make a pull request and wait for a review from the repo owners, as well as a response from Circle CI, which could show whether your implementation has any issues.

Whamp commented 5 years ago

interested to see if there's been any progress on this issue. Seems like a pretty major limitation to not be able to recover the original data after transformation.

Whamp commented 5 years ago

is there any issue with @erikjandevries code here? looks fine to me but hasn't been accepted

https://github.com/scikit-learn-contrib/sklearn-pandas/pull/133/commits/1b4edd9e9a7de56a25259b288150d06ece9701fd

dukebody commented 5 years ago

I'm very sorry, I'm busy lately with other stuff in my life and haven't managed to review this... Would any of you be interested in becoming a project admin with merge rights?


erikjandevries commented 5 years ago

I'm sorry to say I've also been very busy. If I'm not mistaken, the problem with my code was that I created a new variable self.transformed_cols_ where I should have used the existing self.transformed_names_. I did this because I wasn't sure what I might break otherwise, or how to use the transformed names variable... It's been a long time, and I think I found another way around the problem I was dealing with at the time, but perhaps the update could still be useful.

https://github.com/scikit-learn-contrib/sklearn-pandas/pull/133

devforfu commented 5 years ago

@dukebody I usually track the sklearn_pandas repository changes and pull-requests and use it in my daily tasks so I could work on this if nobody else decides to take this responsibility.

dukebody commented 5 years ago

@devforfu Thanks! I've sent you an invite to become a collaborator with write access to this repo, so you can merge stuff. Do you have an account on PyPI so I can give you access to publish new releases there?

devforfu commented 5 years ago

@dukebody Sure, not a problem! Yes, I've created one, the username is devforfu.

dukebody commented 5 years ago

@devforfu Added you to PyPI. I guess you should have received some kind of notification about it.

Can you take care of managing the next release after working through the existing PRs?

devforfu commented 5 years ago

@dukebody Yes, the notification was received.

Ok, sure, will do as soon as I finalize the pending changes.

AlanGanem commented 4 years ago

Hello guys. Any update on this issue?

sxooler commented 4 years ago

I am joining @AlanGanem: Is there any update? I can see some updates in #133 and #182 , but it's already been more than 1 year and nothing was approved and merged.

GitHunter0 commented 3 years ago

Yes, it is a pity, this would be a very useful feature.