scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP 007 feature-name generation: adding a constructor argument for verbosity #32

Closed amueller closed 4 years ago

amueller commented 4 years ago

Basically right now SLEP 007 suggests to add a constructor parameter to all transformers that are not feature selectors, right?

I see the motivation and I think it's actually hard to come up with a better solution (I don't have one right now), but I'm also concerned with adding feature name specific things to the constructor. It seems orthogonal to the working of the estimator.

I think the alternative that @adrinjalali suggested was having callbacks for formatting feature names (I don't remember the details tbh), but that was pretty complex.

Maybe we could have a method that people could overwrite or something like that? I'm not sure. To me this is the one contentious part of the SLEP.

amueller commented 4 years ago

cc @glemaitre @NicolasHug I guess?

glemaitre commented 4 years ago

It seems orthogonal to the work

Could you be more specific? Is it that the parameter itself is more related to some scikit-learn goodies instead of some specific internal estimator machinery?

callbacks for formatting feature names

I think it would be a neat proposal. Having a callbacks parameter to which you could pass a list of callbacks (verbose, feature_names, early_stopping, etc.) would be something that I would like. However, if we go directly in this direction, I am afraid that we would delay the support for feature names.

Maybe the trade-off here would be to have something that we can easily remove in the future in favor of a more generic callback solution. It is true that adding an estimator parameter will be annoying once we want to get rid of it. Overriding a method is maybe a better option.

NicolasHug commented 4 years ago

I have suggested using formats, e.g. "log({input_name})". That can be an attribute.
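A minimal sketch of the format-string idea, assuming a hypothetical transformer: the template lives in a plain attribute and is applied when generating output names. The class name, attribute name, and method signature here are illustrative, not part of the SLEP.

```python
import numpy as np

class LogTransformer:
    # hypothetical attribute holding the naming template
    feature_name_fmt = "log({input_name})"

    def fit(self, X, y=None):
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        return np.log(X)

    def get_feature_names_out(self, input_features):
        # each output name is produced by formatting the template
        return [self.feature_name_fmt.format(input_name=name)
                for name in input_features]

est = LogTransformer().fit(np.array([[1.0, 2.0]]))
print(est.get_feature_names_out(["a", "b"]))  # ['log(a)', 'log(b)']
```

Because the template is an attribute rather than a constructor parameter, tuning it would mean assigning to the attribute (or overriding the class default) rather than passing it at construction time.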

glemaitre commented 4 years ago

@NicolasHug could you provide a minimal example for a class? I am not sure what it would look like.


NicolasHug commented 4 years ago

https://github.com/scikit-learn/enhancement_proposals/pull/17/files#r378916622

glemaitre commented 4 years ago

OK, but you still need to pass the formatting to the class in case you want to tune it, don't you?


NicolasHug commented 4 years ago

Yes, I just think it's a simpler alternative to a method

adrinjalali commented 4 years ago

Basically right now SLEP 007 suggests to add a constructor parameter to all transformers that are not feature selectors, right?

not really. My intention was to suggest the possibility of having such a parameter to transformers, wherever we think there's demand for more than one way of feature generation. Is the concern that we'll need to add the parameter to many estimators, or that we don't want to have such a constructor parameter at all?

However, going directly in this direction, I am scared that we delay the support for feature names.

To me this is the one contentious part of the SLEP.

I think we can treat those as enhancements and leave them for future PRs, which won't even need a SLEP. My idea is that the SLEP can go ahead with a True/False verbosity level, and the rest can be considered future backward-compatible enhancements.

amueller commented 4 years ago

Sorry my phone seems to have given me grief and cut off a sentence. What I meant to say was:

It seems orthogonal to the working of the estimator.

In other words most of the parameters (all but verbose really) influence what the model does. This one would not. That's why I'm hesitant to add it to the constructor. And yes, my concern is that in my understanding of the SLEP we would have to add it in many many places.

not really. My intention was to suggest the possibility of having such a parameter to transformers, wherever we think there's demand for more than one way of feature generation.

From the SLEP it sounded to me like that was an intrinsic aspect of the feature name API. If I misunderstood that, then maybe it should be clarified. So you're suggesting that we figure out whether to add this parameter or not on a case-by-case basis? How do we decide where we add it and what the default is?

amueller commented 4 years ago

I'm happy to do a simple solution that we can easily implement. I think my initial thought was not to change the feature names in any of the one-to-one transformers and that's what I had implemented, so it would be verbose_feature_names=False. If we want to implement that and delay further additions to the future I'm fine with that. Right now I read the SLEP as adding a constructor argument to a large number of estimators and I was skeptical about that.
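The behavior described here can be sketched for a one-to-one transformer: with `verbose_feature_names=False` the input names pass through unchanged, and with `True` they are decorated. The parameter name comes from the comment above; the decorated form (`scale(...)`) is an assumption for illustration only.

```python
class ScalerSketch:
    """Hypothetical one-to-one transformer, not a real scikit-learn class."""

    def __init__(self, verbose_feature_names=False):
        self.verbose_feature_names = verbose_feature_names

    def get_feature_names_out(self, input_features):
        if not self.verbose_feature_names:
            # one-to-one: keep input names untouched
            return list(input_features)
        # decorated form is purely illustrative
        return [f"scale({name})" for name in input_features]

print(ScalerSketch().get_feature_names_out(["age", "height"]))
# ['age', 'height']
print(ScalerSketch(verbose_feature_names=True).get_feature_names_out(["age"]))
# ['scale(age)']
```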

adrinjalali commented 4 years ago

How do we decide where we add it and what the default is?

If the implementation by the implementer w/o the parameter causes a discussion in the review process, we'd add the parameter to make everybody happy. At least that's what I thought.

I'm happy to do a simple solution that we can easily implement. I think my initial thought was not to change the feature names in any of the one-to-one transformers and that's what I had implemented, so it would be verbose_feature_names=False. If we want to implement that and delay further additions to the future I'm fine with that.

I'd also be very much happy about that.

amueller commented 4 years ago

Would you be happy rephrasing it as in #38 ?

jnothman commented 4 years ago

I like the idea of a format_feature_name method that can be overridden. What variables to provide it is one of several messy questions in this approach.
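The overridable-method idea could look like the following sketch, assuming a hypothetical one-hot-style transformer: the base class calls `format_feature_name` for each output name, and a subclass changes the naming without touching `__init__`. The method name is taken from the comment above; its signature is exactly the open question, so the arguments here are one arbitrary choice.

```python
class OneHotSketch:
    """Hypothetical transformer; only the naming logic is shown."""

    def format_feature_name(self, input_name, category):
        # default naming scheme; subclasses may override
        return f"{input_name}={category}"

    def get_feature_names_out(self, input_features, categories):
        return [self.format_feature_name(f, c)
                for f, cats in zip(input_features, categories)
                for c in cats]

class ShortNames(OneHotSketch):
    def format_feature_name(self, input_name, category):
        # user override: drop the input name entirely
        return str(category)

print(OneHotSketch().get_feature_names_out(["color"], [["red", "blue"]]))
# ['color=red', 'color=blue']
print(ShortNames().get_feature_names_out(["color"], [["red", "blue"]]))
# ['red', 'blue']
```

The downside raised below applies directly: changing the naming requires subclassing (or monkey-patching) rather than passing an argument.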

adrinjalali commented 4 years ago

I like the idea of a format_feature_name method that can be overridden. What variables to provide it is one of several messy questions in this approach

Wouldn't that mean the user should subclass an estimator to be able to change the behavior? Seems a bit too cumbersome.

I think I'd be happy with methods which set those flags for the estimator, and not have them as an __init__ param. Same with the approach in the sample props proposal/PR. But that means we'd need to have a better way of handling such private attributes, and have a better clone which can handle those cases systematically.

amueller commented 4 years ago

The users could just overwrite a method by assignment and make everyone with an OO background cringe ;)

Any of these seem like we need to introduce a new mechanism to sklearn though. I'm not opposed to adding a method that allows setting these kinds of parameters (in D3M they are called control parameters, I think). An alternative would be to have a single constructor argument that encapsulates anything that is not related to estimator behavior that will come up in the future and have it be a dict. Arguably the verbosity would be one of those, though.

lr = LogisticRegression(C=0.1,
                        settings={'logging_callback': somefunction,
                                  'feature_name_func': anotherfunction,
                                  'return_dataframe': False})

Unclear if that's better or worse than having a lr.set_config(...) method, but a constructor argument dict wouldn't require rewriting clone. Though clone is arguably pretty weird.
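The `set_config` alternative could be sketched as follows. None of these names exist in scikit-learn; this only illustrates the shape of the API being weighed: control settings live in a private dict rather than in `__init__`, and the method returns `self` so calls can be chained.

```python
class EstimatorSketch:
    """Hypothetical estimator, not a real scikit-learn class."""

    def __init__(self, C=1.0):
        self.C = C           # a regular "behavior" parameter
        self._settings = {}  # control parameters, outside get_params()

    def set_config(self, **settings):
        self._settings.update(settings)
        return self

lr = EstimatorSketch(C=0.1).set_config(return_dataframe=False)
print(lr._settings)  # {'return_dataframe': False}
```

This is where the clone problem bites: since `_settings` is not an `__init__` parameter, a `get_params`-based clone would silently drop it, which is why the thread keeps coming back to either a constructor dict or a smarter clone.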

NicolasHug commented 4 years ago

Arguably the verbosity would be one of those, though

verbose, but also n_jobs, pre_dispatch, cache_size, memory...

I like the idea of distinguishing parameters that have an effect on the predictions (in other words, those that one might want to search over), and the rest.

amueller commented 4 years ago

@NicolasHug this is basically what D3M does. Though it's not super easy to define the distinction. "want to search over" is probably not a good one, as you probably don't want to search over tol but the choice of tol still influences the results. Everything that you mention doesn't change the result, so you could have "numerically the same" be the definition. I guess I included return_dataframe which would drastically change the result.

Another way to think about it is parameters that are there for all or most models vs those that are model specific.

adrinjalali commented 4 years ago

Unclear if that's better or worse than having a lr.set_config(...) method, but a constructor argument dict wouldn't require rewriting clone. Though clone is arguably pretty weird.

I'm trying to think how it'll all look, including the docs. We could, for instance, have it as an __init__ parameter, but also have the set_config or something with a Bunch for convenience. This would also remove the need to change clone.

Then the question is, where do we document these possible arguments? In the init, or in that method, or both? In my mind it'd be nicer to have it only in the method's doc, and not all the info in the init's.

But this still makes the init signature pretty long.

amueller commented 4 years ago

I'd be fine with just adding it in the set_config method (which would mostly be syntactic sugar and a way to surface these settings). It's still unclear to me what the scope would be, though.

adrinjalali commented 4 years ago

It kinda deserves its own SLEP though :/

adrinjalali commented 4 years ago

Does #38 close this then?

jnothman commented 4 years ago

The users could just overwrite a method by assignment and make everyone with an OO background cringe ;)

Won't work with cloning...

amueller commented 4 years ago

I think "fixing" this would require its own SLEP, but I think we're good to move forward for now.