nubank / fklearn

fklearn: Functional Machine Learning
Apache License 2.0

Is there a reason why the `object` in learner logs isn't inside the learner key? #108

Open robotenique opened 4 years ago

robotenique commented 4 years ago

Code sample

Taking a look at the return logs of the learners, e.g. the logistic regression one:

    log = {'logistic_classification_learner': {
        'features': features,
        'target': target,
        'parameters': merged_params,
        'prediction_column': prediction_column,
        'package': "sklearn",
        'package_version': sk_version,
        'feature_importance': dict(zip(features, clf.coef_.flatten())),
        'training_samples': len(df)},
        'object': clf}

Problem description

Is there a reason why the `object` key isn't inside the `logistic_classification_learner` dictionary? With multiple learners in a pipeline, the final `object` depends only on the order of the learners, and the objects of the earlier learners are lost. For example, my pipeline is (logistic_regression, isotonic_calibration). Since the build_pipeline function merges the logs of the two learners, the `object` key in the final log holds only the isotonic calibration model, and I lose the logistic_regression object.
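A minimal sketch of the overwrite, assuming the learner logs are combined with a shallow dictionary merge. The log contents are trimmed, and the placeholder strings stand in for the fitted models; none of this is fklearn's actual output:

```python
# Trimmed-down learner logs; strings are placeholders for fitted models.
logistic_log = {
    "logistic_classification_learner": {"training_samples": 100},
    "object": "fitted_logistic_regression",
}
isotonic_log = {
    "isotonic_calibration_learner": {"training_samples": 100},
    "object": "fitted_isotonic_regression",
}

# Shallow merge: when both dicts share a key, the right-most value wins.
merged_log = {**logistic_log, **isotonic_log}

# Both learner entries survive, but only one top-level "object" remains.
print(sorted(merged_log))
print(merged_log["object"])
```

The per-learner keys are distinct, so they both survive the merge. The shared top-level `object` key is the only collision.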

Expected behavior

Access all learner objects of the pipeline, not just the last one.

Possible solutions

Put the learner object inside the dictionary of the logs:

    log = {'logistic_classification_learner': {
        'features': features,
        'target': target,
        'parameters': merged_params,
        'prediction_column': prediction_column,
        'package': "sklearn",
        'package_version': sk_version,
        'feature_importance': dict(zip(features, clf.coef_.flatten())),
        'training_samples': len(df),
        'object': clf}
        }
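
With the object nested under each learner's own key, a shallow merge no longer collides. A minimal sketch, again assuming a plain dictionary merge and using placeholder strings in place of fitted models:

```python
# Each learner keeps its object inside its own namespaced entry.
logistic_log = {
    "logistic_classification_learner": {
        "training_samples": 100,
        "object": "fitted_logistic_regression",
    },
}
isotonic_log = {
    "isotonic_calibration_learner": {
        "training_samples": 100,
        "object": "fitted_isotonic_regression",
    },
}

# The top-level keys are distinct, so nothing is overwritten.
merged_log = {**logistic_log, **isotonic_log}

# Collect every learner's object by learner name.
objects = {name: entry["object"] for name, entry in merged_log.items()}
print(objects)
```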

caique-lima commented 4 years ago

I'll double-check this, but it seems we have a typo. Looking at the code, this "object" key should be dropped to avoid a huge training log. I'm basing this on this line: https://github.com/nubank/fklearn/blob/master/src/fklearn/training/pipeline.py#L75

If the key were 'obj' instead of 'object', it would be dropped from your learner's log and would be available only in the key '__fkml__', under the learners key. But since the name is 'object', nothing happens.
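
A rough sketch of the behavior described above, not the actual code in pipeline.py. The helper `pop_model_object` and the exact layout under '__fkml__' are assumptions made for illustration:

```python
def pop_model_object(learner_log, drop_key="obj"):
    """Hypothetical helper: return (slim_log, model_object).

    The model object is removed from a copy of the log only when it is
    stored under `drop_key`; otherwise the second value is None.
    """
    slim_log = dict(learner_log)  # copy so the caller's log is not mutated
    model_object = slim_log.pop(drop_key, None)
    return slim_log, model_object


# Logs as returned by two learners; strings are placeholders for models.
learner_logs = {
    "logistic_classification_learner": {
        "logistic_classification_learner": {"training_samples": 100},
        "obj": "fitted_logistic_regression",  # expected key name: dropped
    },
    "isotonic_calibration_learner": {
        "isotonic_calibration_learner": {"training_samples": 100},
        "object": "fitted_isotonic_regression",  # different name: kept
    },
}

pipeline_log = {"__fkml__": {"learners": {}}}
for name, log in learner_logs.items():
    slim_log, model_object = pop_model_object(log)
    pipeline_log.update(slim_log)
    if model_object is not None:
        pipeline_log["__fkml__"]["learners"][name] = model_object

print(pipeline_log["__fkml__"]["learners"])
print("object" in pipeline_log, "obj" in pipeline_log)
```

In this sketch the log that uses 'obj' ends up slim, with its model reachable only under '__fkml__', while the one that uses 'object' carries the model into the merged log untouched.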