Closed by @pared 1 year ago
I think it's easy enough for users to add integrations as needed (or for the dvc team to add them in response to demand), so it's probably not worthwhile to spend time adding more now.
How do we plan to handle dependencies for multiple frameworks? Each supported framework is pretty heavy, and I think it's unreasonable already to expect an XGBoost user to install Tensorflow to use dvclive. Similar concerns would apply for dvcx.
Thoughts @pared @dmpetrov ?
See #25 for more discussion of dependency management.
@dberenbaum I think leaving particular implementations for our users is a good idea, those are easy tasks. Writing tests might be harder, but I guess we can help users write them, instead of doing all the legwork, not even knowing whether particular integrations will be desired by userbase.
As to installation, you are right, we already do it in dvc (for different backends) and we will have to go this way here too.
On second thought here, is it worthwhile to add sklearn integration? Since this is such a large framework, integration may be more complex, and if you have an opinion about how to implement it, probably better to add the integration now than wait for contributions. Even if it means implementing one particular model or class of models, it may be a worthwhile template. Thoughts?
Makes sense, I will get to that once I am done with supporting `dvclive` outputs caching.
sklearn is largely not focused on deep learning, which has been the primary use case for dvclive. Should other algorithms be supported? If the primary purpose is to track model training progress, it seems only useful where models are trained iteratively. I only know of a couple of classes of algorithms where this is true:
@dberenbaum Yes, after digging through the documentation, it seems to me that in general, learning algorithms divide into those which implement only `fit` and those which implement both `fit` and `partial_fit`. It does not seem to me that we can provide an integration for "only `fit`" models, and in the case of `partial_fit` models, the workflow will probably look more like the `torch` one, which in my opinion does not require any integration, as it's created manually.
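To illustrate the point about `partial_fit` workflows not needing a dedicated integration, here is a minimal sketch of such a manual training loop. The model is a toy stand-in and `log_metric` is a plain stub standing in for a dvclive logging call; the shape of the loop, not the exact dvclive API, is what this assumes.

```python
# Hypothetical manual training loop for a partial_fit-style estimator,
# analogous to a hand-written torch loop. The logger is a stub standing
# in for dvclive; the model is a toy incremental estimator.
import random

history = []

def log_metric(name, value):
    # stand-in for a dvclive logging call
    history.append((name, value))

class TinyAveragedModel:
    """Toy incremental estimator: tracks a running mean of the targets."""
    def __init__(self):
        self.mean = 0.0
        self.n = 0

    def partial_fit(self, batch):
        for y in batch:
            self.n += 1
            self.mean += (y - self.mean) / self.n

    def score(self, batch):
        # negative mean squared error against the running mean
        return -sum((y - self.mean) ** 2 for y in batch) / len(batch)

random.seed(0)
data = [random.gauss(1.0, 0.1) for _ in range(100)]
model = TinyAveragedModel()

for step in range(10):
    batch = data[step * 10:(step + 1) * 10]
    model.partial_fit(batch)
    log_metric("score", model.score(batch))
    # with dvclive, the user would advance to the next step here

print(len(history))  # 10
```

Since the user already owns this loop, they can drop a logging call anywhere in it, which is why an integration adds little here.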
The only place I could probably see some integration is methods accepting a `scoring` param, which can be a `Callable`, but it seems to me it would be really hard to define how such an integration could work.
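One conceivable shape for such an integration is a wrapper around a scoring callable that logs every evaluation. This is purely a sketch: the only sklearn assumption is the scorer protocol `(estimator, X, y) -> float`, and the logging call is a stub, not a real dvclive API.

```python
# Hypothetical sketch: wrap a scikit-learn-style scoring callable so each
# evaluation is logged. `logged.append` stands in for a dvclive call.
logged = []

def logging_scorer(scorer, name):
    def wrapped(estimator, X, y):
        value = scorer(estimator, X, y)
        logged.append((name, value))  # stand-in for dvclive logging
        return value
    return wrapped

# toy estimator and scorer to exercise the wrapper
class ConstantModel:
    def predict(self, X):
        return [1 for _ in X]

def accuracy_scorer(estimator, X, y):
    preds = estimator.predict(X)
    return sum(p == t for p, t in zip(preds, y)) / len(y)

scorer = logging_scorer(accuracy_scorer, "accuracy")
score = scorer(ConstantModel(), [[0], [1], [2], [3]], [1, 1, 0, 1])
print(score)  # 0.75
```

The difficulty noted above remains: the wrapper sees individual scores but has no notion of a "step", so it is unclear how such logs should be structured.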
I am considering working on the integration with `pytorch-lightning`, but I'm not sure where to contribute the new logger (i.e. this repository or `pytorch-lightning` itself). See https://github.com/iterative/dvclive/issues/70#issuecomment-811868255
@daavoo That's great news! Can we do something to help with that pull request?
It has already been approved, so I think it will be merged soon, thanks!
I think it might be a good idea to have separate issues for each integration in order to better track the progress and have specific discussions for each one (i.e. this issue got "populated" by `sklearn`-specific discussions).
@daavoo That is right, in the beginning we intended it to be an umbrella issue, since singular implementations seemed like easy tasks. As the `sklearn` example shows, we should probably track each integration separately.
For future reference:
Changing the name of the issue to focus on `sklearn`. Other integrations should be tracked as separate issues.
Reviving this, as I think that `sklearn` should be the entry point for discussing what `dvclive` can provide in "stepless" scenarios (no deep learning, no gradient boosting), beyond https://github.com/iterative/dvclive/issues/182
Taking a quick look at our example repositories using sklearn (https://github.com/iterative/example-get-started), it looks like it would be a low-hanging fruit to add some utility to go from `(y_true, y_pred)` to PRC / ROC plots.
Given that example repo, we would be removing quite a few lines for users:
```python
import json
import math

from sklearn import metrics

# Given labels, predictions
precision, recall, prc_thresholds = metrics.precision_recall_curve(labels, predictions)
fpr, tpr, roc_thresholds = metrics.roc_curve(labels, predictions)

# ROC has a drop_intermediate arg that reduces the number of points.
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
# PRC lacks this arg, so we manually reduce to 1000 points as a rough estimate.
nth_point = math.ceil(len(prc_thresholds) / 1000)
prc_points = list(zip(precision, recall, prc_thresholds))[::nth_point]

with open(prc_file, "w") as fd:
    json.dump(
        {
            "prc": [
                {"precision": p, "recall": r, "threshold": t}
                for p, r, t in prc_points
            ]
        },
        fd,
        indent=4,
    )

with open(roc_file, "w") as fd:
    json.dump(
        {
            "roc": [
                {"fpr": fp, "tpr": tp, "threshold": t}
                for fp, tp, t in zip(fpr, tpr, roc_thresholds)
            ]
        },
        fd,
        indent=4,
    )
```
To:
```python
from dvclive.sklearn import log_precision_recall_curve, log_roc_curve

log_precision_recall_curve(labels, predictions)
log_roc_curve(labels, predictions)
```
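A possible internal sketch for such a helper, with hypothetical names: the `sklearn.metrics.precision_recall_curve` call is assumed to happen in the caller, and this stdlib-only part handles the subsampling and JSON layout, mirroring the snippet above.

```python
# Hypothetical helper mirroring the manual snippet above. The curve values
# (precision, recall, thresholds) are assumed to come from
# sklearn.metrics.precision_recall_curve; this part only subsamples and dumps.
import json
import math

def dump_prc(precision, recall, thresholds, path, max_points=1000):
    # PRC has no drop_intermediate arg, so subsample to roughly max_points
    nth_point = math.ceil(len(thresholds) / max_points)
    points = list(zip(precision, recall, thresholds))[::nth_point]
    with open(path, "w") as fd:
        json.dump(
            {
                "prc": [
                    {"precision": p, "recall": r, "threshold": t}
                    for p, r, t in points
                ]
            },
            fd,
            indent=4,
        )

# usage with precomputed curve values
dump_prc([1.0, 0.5], [0.5, 1.0], [0.3, 0.7], "prc.json")
```

The output file keeps the same schema DVC already renders as a plot, so users lose no functionality by switching to the helper.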
Seems we should be supporting at least a few popular frameworks. Considering their popularity, we should probably start with:

- sklearn

Worth considering:

- TF and PyTorch - it seems to me that using their pure form is done when users need highly custom models, and probably in those cases they will be able to handle `dvclive` by hand.

@dmpetrov did I miss some popular framework?

EDIT: crossing out FastAi as it has its own issue now