scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Creating a sequential validation dataset for adaptive SGD #5087

Closed qqwjq closed 6 years ago

qqwjq commented 9 years ago

I'm implementing an adaptive SGD algorithm as described in the paper below, which suggests splitting the data into training and validation sets and alternating between an SGD step for the model parameters and an SGD step for the penalty (regularization) parameter.

Steffen Rendle (2012): Learning Recommender Systems with Adaptive Regularization, in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM 2012), Seattle.
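
To make the alternating scheme concrete, here is a minimal sketch on a toy problem. It is not the algorithm from the paper and not scikit-learn code: it assumes a 1-D least-squares model `y ~ w * x` with an L2 penalty `lam`, and the function name `adaptive_sgd` and all hyperparameters are made up for illustration. The parameter step uses a training example; the penalty step uses a validation example and the fact that the next value of `w` depends on `lam`.

```python
import random


def adaptive_sgd(train, valid, lr=0.05, lr_lambda=0.01, epochs=200, seed=0):
    """Alternate SGD steps on a weight w (training data) and a penalty lam (validation data).

    Toy illustration only: 1-D least squares with an L2 penalty, not a
    factorization machine and not scikit-learn's SGD implementation.
    """
    rng = random.Random(seed)
    w, lam = 0.0, 0.1
    for _ in range(epochs * len(train)):
        # Step 1: SGD step on w from one training example, with lam held fixed.
        x, y = rng.choice(train)
        grad_w = (w * x - y) * x + lam * w
        w_next = w - lr * grad_w

        # Step 2: SGD step on lam from one validation example.
        # w_next contains the term -lr * lam * w, so d(w_next)/d(lam) = -lr * w.
        xv, yv = rng.choice(valid)
        grad_lam = (w_next * xv - yv) * xv * (-lr * w)
        lam = max(0.0, lam - lr_lambda * grad_lam)  # keep the penalty non-negative

        w = w_next
    return w, lam


# Noise-free toy data generated from y = 2 * x.
train = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
valid = [(x, 2.0 * x) for x in (0.75, 1.25, 1.75)]
w, lam = adaptive_sgd(train, valid)
```

On noise-free data like this the penalty only hurts the validation loss, so `lam` shrinks from its starting value and `w` moves toward the true slope. The point of the sketch is only the structure: two datasets are needed at the same time, which is why both dataset objects must stay valid side by side.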

I noticed an issue when creating two instances of "CSRDataset": once "validation_dataset" is created, the original "dataset" is destroyed. I've been playing with this for a while but haven't found a way around the problem. Can someone suggest a fix? Thank you very much!

dataset = CSRDataset(X.data, X.indptr, X.indices, y_i, sample_weight)
validation_dataset = CSRDataset(X_validation.data, X_validation.indptr, X_validation.indices, y_validation, sample_weight_validation)

mblondel commented 9 years ago

I think this might be related to the fact that CSRDataset doesn't keep a reference to the data. From Python's point of view, X appears like it's not used anymore and so gets garbage collected.

The problem is fixed in this PR, which is not merged yet: https://github.com/scikit-learn/scikit-learn/pull/4738/files#diff-0cd8bf0fd114c7c361fe6ab76cee08e3R183

A quick fix would be to add a reference to X at the end of your program. Same for X_validation.
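
The lifetime problem and the quick fix can be illustrated without scikit-learn. The sketch below is an analogy under stated assumptions, not the real `CSRDataset`: `Buffer`, `NonOwningDataset`, `OwningDataset` and `build` are hypothetical names, and a weak reference stands in for a raw pointer that does not keep its buffer alive.

```python
import gc
import weakref


class Buffer(list):
    """Stand-in for an array such as X.data (a list subclass can be weak-referenced)."""


class NonOwningDataset:
    """Hypothetical dataset that does not keep its input alive.

    The weak reference plays the role of a raw pointer: it does not stop
    the buffer from being garbage collected.
    """

    def __init__(self, data):
        self._data_ref = weakref.ref(data)

    def is_valid(self):
        # True while the underlying buffer still exists.
        return self._data_ref() is not None


class OwningDataset(NonOwningDataset):
    """Same dataset, but it also stores a strong reference to the buffer."""

    def __init__(self, data):
        super().__init__(data)
        self._keep_alive = data  # the "quick fix", moved into the object itself


def build(dataset_cls):
    data = Buffer([1.0, 2.0, 3.0])  # temporary buffer created inside a helper
    return dataset_cls(data)  # the local name `data` disappears on return


non_owning = build(NonOwningDataset)
owning = build(OwningDataset)
gc.collect()

print(non_owning.is_valid(), owning.is_valid())
```

In user code the equivalent of `_keep_alive` is simply holding on to `X`, `X_validation`, and any arrays derived from them (for example in a tuple or as attributes) for as long as the dataset objects are in use.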

amueller commented 6 years ago

closing as these are internals that we probably don't want to maintain.