scikit-learn-contrib / skglm

Fast and modular sklearn replacement for generalized linear models
http://contrib.scikit-learn.org/skglm
BSD 3-Clause "New" or "Revised" License

Does skglm support partial_fit? #186

Closed rookie0620 closed 1 year ago

rookie0620 commented 1 year ago

I wonder whether skglm (e.g. MCPRegression) supports online learning, like scikit-learn's partial_fit. If not, what should I do if I want to process some very, very big data? @Badr-MOUFAD

Badr-MOUFAD commented 1 year ago

Unfortunately, skglm doesn't have an online learning solver for MCPRegression.

what should i do if I want process some very very big data

How big is the data and how much computing power do you have access to? The answer to these questions should determine what can be done.

Here are some hints to help you get started.

rookie0620 commented 1 year ago

Thanks for the reply @Badr-MOUFAD. I wish to perform feature selection over a million features with MCPRegression. The data takes up a lot of memory: for example, reading 200 lines from a CSV with pandas takes about 20 GB, which is quite annoying. So I have two questions:

1. What will happen if I read the data in batches and call the fit function again and again?
2. Is there any other data format or Python library I could choose?

Badr-MOUFAD commented 1 year ago

1. What will happen if I read the data in batches and call the fit function

You can fit on batches of data by warm-starting AndersonCD from one batch to the next.

Assuming that you have a data loader that handles shuffling, splitting, and organizing the data into batches, the implementation should resemble:

# imports (paths as of skglm >= 0.3; older versions expose
# compiled_clone as skglm.utils.compiled_clone)
import numpy as np
from skglm.datafits import Quadratic
from skglm.penalties import MCPenalty
from skglm.solvers import AndersonCD
from skglm.utils.jit_compilation import compiled_clone

# init your solver
datafit = compiled_clone(Quadratic())
penalty = compiled_clone(MCPenalty(alpha, gamma))

solver = AndersonCD(warm_start=True)

# init warm start params
w_init = np.zeros(n_features)

# fit on batches
for X_batch, y_batch in dataloader(X, y):
    # prime datafit on the current batch
    datafit.initialize(X_batch, y_batch)

    # Xw_init must be computed on the current batch's design matrix
    Xw_init = X_batch @ w_init

    # warm-started solve
    w, *_ = solver.solve(
        X_batch, y_batch,
        datafit, penalty,
        w_init, Xw_init,
    )

    # carry the coefficients over to the next batch
    w_init = w
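The `dataloader` above is assumed to exist; a minimal sketch of one, shuffling in-memory arrays and yielding fixed-size row batches (all names and sizes are illustrative, not part of skglm):

```python
import numpy as np

def dataloader(X, y, batch_size, seed=0):
    """Shuffle rows once, then yield (X_batch, y_batch) pairs."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# usage: 10 samples, 3 features, batch_size=4 -> batches of 4, 4, 2 rows
X = np.arange(30.).reshape(10, 3)
y = np.arange(10.)
batches = list(dataloader(X, y, batch_size=4))
```

Note that with variable batch sizes, Xw_init has to be recomputed per batch, as in the loop above.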

Nevertheless, I cannot say much about the convergence of this scheme, nor guarantee the relevance of the solution.
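As for producing the batches without loading the whole CSV at once, pandas can stream the file in chunks, and downcasting to float32 halves the footprint versus the default float64. A minimal sketch (the in-memory CSV stands in for the real file; column names and chunk size are illustrative):

```python
import io
import numpy as np
import pandas as pd

# stands in for the real (huge) CSV file -- illustrative only
csv_file = io.StringIO(
    "f0,f1,f2,y\n" + "\n".join("1.0,2.0,3.0,4.0" for _ in range(10))
)

n_rows = 0
# stream 4 rows at a time, stored as float32 instead of the default float64
for chunk in pd.read_csv(csv_file, chunksize=4, dtype=np.float32):
    X_batch = chunk.iloc[:, :-1].to_numpy()  # feature columns
    y_batch = chunk.iloc[:, -1].to_numpy()   # target column
    n_rows += len(chunk)
```

Each chunk could then feed one warm-started solve in the batch loop above.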

2. Is there any other data format

I'm afraid there is no other data format that would reduce the data's memory footprint at compute time.

Python library I can choose

For instance, have a look at the distributed frameworks Dask ML, Spark ML, and RAPIDS. Though, I doubt they support MCP regression.