Closed rookie0620 closed 1 year ago
Unfortunately, skglm doesn't have an online learning solver for MCPRegression.
What should I do if I want to process some very big data?
How big is the data and how much computing power do you have access to? The answer to these questions should determine what can be done.
Here are some hints to help you get started
Big data with low computing power
Check whether the data admits a sparse format; it would reduce its memory requirements. skglm supports sparse datasets and can, for instance, run on the rcv1 dataset which, if densified, doesn't fit in memory.
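To see why the sparse hint matters, here is a small sketch (using only NumPy/SciPy; the density and sizes are made up for illustration) comparing the memory footprint of a dense array and its CSC counterpart. skglm accepts SciPy sparse matrices directly in `fit`.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Dense matrix where ~1% of the entries are non-zero
X_dense = rng.standard_normal((1000, 1000))
X_dense[X_dense < 2.3] = 0.0

# CSC stores only the non-zeros plus index arrays
X_sparse = sparse.csc_matrix(X_dense)

dense_bytes = X_dense.nbytes
sparse_bytes = (
    X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
)
print(dense_bytes, sparse_bytes)  # dense is dozens of times larger
```

At 1% density the sparse representation is roughly two orders of magnitude smaller, which is exactly what makes rcv1-scale problems tractable.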
Big data with access to high computing power
Choose a computer cluster with enough memory to accommodate the data, and skglm should work as expected.
Thanks for the reply @Badr-MOUFAD. I wish to perform feature selection on a million features with MCPRegression. The data takes up a lot of memory: for example, if I read the data from CSV with pandas, 200 lines take about 20 GB of memory, which is quite annoying. So I have two questions:
1. What will happen if I read in the data in batches and call the fit function again and again?
2. Is there any other data format or Python library I can choose?
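On the memory pain specifically: pandas parses numeric CSV columns to float64 by default, and reading the whole file at once maximizes the peak. Reading in chunks with `chunksize` and downcasting to float32 bounds peak memory. A minimal sketch, using a small in-memory CSV as a stand-in for the large on-disk file:

```python
import io

import numpy as np
import pandas as pd

# Small in-memory CSV standing in for the large file on disk
csv_buf = io.StringIO(
    pd.DataFrame(np.arange(50.0).reshape(10, 5)).to_csv(index=False)
)

n_rows = 0
# chunksize bounds peak memory; float32 halves it vs pandas' float64 default
for chunk in pd.read_csv(csv_buf, chunksize=4, dtype=np.float32):
    n_rows += len(chunk)
print(n_rows)  # 10
```

For a real file you would pass its path instead of the `StringIO` buffer; each `chunk` is a regular DataFrame you can hand to a batch-fitting loop.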
1. What will happen if I read in the data in batches and call the fit function?
You can fit on batches of data by warm starting AndersonCD from one batch to the other.
Assuming that you have a data loader that handles shuffling, splitting, and organizing the data into batches, the implementation should resemble
```python
import numpy as np

from skglm.datafits import Quadratic
from skglm.penalties import MCPenalty
from skglm.solvers import AndersonCD
from skglm.utils.jit_compilation import compiled_clone  # skglm.utils in older releases

# init your solver
datafit = compiled_clone(Quadratic())
penalty = compiled_clone(MCPenalty(alpha, gamma))
solver = AndersonCD(warm_start=True)

# init warm start params
w_init = np.zeros(n_features)
Xw_init = np.zeros(batch_size)

# fit on batches
for X_batch, y_batch in dataloader(X, y):
    # prime datafit on the current batch
    datafit.initialize(X_batch, y_batch)

    # warm-started solve
    w, *_ = solver.solve(
        X_batch, y_batch,
        datafit, penalty,
        w_init, Xw_init,
    )

    # update warm start vars for the next batch
    w_init = w
    Xw_init = X_batch @ w
```
Nevertheless, I cannot say much about the convergence of this scheme, nor guarantee the relevance of the solution.
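The snippet above assumes a `dataloader` helper that is not part of skglm. A minimal NumPy sketch could look like the following; note it yields equal-sized batches (dropping any ragged tail) so that the warm-start `Xw_init` from one batch matches the row count of the next:

```python
import numpy as np

def dataloader(X, y, batch_size=4, seed=0):
    """Yield shuffled (X_batch, y_batch) pairs of equal size.

    Hypothetical helper standing in for the `dataloader` used above;
    the ragged last batch is dropped to keep batch sizes constant.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_full = (len(y) // batch_size) * batch_size
    for start in range(0, n_full, batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# Toy data: 10 samples, 2 features
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

batches = list(dataloader(X, y, batch_size=4))
print(len(batches))  # 2 full batches of 4; the last 2 samples are dropped
```

If you do not want to drop samples, you can instead recompute `Xw_init = X_batch @ w_init` at the top of each iteration so its length always matches the current batch.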
2. Is there any other data format
I'm afraid there is no other format to reduce the data storage at compute time.
Is there any other Python library I can choose?
Have a look, for instance, at distributed frameworks such as Dask-ML, Spark ML, and RAPIDS. Though, I doubt they support MCP regression.
I wonder whether skglm (e.g. MCPRegression) supports anything like online learning in scikit-learn. If not, what should I do if I want to process some very big data? @Badr-MOUFAD