karthikchiru12 / ML_Helper

A simple package to assist in Machine Learning tasks

Memory error (cannot allocate, *say*, 200GB for the matrix of size()) while using features with a huge number of categories #5

Open karthikchiru12 opened 3 years ago

karthikchiru12 commented 3 years ago

Whenever the categorical feature we are trying to one-hot encode has a large number of categories, "i.todense()" at line 118 of featurize.py unpacks the sparse matrix into a huge dense array, and the program stops there with a memory error. Instead of np.hstack(), we can use scipy.sparse.hstack() to stack the sparse matrices directly, without converting them to dense matrices. scipy.sparse.hstack() also supports stacking sparse and dense matrices together.
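A minimal sketch of the idea, not the actual featurize.py code: the names one_hot, numeric, and features and the matrix sizes are made up for illustration. The dense block is wrapped in csr_matrix() explicitly here so that every block passed to scipy.sparse.hstack() is sparse.

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot block: 5 rows, 1,000,000 categories, one non-zero per row.
n_rows, n_categories = 5, 1_000_000
rows = np.arange(n_rows)
cols = np.array([3, 999_999, 42, 3, 500_000])
one_hot = sparse.csr_matrix(
    (np.ones(n_rows), (rows, cols)), shape=(n_rows, n_categories)
)

# Hypothetical dense numeric features (values 1..10, so no zeros).
numeric = np.arange(1, 11, dtype=float).reshape(n_rows, 2)

# Stack column-wise without ever building the dense 5 x 1,000,000 array.
features = sparse.hstack([one_hot, sparse.csr_matrix(numeric)], format="csr")

print(features.shape)            # number of rows, total number of columns
print(sparse.issparse(features)) # the result stays sparse
```

Only the non-zero entries are stored, so the stacked result stays small even though it has over a million columns.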

karthikchiru12 commented 3 years ago

scipy.sparse.hstack() works perfectly for regression use cases. For classification problems, however, most of the algorithms call something like x_train.todense(), which brings back the same error.
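To see why todense() fails, here is a rough sketch that estimates how much memory a dense copy would need before converting. The helpers dense_size_bytes() and to_dense_if_small() and the 1 GB limit are hypothetical, not part of ML_Helper or scipy, and the matrix shape is chosen only to reproduce the illustrative 200GB figure from the title.

```python
import numpy as np
from scipy import sparse


def dense_size_bytes(matrix):
    """Bytes a dense copy of `matrix` would need: rows * cols * itemsize."""
    n_rows, n_cols = matrix.shape
    return n_rows * n_cols * matrix.dtype.itemsize


def to_dense_if_small(matrix, limit_bytes=1_000_000_000):
    """Return a dense array only if it fits under `limit_bytes` (hypothetical guard)."""
    if dense_size_bytes(matrix) > limit_bytes:
        return matrix  # keep it sparse instead of running out of memory
    return matrix.toarray()


# Hypothetical training matrix: 250,000 rows x 100,000 one-hot columns, float64.
n_rows, n_cols = 250_000, 100_000
rows = np.arange(n_rows)
cols = rows % n_cols
x_train = sparse.csr_matrix(
    (np.ones(n_rows), (rows, cols)), shape=(n_rows, n_cols)
)

print(dense_size_bytes(x_train) / 1e9, "GB needed for a dense copy")
print(sparse.issparse(to_dense_if_small(x_train)))
```

With 8-byte floats, 250,000 x 100,000 entries come to 200,000,000,000 bytes, so the dense copy needs about 200 GB while the sparse matrix itself takes only a few megabytes.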