sato9hara / defragTrees

Python code for tree ensemble interpretation
MIT License
83 stars 22 forks

Advice needed for regression case #5

Open stewu5 opened 3 years ago

stewu5 commented 3 years ago

I was able to successfully apply this method to a practical classification problem and got pretty good results. However, when I try to use the default setup for a regression random forest in Python (in a Jupyter notebook environment) with 100,000 data points, the kernel crashes. Is there anything I should be aware of when dealing with regression instead of classification? Thanks.

sato9hara commented 3 years ago

What is the cause of the kernel crash?

stewu5 commented 3 years ago

There was no error or warning message at all before the crash. I left it running overnight, and the next day the kernel was simply reported as dead. Usually that points to a memory issue. So I wonder whether there is something wrong with using the default settings directly on a large dataset with a regression problem.

sato9hara commented 3 years ago

If you look into the code, you will find that `R = self.__getBinary(X, splitter)` is called at the beginning of `__fitFAB`. This generates a large matrix of size (# of data points) × (# of splitting rules), which can require a lot of memory. One possible way to handle large data would be to reimplement `__getBinary` to return a sparse matrix (the matrix contains only 0s and 1s) instead of the dense matrix used in the current implementation.
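A minimal sketch of that sparse-matrix idea, assuming `splitter` is an array whose rows are `(feature index, threshold)` pairs as in the line quoted above. The helper name `get_binary_sparse` is hypothetical, not part of defragTrees, and the actual memory savings depend on how many (sample, rule) pairs evaluate to True:

```python
import numpy as np
from scipy import sparse

def get_binary_sparse(X, splitter):
    """Sparse analogue of the dense __getBinary: R[n, i] is True when
    sample n satisfies splitting rule i, i.e. X[n, f_i] >= t_i."""
    num = X.shape[0]
    rules = splitter.shape[0]
    rows, cols = [], []
    for i in range(rules):
        # Indices of the samples that satisfy rule i.
        hit = np.where(X[:, int(splitter[i, 0])] >= splitter[i, 1])[0]
        rows.append(hit)
        cols.append(np.full(hit.shape[0], i))
    rows = np.concatenate(rows)
    cols = np.concatenate(cols)
    data = np.ones(rows.shape[0], dtype=bool)
    # CSC stores only the True entries, so memory scales with the number
    # of satisfied (sample, rule) pairs rather than num * rules.
    return sparse.csc_matrix((data, (rows, cols)), shape=(num, rules))
```

Downstream code that does dense arithmetic on `R` would also need adapting to sparse operations, so this is only the first step.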

stewu5 commented 3 years ago

I see. I may consider that. Thanks.

stkarlos commented 1 year ago

Thanks for that implementation.

Regarding the regression case, I get the following error message:

```python
mdl.fit(Xtr, ytr, splitter, Kmax, fittype='EM')
```

```
Traceback (most recent call last):
  File "C:\Users\StamatisKarlos\anaconda3\envs\customer\lib\site-packages\pandas\core\indexes\base.py", line 3803, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 144, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), 8)' is an invalid key

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\StamatisKarlos\AppData\Local\Temp\ipykernel_22584\192973455.py", line 1, in <module>
    mdl.fit(Xtr, ytr, splitter, Kmax, fittype='EM')
  File "c:\users\stamatiskarlos\onedrive - satori analytics\documents\git\defragtrees\defragtrees.py", line 266, in fit
    defr = self.fit_defragger(X, y, s, K, fittype, self.modeltype, self.maxitr, self.qitr, self.tol, self.eps, self.delta, self.seed+itr, self.verbose_)
  File "c:\users\stamatiskarlos\onedrive - satori analytics\documents\git\defragtrees\defragtrees.py", line 296, in fit_defragger
    defragger.fit(X, y, splitter, K, fittype=fittype)
  File "c:\users\stamatiskarlos\onedrive - satori analytics\documents\git\defragtrees\defragtrees.py", line 489, in fit
    self.__fitEM(X, y, splitter, K, self.seed)
  File "c:\users\stamatiskarlos\onedrive - satori analytics\documents\git\defragtrees\defragtrees.py", line 534, in __fitEM
    R = self.__getBinary(X, splitter)
  File "c:\users\stamatiskarlos\onedrive - satori analytics\documents\git\defragtrees\defragtrees.py", line 481, in __getBinary
    R[:, i] = X[:, int(splitter[i, 0])] >= splitter[i, 1]
  File "C:\Users\StamatisKarlos\anaconda3\envs\customer\lib\site-packages\pandas\core\frame.py", line 3805, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\StamatisKarlos\anaconda3\envs\customer\lib\site-packages\pandas\core\indexes\base.py", line 3810, in get_loc
    self._check_indexing_error(key)
  File "C:\Users\StamatisKarlos\anaconda3\envs\customer\lib\site-packages\pandas\core\indexes\base.py", line 5968, in _check_indexing_error
    raise InvalidIndexError(key)
InvalidIndexError: (slice(None, None, None), 8)
```

Any suggestions?

Update: the above issue has been solved by providing NumPy arrays rather than pandas DataFrames. However, I am running into obstacles fitting 100k training instances in the regression case. Have you made any progress on that?
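For anyone hitting the same `InvalidIndexError`: pandas DataFrames do not support NumPy-style 2-D positional indexing like `X[:, i]`, which is what `__getBinary` does internally, so converting the inputs to plain arrays first avoids it. A minimal illustration (the variable names here are just for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['a', 'b'])

# NumPy-style 2-D indexing fails on a DataFrame: the tuple key
# (slice(None), 0) is rejected, as in the traceback above.
try:
    df[:, 0]
except Exception as err:
    print(type(err).__name__)  # InvalidIndexError in recent pandas

# Converting to a plain ndarray first makes the same indexing valid:
X = df.to_numpy()  # or df.values
col = X[:, 0]      # column 0 as a 1-D array
```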

Kind regards.