scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

EOFError raised while calling HashingEncoder #434

Closed: shi8tou closed this issue 4 months ago

shi8tou commented 7 months ago

Expected Behavior

HashingEncoder should encode the categorical columns successfully.

Actual Behavior

An EOFError is raised while calling HashingEncoder.

Steps to Reproduce the Problem

  1. Packages installed on my laptop: category_encoders==2.6.0 & python==3.10.0

  2. Dataset is here: test_1.csv

  3. Run the following code:

    import pandas as pd
    import category_encoders as ce

    dataset = pd.read_csv('test_1.csv')
    he = ce.HashingEncoder(cols=['purchase_address'], n_components=2)

    dd = he.fit_transform(dataset)
    dd.columns

  4. The line `dd = he.fit_transform(dataset)` throws an EOFError (a stand-in for the CSV is sketched just below this list).
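
Since test_1.csv is not reproduced inline, a hypothetical stand-in DataFrame with the same column name can be used to run the snippet; the addresses below are invented purely for illustration:

    import pandas as pd

    # Hypothetical stand-in for test_1.csv: only the hashed column matters here
    dataset = pd.DataFrame({
        'purchase_address': [
            '917 1st St, Dallas, TX 75001',
            '682 Chestnut St, Boston, MA 02215',
            '669 Spruce St, Los Angeles, CA 90001',
        ]
    })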

bmreiniger commented 7 months ago

I can't recreate the issue in Colab, which is running python 3.10.12.

I got an email with a traceback, but it's not here; some wires got crossed in GitHub, or you deleted it? It sounded like an issue with the parallelization, maybe not enough memory or disk space for it, but I'm not an expert on that.

PaulWestenthanner commented 7 months ago

I also couldn't reproduce it on my local Linux machine using category-encoders 2.6.0 and Python 3.10 in a fresh conda environment. As Ben pointed out, the hashing encoder behaves differently on Windows when it comes to multiprocessing. Are you using Windows or Linux/Mac?
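
For context, the relevant platform difference is the default multiprocessing start method: Linux defaults to fork, while macOS (since Python 3.8) and Windows default to spawn, which re-imports the main module in each child process. A quick stdlib check, as a sketch:

    import multiprocessing as mp

    # 'fork' on Linux, 'spawn' on macOS (Python 3.8+) and on Windows.
    # Under 'spawn', each child re-imports the main module, which is why an
    # unguarded top-level fit_transform call can re-trigger itself.
    print(mp.get_start_method())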

shi8tou commented 7 months ago

Thanks. I am using a MacBook Air with an M2 chip.

shi8tou commented 7 months ago

Here is the error I got:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
        exitcode = _main(fd, parent_sentinel)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
        prepare(preparation_data)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
        _fixup_main_from_path(data['init_main_from_path'])
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
        main_content = runpy.run_path(main_path,
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 269, in run_path
        return _run_module_code(code, init_globals, run_name,
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 96, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/test/test_dict.py", line 8, in <module>
        dd = he.fit_transform(dataset)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/base.py", line 848, in fit_transform
        return self.fit(X, **fit_params).transform(X)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/utils.py", line 315, in fit
        X_transformed = self.transform(X, override_return_df=True)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/utils.py", line 488, in transform
        X = self._transform(X)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/hashing.py", line 174, in _transform
        data_lock = multiprocessing.Manager().Lock()
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 57, in Manager
        m.start()
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/managers.py", line 562, in start
        self._process.start()
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/process.py", line 121, in start
        self._popen = self._Popen(self)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 284, in _Popen
        return Popen(process_obj)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
        super().__init__(process_obj)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
        self._launch(process_obj)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
        prep_data = spawn.get_preparation_data(process_obj._name)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
        _check_not_importing_main()
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
        raise RuntimeError('''
    RuntimeError:
            An attempt has been made to start a new process before the
            current process has finished its bootstrapping phase.

            This probably means that you are not using fork to start your
            child processes and you have forgotten to use the proper idiom
            in the main module:

                if __name__ == '__main__':
                    freeze_support()
                    ...

            The "freeze_support()" line can be omitted if the program
            is not going to be frozen to produce an executable.

    Traceback (most recent call last):
      File "/test/test_dict.py", line 8, in <module>
        dd = he.fit_transform(dataset)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/base.py", line 848, in fit_transform
        return self.fit(X, **fit_params).transform(X)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/utils.py", line 315, in fit
        X_transformed = self.transform(X, override_return_df=True)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
        data_to_wrap = f(self, X, *args, **kwargs)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/utils.py", line 488, in transform
        X = self._transform(X)
      File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/hashing.py", line 174, in _transform
        data_lock = multiprocessing.Manager().Lock()
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 57, in Manager
        m.start()
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/managers.py", line 566, in start
        self._address = reader.recv()
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/connection.py", line 255, in recv
        buf = self._recv_bytes()
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes
        buf = self._recv(4)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/connection.py", line 388, in _recv
        raise EOFError
    EOFError

bmreiniger commented 7 months ago

Thanks. Notice that the first traceback ends with a RuntimeError from multiprocessing; the EOFError comes at the end of a second, otherwise nearly identical traceback.
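
If the script is run as a plain module, wrapping the call in the main-module guard that the RuntimeError message describes may avoid the spawn bootstrapping failure. A minimal sketch of test_dict.py (file name taken from the traceback; the body mirrors the repro code above):

    import pandas as pd
    import category_encoders as ce

    def main():
        dataset = pd.read_csv('test_1.csv')
        he = ce.HashingEncoder(cols=['purchase_address'], n_components=2)
        dd = he.fit_transform(dataset)
        print(dd.columns)

    # Guard the entry point so that child processes started with 'spawn'
    # can re-import this module without re-running fit_transform.
    if __name__ == '__main__':
        main()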

You might try a newer version of this package: #428 updated the hashing encoder significantly.

The same error shows up on Stack Overflow, but I'm not sure how much it helps: https://stackoverflow.com/q/61931669/10495893

PaulWestenthanner commented 7 months ago

I got access to an old MacBook (still with an Intel chip) but could not reproduce the issue on that machine either (using a fresh conda installation). Can you try version 2.6.3, as Ben suggested, and see if that solves the issue?
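
For reference, the upgrade is just `pip install category_encoders==2.6.3` in the same virtualenv the script runs in. If the error persists after that, forcing single-process hashing should sidestep the `multiprocessing.Manager()` call seen in the traceback; a hedged sketch, assuming the `max_process` parameter of HashingEncoder behaves as documented:

    import category_encoders as ce

    # With max_process=1 the encoder hashes in the current process and never
    # starts a Manager, avoiding the spawn bootstrapping path entirely.
    # (Assumes the max_process parameter; check the docs for your version.)
    he = ce.HashingEncoder(cols=['purchase_address'], n_components=2,
                           max_process=1)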