scikit-learn-contrib / category_encoders

A library of sklearn-compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

RecursionError: maximum recursion depth exceeded while calling a Python object #371

Closed TobiasSackmannDacoso closed 1 year ago

TobiasSackmannDacoso commented 1 year ago

Expected Behavior

No errors during hashing.

Actual Behavior

The following error is thrown:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/category_encoders/hashing.py", line 162, in require_data
    self.require_data(self, data_lock, new_start, done_index, hashing_parts, cols=cols, process_index=process_index)
  File "/home/user/.local/lib/python3.8/site-packages/category_encoders/hashing.py", line 162, in require_data
    self.require_data(self, data_lock, new_start, done_index, hashing_parts, cols=cols, process_index=process_index)
  File "/home/user/.local/lib/python3.8/site-packages/category_encoders/hashing.py", line 162, in require_data
    self.require_data(self, data_lock, new_start, done_index, hashing_parts, cols=cols, process_index=process_index)
  [Previous line repeated 954 more times]
  File "/home/user/.local/lib/python3.8/site-packages/category_encoders/hashing.py", line 157, in require_data
    hashing_parts.put({part_index: data_part})
  File "<string>", line 2, in put
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 834, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
RecursionError: maximum recursion depth exceeded while calling a Python object
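The "[Previous line repeated 954 more times]" marker points at CPython's default recursion limit; a quick check (generic Python, nothing specific to category_encoders) shows the ceiling the worker process is hitting:

import sys

# CPython's default limit is 1000 stack frames; ~954 repeated require_data
# frames plus the multiprocessing bootstrap frames are enough to exhaust it.
print(sys.getrecursionlimit())  # -> 1000 on a stock interpreter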

Steps to Reproduce the Problem

  1. Load a dataset with 350 million rows
  2. Leave all parameters at their defaults; only set max_sample=2000 (also tried 200 and 10000)
  3. Try to encode the dataset: 6 out of 10 features have to be encoded (a reproduction sketch follows this list)
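A minimal sketch of that setup; the column names, value pool, and random seed are placeholders I've introduced (the reporter's schema isn't given), and n_rows should be scaled to what fits in memory. Only max_sample deviates from HashingEncoder's defaults:

import numpy as np
import pandas as pd
import category_encoders as ce

# The reporter used ~350 million rows; scale n_rows down to fit your machine.
n_rows = 1_000_000
rng = np.random.default_rng(0)

# Hypothetical schema: 10 features, 6 of them categorical and to be encoded.
df = pd.DataFrame({f'cat_{i}': rng.choice(['a', 'b', 'c'], size=n_rows)
                   for i in range(6)})
for i in range(4):
    df[f'num_{i}'] = rng.random(n_rows)

encoder = ce.HashingEncoder(cols=[f'cat_{i}' for i in range(6)], max_sample=2000)
encoded = encoder.fit_transform(df)  # RecursionError on affected versions at large n_rows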

Specifications

PaulWestenthanner commented 1 year ago

The problem is that we iterate recursively through the data in the multiprocessing hashing, so the maximum recursion depth is reached whenever there are more rows than max_recursion_depth * max_sample * n_processors. I've fixed this by using a while loop instead of recursion (see the sketch below).
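A schematic of that change, assuming the worker's job is to claim fixed-size chunks until the data is exhausted; this is not the actual hashing.py code, whose require_data additionally coordinates through locks and a shared result queue:

# Before: the worker claimed the next chunk by calling itself, so one
# stack frame piled up per chunk until the recursion limit was hit.
def require_data_recursive(start, end, chunk_size, process_chunk):
    if start >= end:
        return
    process_chunk(start, min(start + chunk_size, end))
    require_data_recursive(start + chunk_size, end, chunk_size, process_chunk)

# After: a while loop claims chunk after chunk at constant stack depth,
# so the dataset size no longer bounds how much one worker can process.
def require_data_loop(start, end, chunk_size, process_chunk):
    while start < end:
        process_chunk(start, min(start + chunk_size, end))
        start += chunk_size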