tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

multiprocessing problem #47

Closed tsachiblauamat closed 3 years ago

tsachiblauamat commented 3 years ago

When running a few RFs with multiprocessing (in parallel), it works. But when running a few RFs with multiprocessing after an RF, it gets stuck. I'm running multiprocessing with the multiprocessing.Pool class, via:

pool = multiprocessing.Pool()
pool.map(func, input)

in func I'm running tensorflow-RF

Any idea why this is happening?

Thanks, Tsachi

arvnds commented 3 years ago

Hi tsachiblauamat, it's hard to tell without more details, but I would not be surprised if you are dealing with thread contention. The RF training is already multithreaded, and you can adjust the number of threads in the constructor. See https://github.com/tensorflow/decision-forests/issues/39#issuecomment-882519315 for more details.

Let me know if that doesn't answer your question!

janpfeifer commented 3 years ago

Btw tsachiblauamat, what are you trying to run in "func"? The training or the evaluation of a TensorFlow RF?

We've never tried what you are trying, but I know the underlying inference engine can run in a multi-thread system -- the TF Serving engine will issue parallel calls to the inference engine, and it just works, as far as we know.

But I'm not very familiar with the multiprocessing library. Reading through it, on Linux it uses the fork(2) system call by default, and the documentation notes, as I would expect: "Note that safely forking a multithreaded process is problematic." I wonder how this plays out with TensorFlow's subsystems...

Care to share more details?
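The fork-after-threads hazard above can be sidestepped by asking multiprocessing for the "spawn" start method, so each child gets a fresh interpreter instead of inheriting the parent's (possibly multithreaded) state. A minimal stdlib-only sketch, where `train_one` is a hypothetical stand-in for the user's TF-DF training function:

```python
# Sketch: use the "spawn" start method instead of the Linux default, fork(2),
# so child processes start from a clean interpreter. `train_one` is a
# placeholder for "train a model and return a metric".
import multiprocessing as mp

def train_one(model_id):
    # Placeholder for the real training/prediction work.
    return model_id * model_id

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # avoid forking a multithreaded parent
    with ctx.Pool(4) as pool:
        print(pool.map(train_one, range(4)))  # -> [0, 1, 4, 9]
```

Note that "spawn" requires the worker function to be importable (defined at module top level), and is slower to start than fork, since each child re-imports the parent module.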

tsachiblauamat commented 3 years ago

func just trains and predicts the output. When using multiprocessing it is not stable.

Sometimes it crashes, and sometimes I get this error:

Traceback (most recent call last):
  File "python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "python3.8/multiprocessing/pool.py", line 712, in _terminate_pool
    if p.exitcode is None:
  File "python3.8/multiprocessing/process.py", line 232, in exitcode
    return self._popen.poll()
  File "python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

And sometimes it's fine.

janpfeifer commented 3 years ago

Odd... I'm not familiar with how the multiprocessing library works, but it's likely an interaction between it and how TF works.

But I'd suggest an alternative: run the training and evaluation as completely separate Python programs, and have a "controller" program start them. Serialize results to disk, and read them back from the controller program -- easier than dealing with pipes/signals (which seems to be where the multiprocessing library is failing).
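The controller pattern above can be sketched with only the standard library. Here `worker.py` is a hypothetical script that takes a model id and an output path, trains one model, and writes its result as JSON; the controller launches one fresh interpreter per job and collects the files afterwards:

```python
# Sketch of the "controller program" pattern: each job runs in a fully
# separate Python process (no fork of a multithreaded parent), writes its
# result to disk as JSON, and the controller reads the results back.
# `worker.py` is a hypothetical training script, not part of TF-DF.
import json
import subprocess
import sys
import tempfile
from pathlib import Path

def run_jobs(n_jobs, worker_script="worker.py"):
    out_dir = Path(tempfile.mkdtemp())
    procs = []
    for i in range(n_jobs):
        out_file = out_dir / f"result_{i}.json"
        # Fresh interpreter per job; results go through the filesystem.
        procs.append(subprocess.Popen(
            [sys.executable, worker_script, str(i), str(out_file)]))
    for p in procs:
        p.wait()  # in real code, also check p.returncode
    return [json.loads((out_dir / f"result_{i}.json").read_text())
            for i in range(n_jobs)]
```

Because each job is an independent OS process, a crash in one training run cannot take down the controller or the other jobs.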

achoum commented 3 years ago

Hi,

Note: As Jan mentioned, TF-DF supports multi-threaded training. If you have a single large dataset, this is the best approach.

If you want to train multiple small models in parallel, you should be able to train different models in parallel in different threads / processes.

multiprocessing.Pool

Multi-processing in Python can be tricky. For simplicity, and if possible, I would use multi-threading instead.

Here is a working example that trains and runs 5 small models in parallel:

!pip install tensorflow_decision_forests -U -q

import tensorflow_decision_forests as tfdf
from multiprocessing.pool import ThreadPool
import numpy as np

print(tfdf.__version__)

def train_model(model_id):
  # Synthetic dataset: the label is whether the single feature exceeds 0.5.
  x_train = np.random.uniform(size=(50, 1))
  y_train = x_train[:, 0] >= 0.5
  model = tfdf.keras.GradientBoostedTreesModel(num_trees=10)
  model.fit(x=x_train, y=y_train)
  # Return the mean prediction on the training data as a sanity check.
  return np.mean(model.predict(x_train))

# Train 5 models, and print the mean predicted value on the training dataset.
pool = ThreadPool(5)
print(pool.map(train_model, range(5)))
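When several models share one machine, the pool size can be budgeted against the core count so that concurrent trainings do not oversubscribe the CPU. A stdlib-only sketch; the per-model thread budget is an assumption based on the `num_threads` constructor argument discussed in issue #39:

```python
# Sketch: size the thread pool so that (pool size) x (threads per model)
# stays within the machine's core count. Pass `threads_per_model` as
# num_threads=... to each model's constructor (assumed from issue #39).
import os
from multiprocessing.pool import ThreadPool

cores = os.cpu_count() or 1
threads_per_model = 2
pool_size = max(1, cores // threads_per_model)

pool = ThreadPool(pool_size)
# Stand-in for train_model; a ThreadPool can map plain lambdas
# because its workers are threads, not pickled processes.
results = pool.map(lambda i: i + 1, range(pool_size))
```

With this budget, each of the `pool_size` concurrent trainings gets roughly `threads_per_model` cores to itself instead of all trainings competing for every core.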