jpmml / jpmml-evaluator-python

PMML evaluator library for Python
GNU Affero General Public License v3.0

The batch prediction mode should support row-level exception handling #26

Open vruusmann opened 1 month ago

vruusmann commented 1 month ago

See https://github.com/jpmml/jpmml-evaluator/issues/271#issuecomment-2290804790 and https://github.com/jpmml/jpmml-evaluator/issues/271#issuecomment-2291134614

brother-darion commented 4 weeks ago

In the batch prediction scenario, I think it would be better to have an option to choose between raising an exception, or setting that record's result to NaN and continuing to predict the other records.

vruusmann commented 4 weeks ago

The Java interface o.j.e.Evaluator only supports a single-row prediction mode, via the Evaluator#evaluate(Map) method.

The Python interface builds its batch prediction mode jpmml_evaluator.Evaluator.evaluateAll(DataFrame) on top of it. The main benefit of the batch interface is that all rows cross the Python/Java boundary in a single call (instead of many calls, one call per row).
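For context, this is roughly how the two prediction modes look from Python application code (a minimal sketch; the model file name and column names are placeholders, and construction details vary by library version):

```python
from jpmml_evaluator import make_evaluator

import pandas

evaluator = make_evaluator("DecisionTree.pmml")

# Single-row mode: one Python-to-Java call per data record
results = evaluator.evaluate({"x1" : 1.0, "x2" : 2.0})

# Batch mode: all rows cross the Python/Java boundary in a single call
arguments_df = pandas.read_csv("input.csv")
results_df = evaluator.evaluateAll(arguments_df)
```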

Now, this is actually a good idea: the JPMML-Evaluator-Python library should provide an option for configuring what to do about an EvaluationException.

I can quickly think of two options:

  1. "return invalid" aka "as-is". Matches the current behaviour, where the Java exception is propagated to the top, and the evaluation is stopped at that location.
  2. "replace with NaN" aka "ignore". The Java component will catch a row-specific exception, and replaces the result for that row with Double#NaN (or some other user-specified constant?).

Also, in "return invalid" aka "as-is" mode, it should be possible to configure whether partial results can be returned or not. Suppose there is a batch of 10'000 rows, and the evaluation fails on row 8566 because of some data input error. I think it might make sense to return the leading 8565 results in that case.
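A minimal sketch of how such an option might look from Python application code. The error_policy parameter and its values are assumptions for illustration, not part of the current API:

```python
from jpmml_evaluator import make_evaluator

import pandas

evaluator = make_evaluator("DecisionTree.pmml")  # placeholder model file

arguments_df = pandas.read_csv("input.csv")

# Hypothetical `error_policy` parameter:
# "as-is" - propagate the Java EvaluationException (current behaviour)
# "ignore" - replace the failing row's results with NaN, and keep going
results_df = evaluator.evaluateAll(arguments_df, error_policy = "ignore")
```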

brother-darion commented 4 weeks ago

Right, those are really friendly options. And these two options would be added alongside the current behaviour, which is to just throw the exception, right? Like you said, clear feedback is important, and these options are important for that too.

Also, I was thinking that the "replace with NaN" mode needs a threshold (a specified number of failed rows) at which to stop the evaluation, because in scenarios where people are using the wrong data altogether, it would be a little annoying to still evaluate everything.

What do you think?
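Continuing the hypothetical sketch from above, such a threshold could surface as one more parameter (max_errors is likewise an assumption, not an existing API):

```python
# Give up after 100 failed rows, instead of NaN-ing through
# a batch whose input data is wrong altogether
results_df = evaluator.evaluateAll(arguments_df, error_policy = "ignore", max_errors = 100)
```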

vruusmann commented 3 weeks ago

There is a third option - "omit row" aka "drop". If there are evaluation errors, then the corresponding rows are simply omitted from the results batch.

The "omit row" option assumes that the user has assigned custom identifiers to the rows of the arguments batch. So, if there are 156 argument rows, and only 144 result rows (meaning that 12 rows errored out), then the user can locally identify "successful" vs "failed" rows in her application code.

See https://github.com/jpmml/jpmml-evaluator-python/issues/23 about row identifiers.
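A minimal sketch of what the "omit row" aka "drop" mode might look like in application code, assuming the hypothetical error_policy parameter from above, and assuming that the results dataframe keeps the index values of the rows that evaluated successfully:

```python
from jpmml_evaluator import make_evaluator

import pandas

evaluator = make_evaluator("DecisionTree.pmml")  # placeholder model file

# Custom row identifiers, carried in the dataframe index
arguments_df = pandas.read_csv("input.csv", index_col = "record_id")

results_df = evaluator.evaluateAll(arguments_df, error_policy = "drop")

# Failed rows are the ones whose identifiers are missing from the results
failed_ids = arguments_df.index.difference(results_df.index)
print("Sent {} rows, {} failed: {}".format(len(arguments_df), len(failed_ids), list(failed_ids)))
```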

vruusmann commented 3 weeks ago

As a general comment - my "design assumption" behind the Evaluator.evaluateAll(X) method is that the size of the arguments dataframe is about/up to 10'000 cells (e.g. a dataframe of 10 features x 1000 rows).

My thinking is that the data is being moved between the Python and Java environments using the Pickle protocol. If the pickle payload gets really big (say, 1'000'000 cells instead of 10'000 cells), then the Java component responsible for loading/dumping it might start hitting unexpected memory/processing limitations.

If the dataset is much bigger than 10'000 cells, then it should be partitioned into multiple chunks in Python application code. And the chunking algorithm should be prepared to handle the "omit row" option gracefully.
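A minimal sketch of such application-level chunking, assuming that evaluateAll returns a dataframe that is indexed like its input (the evaluate_chunked helper and its parameters are illustrative, not part of the library):

```python
import pandas

def evaluate_chunked(evaluator, arguments_df, max_cells = 10000):
    # Translate the ~10'000 cell budget into a row count
    chunk_rows = max(1, max_cells // len(arguments_df.columns))
    result_dfs = []
    for start in range(0, len(arguments_df), chunk_rows):
        chunk_df = arguments_df.iloc[start : start + chunk_rows]
        # Under an "omit row"-style policy, a chunk may come back with
        # fewer rows than it was sent with; keeping the original index
        # lets the caller tell surviving rows from dropped ones
        result_dfs.append(evaluator.evaluateAll(chunk_df))
    return pandas.concat(result_dfs)
```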

vruusmann commented 3 weeks ago

my "design assumption" behind the Evaluator.evaluateAll(X) method is that the size of the arguments dataframe is about/up to 10'000 cells

The Evaluator.evaluateAll(X) method should have an extra parameter for controlling the batch size. The default value would match my design assumption (about 10'000 cells), but the end user could increase or decrease it if needed.

This way, the chunking logic would be nicely available at the JPMML-Evaluator-Python library level, leaving the actual Python application code clean.
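In application code, that could look as follows (the batch_size parameter name and its cell-based unit are assumptions):

```python
# Hypothetical `batch_size` parameter, expressed in cells per
# Python-to-Java call; the library would chunk the dataframe internally
results_df = evaluator.evaluateAll(arguments_df, batch_size = 10000)
```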