jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0
892 stars 255 forks source link

Impose soft limit on the maximum number of input fields #44

Closed vruusmann closed 5 years ago

vruusmann commented 7 years ago

People are working with models that specify tens to hundreds of THOUSANDS input fields:

Evaluator evaluator = ...;
List<InputField> inputFields = evaluator.getInputFields();
System.out.println(inputFields.size()); // Prints 100'000

For example: http://stats.stackexchange.com/questions/152891/bad-performance-of-pmml-evaluator and http://stackoverflow.com/questions/42074491/evaluate-method-takes-long-time-pmml-models-using-jpmml

Understandably, such "structurally valid but conceptually/functionally invalid" models cannot be made to perform, not by the JPMML-Evaluator library, or any other PMML scoring engine.

By default, the JPMML-Evaluator library should simply refuse to deal with them:

if(inputFields.size() > 1000){
  throw new EvaluationException("The model specifies unreasonably large number of input fields, which is indicative of bad data science/engineering process. Please refactor the model");
}

However, the limit should be programmatically customizable. If people want to do stupid things, then they should have technical means to do so.

a1singh commented 5 years ago

Short form: Can we please parametrize the max number of input fields, so users can change it ?

Long form: We have a legacy model, whose only remaining representation is in PMML format. Due to the input fields number limit of 1000, there is no way to score this legacy model. If parametrization is too time taking, I would suggest a more lenient upper limit (e.g. 5000)

The limit was introduced here: 060082c8a2af855e4ebaae873bbd5ea63cb6d3a2

vruusmann commented 5 years ago

@a1singh See https://stackoverflow.com/questions/54952581/how-to-avoid-the-max-amount-of-input-fields-for-jpmml

If anything, then this limit should be lower (around 100) in order to promote intelligent model design.

a1singh commented 5 years ago

@vruusmann Thanks for sharing. Agreed on the conceptual need for fewer features. Smallest feature set that gives good performance is what we should strive for.

The problem is legacy models already in production, especially those developed by other teams. Due to this hard coded limit, people might have to redo weeks, if not months, worth of work, for a model that was deployed by someone else.

Can we introduce a warning instead, "You are using a lot of features. Don't complain if the model evaluation is slow. Consider reducing the number of features."

vruusmann commented 5 years ago

Can we introduce a warning instead

Warnings don't work. It takes a big fat exception to drive the message home.

Anyway, this schema check is a ModelEvaluatorBuilder-level functionality. The returned ModelEvaluator instance doesn't implement any limits.

a1singh commented 5 years ago

Thanks Villu for the SO ticket. We were able to override the method, and make JPMML backwards compatible to legacy models.