jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Why is the PMML file so large? #10

Closed yuchenlin closed 8 years ago

yuchenlin commented 8 years ago

Here is my code:

from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml

X_train_1, y_train_1 = load_svmlight_file('test.txt')
clf = RandomForestClassifier(n_estimators=10, n_jobs=-1, class_weight="balanced")
clf = clf.fit(X_train_1, y_train_1)
sklearn2pmml(clf, None, "rfmodel_test.pmml", with_repr = False)


I thought the size of the output PMML file should be unrelated to the size of my training dataset (it should depend only on n_estimators and the number of features).

But in the end I found that the PMML file is very large (3.8 GB, which is only slightly smaller than my training dataset), and if I use a small dataset the PMML file becomes small.

I am so confused.

vruusmann commented 8 years ago

It is a "special property" of tree models (and their ensembles such as Random Forest models) that the size of the model object increases as the size/complexity of the dataset increases.

To solve the problem, you must change the parameterization of the RandomForestClassifier object so that the learning algorithm limits the complexity of the individual trees:

For example:

clf = RandomForestClassifier(n_estimators = 100, max_depth = 10, min_samples_split = 100, min_samples_leaf = 5)

I would advise you to increase the number of estimators (i.e., set the n_estimators parameter to a much greater value than 10) and focus on limiting the size of the individual estimator trees.

It's not a PMML problem per se. For example, if you simply serialized your RF models in Python's native pickle data format, you'd see exactly the same thing happen: a small dataset would give you a small pickle file, whereas a large dataset would give you a much bigger pickle file.
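
A minimal sketch that demonstrates this with pickle (the synthetic dataset and file names here are placeholders, not code from this thread): train an unconstrained forest and a depth-limited forest on the same data, and compare the pickle file sizes.

import os
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder synthetic dataset; any sufficiently large dataset shows the effect
X, y = make_classification(n_samples = 10000, n_features = 20, random_state = 13)

for max_depth in [None, 10]:
    # Unconstrained trees (max_depth = None) grow until the leaves are pure, so the
    # model object scales with the dataset; max_depth = 10 caps the tree complexity
    clf = RandomForestClassifier(n_estimators = 100, max_depth = max_depth)
    clf.fit(X, y)
    path = "rf_depth_{}.pkl".format(max_depth)
    with open(path, "wb") as f:
        pickle.dump(clf, f)
    print(path, os.path.getsize(path), "bytes")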

raosudhir commented 6 years ago

Hello! I am running into the same issue, if I can call it that: too large a PMML file! Here is my code, with the pipeline and classifier params:

rfcf_pipeline = PMMLPipeline([("classifier", rfcf)])
rfcf_pipeline.fit(x_train, y_train)
rfcf.compact = True;
sklearn2pmml(rfcf_pipeline, "pmml/RandomForestClassifierPipeline.pmml")


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

My training dataset has 557K records, and the generated PMML file is 2.7 GB in size. I currently have 20 features and haven't finalized them yet; likely I'll remove some and add others.

I think the basic issue causing the PMML file size to bloat is the depth of my trees.

The predicted results are good, and I wouldn't want to lower their quality (who would? :~))

Any recommendations on which parameters to tinker with? I was hoping that setting "classifier.compact = True" would help reduce the size of the generated PMML file, but it didn't.

vruusmann commented 6 years ago

rfcf.compact = True;

That's an outdated "option configuration" syntax. You should be using PMMLPipeline.configure(**pmml_options) now.

The compact = True option, when actually applied, should decrease the size of the PMML file by around 50%. Export the same pipeline first with compact = False and then with compact = True, and compare the two files; see the sketch below.
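
A quick way to run that comparison, assuming the rfcf_pipeline object from the previous comment (the output file names are made up):

# configure() attaches PMML conversion options to the pipeline's final estimator
rfcf_pipeline.configure(compact = False)
sklearn2pmml(rfcf_pipeline, "pmml/RandomForestClassifierPipeline-verbose.pmml")

rfcf_pipeline.configure(compact = True)
sklearn2pmml(rfcf_pipeline, "pmml/RandomForestClassifierPipeline-compact.pmml")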

Also, please note that this 2.7 GB PMML file probably contains 2 GB of "XML markup" and 0.7 GB of whitespace (in the form of tab characters). You may safely strip the latter.
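
A simple line-by-line pass is one way to do the stripping (file names here are placeholders); the indentation added by the pretty-printer sits between XML elements, so dropping it should be safe, and streaming keeps memory usage flat even for multi-gigabyte files.

# Placeholder file names; reads and writes one line at a time
with open("model.pmml") as fin, open("model-stripped.pmml", "w") as fout:
    for line in fin:
        # Drop only the leading tab characters inserted by the pretty-printer
        fout.write(line.lstrip("\t"))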

Anyway, the size of the PMML file in the local filesystem is rather irrelevant. What really needs to be optimized is RAM consumption when the model is parsed/deployed in the production system.