jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

JPMML model not having all the specified fields #60

Closed nejatb closed 7 years ago

nejatb commented 7 years ago

Hi. Sorry, I didn't find any other group to raise this in. I have written a PMMLPipeline as follows, but after the model is created, I only see field1 as an input field. All inputs include both fields. Do you have any idea what might be causing this issue? ...

rows_list = []
for i, text in enumerate(raw_data):
    value = text.split('\t')
    rows_list.append({'field1': value[0], 'field2': value[1]})

data = pd.DataFrame(rows_list,dtype=str)

# split the dataset in training and test set for cross validation
docs_train, docs_test, y_train, y_test = train_test_split(
    data, target, test_size=0.25, random_state=42)

pipeline = PMMLPipeline([

    ("mapper", DataFrameMapper([
                ("field1", TfidfVectorizer(stop_words = 'english',
                                            norm = None,
                                            tokenizer = Splitter(),
                                            ngram_range = (1,1),
                                            max_df = 0.9,
                                            min_df = 2
                                            )),
                ("field2",TfidfVectorizer(stop_words = 'english',
                                            norm = None,
                                            tokenizer = Splitter(),
                                            ngram_range = (1,1),
                                            max_df = 0.7,
                                            min_df = 5
                                        ))
                ],df_out=True) ),

    ("selector", SelectKBest(chi2,k=500)),
    # Use a SVC classifier on the combined features
    ("classifier", SVC(kernel='linear', tol=1e-3)),
])

try:
    pipeline.fit(docs_train, y_train)
    print("predicting on test")
    y_predicted = pipeline.predict(docs_test)

    # Print the classification report
    print("Classification report")
    print(metrics.classification_report(y_test, y_predicted))

    sklearn2pmml(pipeline,'model.pmml')

except Exception:
    # e.with_traceback() raises a TypeError when called without an argument;
    # traceback.print_exc() prints the current exception's traceback instead
    import traceback
    traceback.print_exc()

vruusmann commented 7 years ago

The fact that the PMML document "contains" fewer columns than the original Python data matrix is a feature, not a bug. The PMML conversion library keeps track of which features are actually used by the model, and eliminates all unused ones. This leads to smaller PMML files that perform better.

Isn't that what you want?

("selector", SelectKBest(chi2,k=500))

It reads: "Keep the best 500 features, and discard all others". If you compare (J)PMML and Scikit-Learn predictions, they will be exactly the same (absolute/relative difference of 1e-13 or better) - so there's no issue here.
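You can verify this directly in Scikit-Learn. A minimal sketch (using synthetic data, not the data from this issue): after fitting, `SelectKBest.get_support()` reports exactly which input features survive selection, and only those need to appear in the converted model.

```python
# Sketch with synthetic data: SelectKBest keeps k features and drops the rest;
# the converter only has to describe the surviving columns.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X = np.abs(X)  # chi2 requires non-negative feature values

selector = SelectKBest(chi2, k=5).fit(X, y)
kept = selector.get_support(indices=True)
print("features kept:", kept)                 # only these columns remain
print("features discarded:", X.shape[1] - len(kept))
```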

vruusmann commented 7 years ago

It's also worth pointing out that several model types may trigger further feature elimination. The prime examples are decision trees and their ensembles (e.g. random forests, gradient-boosted trees) - if a feature is not involved in any tree split, then it's also eliminated from the feature set (why compute a feature when it's not used by the prediction logic?).
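The same effect is easy to observe in Scikit-Learn itself. A sketch with synthetic data (not part of the original discussion): a shallow tree can only split on a handful of features, so most features end up with zero importance and contribute nothing to predictions.

```python
# Sketch with synthetic data: a depth-3 tree has at most 7 internal nodes,
# so at most 7 of the 50 features can ever appear in a split. The others
# could be dropped from a converted model without changing any prediction.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=1)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

used = int((tree.feature_importances_ > 0).sum())
print(f"features used in splits: {used} of {X.shape[1]}")
```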

The JPMML-Converter library, which underlies the JPMML-SkLearn library, inspects the (Lib)SVM data structure to identify no-op columns (pay attention to the featureMask local variable): https://github.com/jpmml/jpmml-converter/blob/master/src/main/java/org/jpmml/converter/support_vector_machine/LibSVMUtil.java#L119-L194

Therefore, even if you originally intended to retain 500 features, the final PMML file may "only" contain 350 to 400 of them - the "missing" features were eliminated because the (Lib)SVM data structure indicates that they have no discriminative power.
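The intuition behind that feature mask can be sketched in Python (this is an illustration with made-up data, not the JPMML-Converter code): a feature column that is zero across all support vectors contributes nothing to the kernel evaluation, so it can be dropped without affecting predictions.

```python
# Sketch with synthetic data: columns 3 and 7 are zero for every sample,
# hence zero in every support vector. Such a column adds nothing to the
# dot product K(x, sv) = x . sv, so a converter may safely eliminate it.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 10))
X[:, 3] = 0.0   # no-op column
X[:, 7] = 0.0   # no-op column
y = (X[:, 0] > 0.5).astype(int)

svc = SVC(kernel='linear', tol=1e-3).fit(X, y)
# a feature is "used" if any support vector has a non-zero value there
feature_mask = np.any(svc.support_vectors_ != 0, axis=0)
print("retained features:", int(feature_mask.sum()), "of", X.shape[1])
```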

vruusmann commented 7 years ago

See also https://github.com/jpmml/jpmml-sklearn/issues/49