SparkML random forest classifier test not working.

sanatanSharma commented 5 years ago

I was trying to run the test against my local Spark installation, but the code does not work. I've pasted the exact code I ran below; it fails at the last line, compare_results(expected, output, decimal=5). Almost all of the code below is copy-pasted from the actual test here.

import sys
import inspect
import unittest
import os
from distutils.version import StrictVersion

import onnx
import pandas
import numpy
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import VectorUDT, SparseVector

from onnxmltools import convert_sparkml
from onnxmltools.convert.common.data_types import StringTensorType, FloatTensorType
from tests.sparkml.sparkml_test_utils import save_data_models, run_onnx_model, compare_results
from tests.sparkml import SparkMlTestCase
from pyspark.ml.feature import StringIndexer, VectorIndexer

sc = SparkContext()
spark = SparkSession(sc)

original_data = spark.read.format("libsvm").load("/Users/sanashar/sample.txt")

feature_count = 5
spark.udf.register("truncateFeatures",
                   lambda x: SparseVector(feature_count, range(0, feature_count), x.toArray()[125:130]),
                   VectorUDT())
data = original_data.selectExpr("cast(label as string) as label", "truncateFeatures(features) as features")
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                                maxCategories=10, handleInvalid='keep')

rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
pipeline = Pipeline(stages=[label_indexer, feature_indexer, rf])
model = pipeline.fit(data)
model_onnx = convert_sparkml(model, 'Sparkml RandomForest Classifier', [
    ('label', StringTensorType([1, 1])),
    ('features', FloatTensorType([1, feature_count]))
], spark_session=spark)

predicted = model.transform(data)
data_np = {
    'label': data.toPandas().label.values,
    'features': data.toPandas().features.apply(lambda x: pandas.Series(x.toArray())).values.astype(numpy.float32)
}
expected = [
    predicted.toPandas().indexedLabel.values.astype(numpy.int64),
    predicted.toPandas().prediction.values.astype(numpy.float32),
    predicted.toPandas().probability.apply(lambda x: pandas.Series(x.toArray())).values.astype(numpy.float32)
]
paths = save_data_models(data_np, expected, model, model_onnx,
                         basename="SparkmlRandomForestClassifier")
onnx_model_path = paths[3]
output, output_shapes = run_onnx_model(['indexedLabel', 'prediction', 'probability'], data_np, onnx_model_path)

compare_results(expected, output, decimal=5)

Since this was not working, I wrote a one-line check to compare the predictions myself, output[1] == expected[1], which showed that the expected values and the outputs obtained through onnxruntime don't match. Also, my kernel sometimes dies at the run_onnx_model call, which is odd too.

I'm not sure what's going on here and any help would be appreciated.
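
For anyone trying to narrow this down: an element-wise == on float arrays only says that something differs, not how many entries or by how much. Below is a minimal sketch of a more informative check. The helper name summarize_mismatch and the sample arrays are made up for illustration, and the tolerance simply mirrors the rule documented for numpy.testing.assert_almost_equal (absolute difference below 1.5 * 10**(-decimal)); it is not taken from the test utilities themselves.

```python
import numpy


def summarize_mismatch(expected, output, decimal=5):
    """Return (number of mismatching entries, largest absolute difference).

    Hypothetical helper, not part of onnxmltools. Both inputs are flattened
    first, so an (N, 1) array can be compared against an (N,) array.
    """
    expected = numpy.asarray(expected, dtype=numpy.float64).ravel()
    output = numpy.asarray(output, dtype=numpy.float64).ravel()
    if expected.shape != output.shape:
        raise ValueError(
            "size mismatch: %d expected values vs %d outputs"
            % (expected.size, output.size)
        )
    if expected.size == 0:
        return 0, 0.0
    diff = numpy.abs(expected - output)
    # Same threshold rule as numpy.testing.assert_almost_equal.
    tolerance = 1.5 * 10 ** (-decimal)
    return int((diff >= tolerance).sum()), float(diff.max())


# Placeholder data standing in for expected[1] and output[1].
sample_expected = numpy.array([0.0, 1.0, 1.0], dtype=numpy.float32)
sample_output = numpy.array([[0.0], [1.0], [0.0]], dtype=numpy.float32)
mismatches, largest = summarize_mismatch(sample_expected, sample_output)
print("mismatching entries:", mismatches, "largest difference:", largest)
```

Running this on each pair, for example summarize_mismatch(expected[1], output[1]), would show whether only a few rows disagree or the whole column is off.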

xadupre commented 5 years ago

I can't replicate the issue. Do you get the same error with the unit test? Which versions of pyspark, onnxruntime, onnx, and onnxmltools are you using?
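
One way to collect those versions in a single run is sketched below. The helper name report_versions is hypothetical; it relies only on each package exposing a __version__ attribute, and it reports a package as "not installed" if the import fails.

```python
import importlib
import platform


def report_versions(package_names):
    """Map each package name to its __version__, or to a short status string.

    Hypothetical helper for bug reports; it imports each package, so it can
    take a few seconds for large ones.
    """
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


print("python", platform.python_version())
for package, version in report_versions(
    ["pyspark", "onnxruntime", "onnx", "onnxmltools"]
).items():
    print(package, version)
```

Pasting that output into the issue makes it easier to match the environment when trying to reproduce the failure.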

sanatanSharma commented 5 years ago

I'm using Spark 2.4.3, onnxruntime 0.5.0, onnxmltools 1.5.0, and onnx 1.5.0. I didn't run the actual test; I just copied the code from it to my local machine and downloaded the required file, "sample.txt".

sanatanSharma commented 5 years ago

I'm also using Python 3.7.3.
