jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

add a lookup table into the PMML #14

Closed jibybabu closed 6 years ago

jibybabu commented 8 years ago

Hi,

I was wondering if there is any way I can embed a lookup table into the PMML file using sklearn2pmml. For example, my model has 40 features. Of those 40, I fill 10 features from a CSV file of shape 100000 x 10. I would like to attach this information to the PMML file so that the model takes the values from there when it runs.

Thanks, Jiby

vruusmann commented 8 years ago

A lookup table (between two categorical value spaces) is represented in PMML using the MapValues element. The MapValues transformation can work with external data sources (such as CSV files or SQL databases), but most of the time it's easier if the data is inlined. The choice between those two options (ie. external source vs. inline source) depends on how frequently and to what extent this lookup table needs to be updated, and by whom.

In your case, if those ten features are independent of one another, then you should define ten independent MapValues transformations (each with one input column and one output column). Otherwise, you should define one big MapValues transformation.

The fact that you're dealing with a 100'000-element lookup table is a bit concerning, because it would probably require some special tweaking/configuration at the PMML consumer side (such as the JPMML-Evaluator library) in order to ensure the best performance.

As for the conversion from the Scikit-Learn representation to the PMML representation, it's a low-level technicality. What Scikit-Learn classes do you use for that in your application code? Can you provide a working example using some toy dataset?

jibybabu commented 8 years ago

Hi Villu, thanks for the quick turnaround! Excited to see your quick response!

So here is my story: I am building a model to solve a classification problem and need to convert it into JPMML for deployment. One of the model's features is a string, say the name of a person, and my model looks up 10 additional numerical (decimal) features based on that name. I keep this map in a CSV file (Name, Col1, Col2, Col3, ..., Col10) with rows such as ("ABC", 1.534346, 5.1232343, ...). So whenever an observation comes in for prediction, the model needs to look up the respective 10 numerical values and use them for scoring.

Right now, I haven't used any scikit-learn transformations to map this; I am just populating the column values in Python by reading the lookup.csv file. Deploying this, i.e. converting it into JPMML, looks like a challenge now. I am ready to change the implementation.
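For reference, the manual lookup described above might look roughly like the following sketch. The file contents, the column names (Name, Col1, Col2) and the default value used for unknown names are illustrative assumptions, not taken from the actual application.

```python
import csv
import io

# Stand-in for lookup.csv (assumption: header row present; real file has 10 value columns)
LOOKUP_CSV = """Name,Col1,Col2
ABC,1.534346,5.1232343
XYZ,0.25,7.5
"""

def load_lookup(handle, key="Name"):
    """Read the CSV into {name: {column: float}}."""
    reader = csv.DictReader(handle)
    return {
        row[key]: {col: float(val) for col, val in row.items() if col != key}
        for row in reader
    }

def add_lookup_features(record, lookup, columns, default=0.0):
    """Return a copy of `record` extended with the looked-up columns.

    Names missing from the lookup table get `default` for every column.
    """
    values = lookup.get(record["Name"], {})
    enriched = dict(record)
    for col in columns:
        enriched[col] = values.get(col, default)
    return enriched

lookup = load_lookup(io.StringIO(LOOKUP_CSV))
known = add_lookup_features({"Name": "ABC"}, lookup, ["Col1", "Col2"])
unknown = add_lookup_features({"Name": "NEW"}, lookup, ["Col1", "Col2"])
print(known)
print(unknown)
```

Because this step runs outside the pipeline, nothing about it reaches the PMML file, which is why the deployment side has no way to reproduce it.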

Also, the lookup table doesn't need to be updated frequently. That is why I would like to embed everything into one file.

Any suggestions or advice are much appreciated!

Thanks, Jiby

vruusmann commented 8 years ago

The MapValues transformation is designed for mapping many input values to one output value. You're interested in achieving exactly the opposite - mapping one input value (eg. entity's identifier) to many output values (eg. entity's descriptors).

You could achieve this by defining many MapValues transformations - one for each output dimension. This could work if the size of the input value set is relatively small and fixed.
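As a rough illustration of "one MapValues per output dimension", the sketch below turns a small CSV into one DerivedField/MapValues element per value column using only the Python standard library. The CSV contents, the function name `map_values_per_column` and the table column names `input`/`output` are assumptions made for this example; it is not how sklearn2pmml generates PMML.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical lookup data; the real table would have 10 value columns
LOOKUP_CSV = """Name,Col1,Col2
ABC,1.534346,5.1232343
XYZ,0.25,7.5
"""

def map_values_per_column(handle, key):
    """Build one DerivedField/MapValues element for each non-key CSV column."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    outputs = [col for col in reader.fieldnames if col != key]
    fields = []
    for col in outputs:
        derived = ET.Element("DerivedField", name=col, optype="continuous", dataType="double")
        mapping = ET.SubElement(derived, "MapValues", outputColumn="output")
        # The input field `key` is matched against the table column "input"
        ET.SubElement(mapping, "FieldColumnPair", field=key, column="input")
        table = ET.SubElement(mapping, "InlineTable")
        for row in rows:
            entry = ET.SubElement(table, "row")
            ET.SubElement(entry, "input").text = row[key]
            ET.SubElement(entry, "output").text = row[col]
        fields.append(derived)
    return fields

fields = map_values_per_column(io.StringIO(LOOKUP_CSV), key="Name")
for field in fields:
    print(ET.tostring(field, encoding="unicode"))
```

Note that every output column repeats the full set of keys, so the inlined document grows with (number of keys) x (number of output columns).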

Is the "input value set is fixed" requirement met in your use case? In other words, are you going to make predictions only about persons that were known at the time when the model was trained (and converted to PMML)? Aren't you interested in making predictions about new persons?

Otherwise, I'd recommend using a layered approach, where the lookup table functionality is moved outside of the PMML document, to a specialized service. The latter could be some REST web service, or an SQL handler. You would be performing this call using a custom Java user-defined function (UDF):

<Apply function="com.mycompany.rest.PersonService">
  <FieldRef name="Id"/> <!-- Entity identifier -->
  <Constant>Col1, Col2, Col3, .., Col10</Constant> <!-- Entity attributes to select -->
</Apply>

jibybabu commented 8 years ago

Hi Villu,

Thanks a lot for the quick response!

Is the "input value set is fixed" requirement met in your use case? Aren't you interested in making predictions about new persons?

Yes. As in this example, when a new person comes in the future, I need to look into the lookup table and get the respective 10 values. I am taking care of the scenarios where the person's name is not in the lookup table.

I was thinking of a workaround; let me know your thoughts on it. How about creating a customized transformation function, something like http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, for each of the 10 mappings, putting those into an sklearn pipeline and converting the pipeline into a DataFrameMapper?

Btw, do you have any example of converting an sklearn pipeline to a DataFrameMapper using sklearn2pmml?

Thanks, Jiby

vruusmann commented 8 years ago

How about creating a customized transformation function, something like http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, for each of the 10 mappings, putting those into an sklearn pipeline and converting the pipeline into a DataFrameMapper?

That would be an unnecessarily complex solution. It's possible to define a new Scikit-Learn transformer class and map it directly to any number of columns:

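from sklearn_pandas import DataFrameMapper

# CSVLookupTable is a proposed class; it is not an existing part of sklearn2pmml
person_mapper = DataFrameMapper([
  (["Gender", "Age", "Height"], CSVLookupTable("persons.csv", "ID"))
])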

Class CSVLookupTable would be defined in the sklearn2pmml.preprocessing module. It takes the name of the CSV file on the filesystem and the name of the primary key column as arguments.

Do you have any example of converting sklearn pipeline to dataframeMapper using sklearn2pmml?

The conversion of Scikit-Learn pipelines is discussed in a separate issue: https://github.com/jpmml/jpmml-sklearn/issues/3

jibybabu commented 8 years ago

Thanks Villu, that's a great approach! Let me try it out.

Btw, just to make sure about passing the "ID": the "ID" should be the primary key column name of the lookup table rather than the respective data set column name, correct? And this will make sure that we are taking care of scenarios outside of the data set, right?

Thanks, Jiby

vruusmann commented 8 years ago

You can't try it out, because class sklearn2pmml.preprocessing.CSVLookupTable is fictional.

But the solution to your problem could be implemented like this. If you can implement the Python side of this class, then I can do the rest.

jibybabu commented 8 years ago

I implemented that in Python just now. But my question is: do the variables in the mapper and the estimator need to be in sync when calling sklearn2pmml? I am not using the input variable (for example, the name) in the model, only the output variables (for example, Gender, Age, Height). If the input variable is not present in the JPMML, how can Java map it? Also, won't I be unable to handle new values of the input variable (say, names) that are not in the dataset but are in the lookup table?

And it's returning an error like this:

SEVERE: Failed to convert Estimator
java.lang.IllegalArgumentException
    at org.jpmml.sklearn.FeatureMapper.updateActiveFields(FeatureMapper.java:236)
    at sklearn.Classifier.createSchema(Classifier.java:59)
    at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
    at org.jpmml.sklearn.Main.run(Main.java:189)
    at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" java.lang.IllegalArgumentException
    at org.jpmml.sklearn.FeatureMapper.updateActiveFields(FeatureMapper.java:236)
    at sklearn.Classifier.createSchema(Classifier.java:59)
    at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
    at org.jpmml.sklearn.Main.run(Main.java:189)
    at org.jpmml.sklearn.Main.main(Main.java:107)
Traceback (most recent call last):
  File "/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/IPython/core/interactiveshell.py", line 2869, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-48-dcba1c38aa0e>", line 1, in <module>
    sklearn2pmml(rf, jobs_mapper, "sample.pmml", with_repr = False)
  File "/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.py", line 65, in sklearn2pmml
    subprocess.check_call(cmd)
  File "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['java', '-cp', '/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/guava-19.0.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-converter-1.1.1.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.1.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.1.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-agent-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-schema-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pyrolite-4.13.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/serpent-1.12.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/var/folders/7r/mhw1rgdd729cw3rvrm2mtcj40000gn/T/estimator-3qlFmc.pkl.z', '--pkl-mapper-input', '/var/folders/7r/mhw1rgdd729cw3rvrm2mtcj40000gn/T/mapper-iKqPdf.pkl.z', '--pmml-output', 'sample.pmml']' returned 
non-zero exit status 1

vruusmann commented 8 years ago

A typical workflow for implementing a custom transformer:

  1. Create Python class. This should load a CSV file, and insert three columns (Gender, Age, Height) to the resulting Python data matrix that the estimator can see and use.
  2. Create Java class. This should load the same CSV file, and generate three MapValues transformations. You can communicate the name of the future input column by defining an appropriate "helper attribute" in the above Python class.
  3. Update the sklearn2pmml package to include both Python and Java classes.
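Step 1 of the workflow above could be sketched as follows. This is only a guess at the Python side: the class name and the two constructor arguments (CSV source, primary key column) come from this thread, while the `default` argument, the `columns_` and `table_` attributes and the list-based return value are assumptions made for the example. To keep it free of dependencies the class is duck-typed (`fit`/`transform`); a real version would presumably extend scikit-learn's `BaseEstimator` and `TransformerMixin`.

```python
import csv
import io

class CSVLookupTable:
    """Sketch of the fictional CSVLookupTable transformer (Python side only).

    Maps one identifier column to several numeric columns read from a CSV source.
    """

    def __init__(self, source, key, default=float("nan")):
        self.source = source    # file path, or an open text handle (assumption)
        self.key = key          # primary key column of the CSV
        self.default = default  # value used for identifiers missing from the CSV

    def fit(self, X, y=None):
        is_path = isinstance(self.source, str)
        handle = open(self.source, newline="") if is_path else self.source
        try:
            reader = csv.DictReader(handle)
            rows = list(reader)
            # Exposed so that a converter could read the output column names
            self.columns_ = [c for c in reader.fieldnames if c != self.key]
        finally:
            if is_path:
                handle.close()
        self.table_ = {r[self.key]: [float(r[c]) for c in self.columns_] for r in rows}
        return self

    def transform(self, X):
        """X is a sequence of identifiers; returns one row of values per identifier."""
        missing = [self.default] * len(self.columns_)
        return [list(self.table_.get(x, missing)) for x in X]

# Usage with an in-memory CSV (hypothetical data; Gender encoded as a number)
csv_text = "ID,Gender,Age,Height\nABC,1,34,1.82\nXYZ,0,51,1.65\n"
table = CSVLookupTable(io.StringIO(csv_text), "ID").fit(["ABC"])
print(table.columns_)
print(table.transform(["XYZ", "NEW"]))
```

The Java counterpart from step 2 would need the same CSV contents and the `key` name to emit matching MapValues elements; that part is not covered here.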

The Java part might be tricky, especially considering that there's not much documentation about it. But you could take a look at earlier commits that implemented different Scikit-Learn transformers. For example: https://github.com/jpmml/jpmml-sklearn/commit/5c4a181cd70d06c8179586cb3be0c881a9ba02e0

I will be unable to work on this till the 1st of November.

jibybabu commented 8 years ago

Ok. Thanks a ton, Villu! I can take care of the Python side. Please let me know here when it is ready; I will check regularly!

Also, just as a side note, it would be great if you could implement it in a smart way, as you always do, so that any future crazy transformations can be handled as long as somebody implements the corresponding Python class.

le-vision commented 8 years ago

The MapValues transformation is designed for mapping many input values to one output value.

Thanks for pointing me towards MapValues for representing a lookup table using PMML; the InlineTable type looks ideal.

In the application I'm working on the PMML needs to represent only that lookup logic, i.e. map from all combinations of 2 input values (e.g. age, height) to some output/target/label value (e.g. isTall).

Having read the above, I now think I should create a MapValues-type DerivedField and simply assign the value of that DerivedField to the output of the PMML model.

Is there an existing/simple/recommended way to produce such a PMML document? I can always write some code to produce the InlineTable XML representation from a pandas DataFrame or something, but perhaps there's an existing or better solution. Although it feels unnecessary, are there any transformations in sklearn that would produce such a MapValues DerivedField?
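A hand-rolled generator along the lines mentioned above might look like this sketch, which emits a single DerivedField whose MapValues element matches two input fields at once. The function name `lookup_derived_field`, the field names and the sample rows are hypothetical. MapValues matches table cells by value, so the sketch assumes age and height have already been binned into categories.

```python
import xml.etree.ElementTree as ET

def lookup_derived_field(name, inputs, output, rows, data_type="string"):
    """Build a DerivedField whose MapValues matches several input fields at once.

    inputs: field names, also used as InlineTable column names
    rows:   dicts holding one value per input plus the output value
    """
    derived = ET.Element("DerivedField", name=name, optype="categorical", dataType=data_type)
    mapping = ET.SubElement(derived, "MapValues", outputColumn=output)
    for field in inputs:
        # One pair per input: PMML field name -> table column name
        ET.SubElement(mapping, "FieldColumnPair", field=field, column=field)
    table = ET.SubElement(mapping, "InlineTable")
    for row in rows:
        entry = ET.SubElement(table, "row")
        for column in list(inputs) + [output]:
            ET.SubElement(entry, column).text = str(row[column])
    return derived

# Hypothetical, pre-binned lookup rows
rows = [
    {"age": "child", "height": "short", "isTall": "false"},
    {"age": "child", "height": "tall", "isTall": "true"},
    {"age": "adult", "height": "short", "isTall": "false"},
    {"age": "adult", "height": "tall", "isTall": "true"},
]
field = lookup_derived_field("isTall", ["age", "height"], "isTall", rows)
print(ET.tostring(field, encoding="unicode"))
```

If the table lives in a pandas DataFrame, `df.to_dict("records")` yields rows in the shape this function expects.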

I opened this SO question just before I found this thread: http://stackoverflow.com/questions/40498703/how-to-generate-a-pmml-that-represents-a-simple-lookup-table-logic-using-python

Thanks!

jibybabu commented 6 years ago

Thanks a lot Villu! Much awaited one!

On Mon, Jan 8, 2018 at 2:28 AM Villu Ruusmann notifications@github.com wrote:

Closed #14 https://github.com/jpmml/sklearn2pmml/issues/14 via 9d09570 https://github.com/jpmml/sklearn2pmml/commit/9d0957032eee352fd8572deee93463be62264f5a .
