Closed jibybabu closed 6 years ago
A lookup table (between two categorical value spaces) is represented in PMML using the MapValues element. The MapValues transformation can work with external data sources (such as CSV files or SQL databases), but most of the time it's easier if the data is inlined. The choice between those two options (i.e. external source vs. inline source) depends on how frequently and to what extent this lookup table needs to be updated, and by whom.
In your case, if those ten features are independent of one another, then you should define ten independent MapValues transformations (each with one input column and one output column). Otherwise, you should define one big MapValues transformation.
The fact that you're dealing with a 100'000-element lookup table is a bit concerning, because it would probably require some special tweaking/configuration at the PMML consumer side (such as the JPMML-Evaluator library) in order to ensure the best performance.
As for the conversion from Scikit-Learn representation to PMML representation, that is a low-level technicality. Which Scikit-Learn classes do you use for that in your application code? Can you provide a working example using some toy dataset?
Hi Villu, thanks for the quick turnaround! Excited to see your quick response!
So here is my story: I am building a model to solve a classification problem and need to convert it into JPMML for deployment. One of the model features is a string, say the name of a person, and my model looks up 10 additional numerical (decimal) features based on the name of the person. I have kept this map in a CSV file (Name, Col1, Col2, Col3, ..., Col10) with rows such as ("ABC", 1.534346, 5.1232343, ...). So whenever an observation comes in for prediction, the model needs to look up the respective 10 numerical values and use them for scoring.
Right now, I haven't used any scikit-learn transformations to map this; I am just populating the column values in Python by reading the lookup.csv file. Deploying it, i.e. converting it into JPMML, looks like a challenge now. I am ready to change the implementation.
Also, the lookup table doesn't need to be updated frequently. That is why I would like to embed everything into one file.
Any suggestions or advice are much appreciated!
Thanks, Jiby
The MapValues transformation is designed for mapping many input values to one output value. You're interested in achieving exactly the opposite: mapping one input value (e.g. entity's identifier) to many output values (e.g. entity's descriptors).
You could achieve this by defining many MapValues transformations, one for each output dimension. This could work if the size of the input value set is relatively small and fixed.
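The "one MapValues transformation per output dimension" idea can be sketched in Python as follows. This is only an illustration that uses the standard library to emit the XML by hand; the function name `map_values_fields`, the column names `input`/`output`, and the sample rows are assumptions made for this example, and it is not how sklearn2pmml generates PMML.

```python
import csv
import io
import xml.etree.ElementTree as ET


def map_values_fields(csv_text, key_column, input_field):
    """Build one DerivedField/MapValues element per non-key CSV column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    output_columns = [name for name in rows[0] if name != key_column]
    fields = []
    for column in output_columns:
        derived = ET.Element("DerivedField", name=column,
                             optype="continuous", dataType="double")
        mapping = ET.SubElement(derived, "MapValues",
                                outputColumn="output", dataType="double")
        # Bind the PMML input field to the "input" column of the inline table
        ET.SubElement(mapping, "FieldColumnPair", field=input_field, column="input")
        table = ET.SubElement(mapping, "InlineTable")
        for row in rows:
            row_el = ET.SubElement(table, "row")
            ET.SubElement(row_el, "input").text = row[key_column]
            ET.SubElement(row_el, "output").text = row[column]
        fields.append(derived)
    return fields


# Toy stand-in for lookup.csv (values are made up for the example)
lookup_csv = "Name,Col1,Col2\nABC,1.534346,5.1232343\nXYZ,0.25,7.5\n"
derived_fields = map_values_fields(lookup_csv, key_column="Name", input_field="Name")
for field in derived_fields:
    print(ET.tostring(field, encoding="unicode"))
```

Every generated element repeats the full key column, so the document grows with (number of rows) x (number of output columns). That is why this only looks practical for small, fixed input value sets.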
Is the "input value set is fixed" requirement met in your use case? In other words, are you going to make predictions only about persons that were known at the time when the model was trained (and converted to PMML)? Aren't you interested in making predictions about new persons?
Otherwise, I'd recommend using a layered approach, where the lookup table functionality is moved outside of the PMML document, to a specialized service. The latter could be some REST web service, or an SQL handler. You would be performing this call using a custom Java user-defined function (UDF):
<Apply function="com.mycompany.rest.PersonService">
<FieldRef name="Id"/> <!-- Entity identifier -->
<Constant>Col1, Col2, Col3, .., Col10</Constant> <!-- Entity attributes to select -->
</Apply>
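The same layering can also be done on the application side, before the record reaches the PMML evaluator. A minimal Python sketch, assuming a hypothetical `enrich` helper and an in-memory dictionary standing in for the REST service or SQL table:

```python
def enrich(record, lookup, id_field, attributes, default=None):
    """Return a copy of record with the requested attributes added from lookup."""
    entity = lookup.get(record[id_field], {})
    enriched = dict(record)
    for attribute in attributes:
        # Unknown entities (or missing attributes) fall back to the default
        enriched[attribute] = entity.get(attribute, default)
    return enriched


# Hypothetical stand-in for the external person service
person_table = {"ABC": {"Col1": 1.534346, "Col2": 5.1232343}}

known = enrich({"Id": "ABC"}, person_table, "Id", ["Col1", "Col2"])
unknown = enrich({"Id": "NEW"}, person_table, "Id", ["Col1", "Col2"])
print(known)
print(unknown)
```

With this split the lookup table can change without regenerating the PMML document, and persons that were unknown at training time are handled by the default value.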
Hi Vilu,
Thanks a lot for the quick response!
Is the "input value set is fixed" requirement met in your use case? Aren't you interested in making predictions about new persons?
Yes. As in this example, when a new person comes in the future, I need to look into the lookup table and get the respective 10 values. I am taking care of the scenarios where the person's name is not in the lookup table.
I was thinking of a workaround; let me know your suggestion from that perspective.
How about creating a customized transformation function, something like http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, for each of the 10 mappings, putting that into an sklearn pipeline and converting the pipeline into a DataFrameMapper?
Btw, do you have any example of converting an sklearn pipeline to a DataFrameMapper using sklearn2pmml?
Thanks, Jiby
How about creating a customized transformation function, something like http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, for each of the 10 mappings, putting that into an sklearn pipeline and converting the pipeline into a DataFrameMapper?
That would be an unnecessarily complex solution. It's possible to define a new Scikit-Learn transformer class and map it directly to any number of columns:
person_mapper = DataFrameMapper([
    (["Gender", "Age", "Height"], CSVLookupTable("persons.csv", "ID"))
])
Class CSVLookupTable would be defined in the sklearn2pmml.preprocessing module. It takes the name of the filesystem file and the name of the primary key column as arguments.
Do you have any example of converting an sklearn pipeline to a DataFrameMapper using sklearn2pmml?
The conversion of Scikit-Learn pipelines is discussed in a separate issue: https://github.com/jpmml/jpmml-sklearn/issues/3
Thanks Villu, that's a great approach! Let me try it out.
Btw, just to make sure about passing the "ID": the "ID" should be the primary key column name of the lookup table rather than the respective data set column name, correct? And this will make sure that we are taking care of scenarios outside of the data set, right?
Thanks, Jiby
You can't try it out, because class sklearn2pmml.preprocessing.CSVLookupTable is fictional.
But the solution to your problem could be implemented like this. If you can implement the Python side of this class, then I can do the rest.
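For illustration, the Python side of such a class might look roughly like the sketch below. It is written against the standard library only and just follows the fit/transform convention. A real Scikit-Learn transformer would normally also subclass `BaseEstimator` and `TransformerMixin`, which is left out here. The attribute names `columns_` and `table_` and the sample file contents are assumptions for this example.

```python
import csv
import os
import tempfile


class CSVLookupTable:
    """Hypothetical transformer: maps a key to the remaining CSV columns."""

    def __init__(self, filename, key_column):
        self.filename = filename
        self.key_column = key_column  # name of the primary key column

    def fit(self, X, y=None):
        # Load the whole table into memory, keyed by the primary key column
        with open(self.filename, newline="") as source:
            reader = csv.DictReader(source)
            self.columns_ = [c for c in reader.fieldnames if c != self.key_column]
            self.table_ = {
                row[self.key_column]: [row[c] for c in self.columns_]
                for row in reader
            }
        return self

    def transform(self, X):
        # Keys that are absent from the table map to a row of None values
        missing = [None] * len(self.columns_)
        return [self.table_.get(key, missing) for key in X]


# Toy persons.csv, written to a temporary directory for the example
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "persons.csv")
    with open(path, "w", newline="") as f:
        f.write("ID,Gender,Age,Height\nABC,F,34,180.5\n")
    lookup = CSVLookupTable(path, "ID").fit(None)

rows = lookup.transform(["ABC", "NEW"])
print(lookup.columns_, rows)
```

The values stay as strings in this sketch; type conversion and the treatment of unknown keys are design choices that the Java converter would have to mirror.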
I implemented that in Python just now. But the problem is: do the variables in the mapper and the estimator need to be in sync when running sklearn2pmml? I am not using the input variable (for example the name) in the model, only the output variables (for example Gender, Age, Height). If the input variable is not there in the JPMML, how can Java map it? Also, I won't be able to take care of new input values, say names, which are not in the data set but are in the lookup table?
And it's returning an error like this:
SEVERE: Failed to convert Estimator
java.lang.IllegalArgumentException
at org.jpmml.sklearn.FeatureMapper.updateActiveFields(FeatureMapper.java:236)
at sklearn.Classifier.createSchema(Classifier.java:59)
at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
at org.jpmml.sklearn.Main.run(Main.java:189)
at org.jpmml.sklearn.Main.main(Main.java:107)
Exception in thread "main" java.lang.IllegalArgumentException
at org.jpmml.sklearn.FeatureMapper.updateActiveFields(FeatureMapper.java:236)
at sklearn.Classifier.createSchema(Classifier.java:59)
at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
at org.jpmml.sklearn.Main.run(Main.java:189)
at org.jpmml.sklearn.Main.main(Main.java:107)
Traceback (most recent call last):
File "/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/IPython/core/interactiveshell.py", line 2869, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-48-dcba1c38aa0e>", line 1, in <module>
sklearn2pmml(rf, jobs_mapper, "sample.pmml", with_repr = False)
File "/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.py", line 65, in sklearn2pmml
subprocess.check_call(cmd)
File "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['java', '-cp', '/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/guava-19.0.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-converter-1.1.1.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.1.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.1.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-agent-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-schema-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pyrolite-4.13.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/serpent-1.12.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/var/folders/7r/mhw1rgdd729cw3rvrm2mtcj40000gn/T/estimator-3qlFmc.pkl.z', '--pkl-mapper-input', '/var/folders/7r/mhw1rgdd729cw3rvrm2mtcj40000gn/T/mapper-iKqPdf.pkl.z', '--pmml-output', 'sample.pmml']' returned 
non-zero exit status 1
A typical workflow for implementing a custom transformer:
MapValues transformations. You can communicate the name of the future input column by defining an appropriate "helper attribute" in the above Python class.
The Java part might be tricky, especially considering that there's not much documentation about it. But you could take a look at earlier commits that implemented different Scikit-Learn transformers. For example: https://github.com/jpmml/jpmml-sklearn/commit/5c4a181cd70d06c8179586cb3be0c881a9ba02e0
I will be unable to work on this till the 1st of November.
Ok. Thanks a ton Villu! I can take care of the Python side. Please let me know here when it is ready; I will check constantly!
Also, just on a side note, it would be great if you could implement it in a smarter way, as you always do, so that any future crazy transformations can be handled once somebody implements the corresponding Python class.
The MapValues transformation is designed for mapping many input values to one output value.
Thanks for pointing me towards MapValues for representing a lookup table using PMML, the InlineTable type looks ideal.
In the application I'm working on the PMML needs to represent only that lookup logic, i.e. map from all combinations of 2 input values (e.g. age, height) to some output/target/label value (e.g. isTall).
Having read the above I now think I should create a MapValues type DerivedField, and simply assign the value of that DerivedField to the output of the pmml model.
Is there an existing/simple/recommended way to produce such a pmml? I can always write some code to produce the inlineTable xml representation from a pandas dataframe or something, but perhaps there's an existing or better solution. Although it feels unnecessary, are there any transformations in sklearn that would produce such a MapValues DerivedField?
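One way to hand-write such a DerivedField is sketched below, using only the Python standard library. The function name `lookup_derived_field` and the sample rows are assumptions for this example. With pandas, a list of row dictionaries like this one can be obtained via `df.to_dict("records")`.

```python
import xml.etree.ElementTree as ET


def lookup_derived_field(name, rows, input_fields, output_column, data_type="string"):
    """Build a MapValues DerivedField that maps several input fields to one output."""
    derived = ET.Element("DerivedField", name=name,
                         optype="categorical", dataType=data_type)
    mapping = ET.SubElement(derived, "MapValues",
                            outputColumn=output_column, dataType=data_type)
    # One FieldColumnPair per input field; table columns reuse the field names
    for field in input_fields:
        ET.SubElement(mapping, "FieldColumnPair", field=field, column=field)
    table = ET.SubElement(mapping, "InlineTable")
    for row in rows:
        row_el = ET.SubElement(table, "row")
        for column in list(input_fields) + [output_column]:
            ET.SubElement(row_el, column).text = str(row[column])
    return derived


# Made-up combinations of two input values and their label
rows = [
    {"age": "10", "height": "150", "isTall": "true"},
    {"age": "30", "height": "150", "isTall": "false"},
]
field = lookup_derived_field("isTall", rows, ["age", "height"], "isTall")
print(ET.tostring(field, encoding="unicode"))
```

MapValues matches input values exactly, so continuous inputs such as age or height would probably need to be discretized first for the table to enumerate "all combinations".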
I opened this SO question just before I found this thread: http://stackoverflow.com/questions/40498703/how-to-generate-a-pmml-that-represents-a-simple-lookup-table-logic-using-python
Thanks!
Thanks a lot Villu! Much awaited one!
On Mon, Jan 8, 2018 at 2:28 AM Villu Ruusmann notifications@github.com wrote:
Closed #14 https://github.com/jpmml/sklearn2pmml/issues/14 via 9d09570 https://github.com/jpmml/sklearn2pmml/commit/9d0957032eee352fd8572deee93463be62264f5a .
Hi,
I was wondering if there is any way I can embed a lookup table into the PMML file using sklearn2pmml. For example, my model has 40 features. Out of those 40, I am filling 10 features from a CSV file of shape 100000 x 10. I would like to attach this info to the PMML file so that when the model is running, it will take the values from there.
Thanks, Jiby