avoid hard coding each variable name in the score function

sassoftware / sas-viya-dmml-pipelines

Code examples and supporting materials for data mining and machine learning techniques on the SAS Viya environment.

Apache License 2.0


avoid hard coding each variable name in the score function #10

Open ryanma9629 opened 1 year ago

ryanma9629 commented 1 year ago

In each of the hmeq_score.py files, the variable names are hard-coded into the score_method function. Is there a better way to define the scoring function? Our customers' datasets often have thousands of variables, and entering them manually one by one is impractical.

In addition, when the open source node has preprocessing predecessors, the variable names are often prefixed with IMP, WOEENC, etc. We would like to obtain such names automatically from the predecessor node via SAS macro variables, as in hmeq_train.py.
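A minimal sketch of the kind of pattern being asked for, assuming the list of input names can be written to a file at training time and read back at scoring time. The helper names, the file name `input_names.json`, and the example variable names are hypothetical and are not taken from the repository. Whether the scoring runtime accepts a function without explicitly named arguments is not confirmed here.

```python
import json
import os
import tempfile


def save_input_names(input_names, path):
    """Persist the ordered list of model input names (run at training time)."""
    with open(path, "w") as f:
        json.dump(list(input_names), f)


def load_input_names(path):
    """Read the ordered list of model input names (run at scoring time)."""
    with open(path) as f:
        return json.load(f)


def build_record(input_names, values):
    """Pair positional values with the saved names instead of hard-coding them."""
    if len(values) != len(input_names):
        raise ValueError(
            f"expected {len(input_names)} values, got {len(values)}"
        )
    return dict(zip(input_names, values))


if __name__ == "__main__":
    # Hypothetical names, including prefixes added by upstream preprocessing.
    names = ["IMP_DEBTINC", "IMP_CLAGE", "WOEENC_JOB"]
    path = os.path.join(tempfile.mkdtemp(), "input_names.json")

    save_input_names(names, path)            # training side
    loaded = load_input_names(path)          # scoring side
    record = build_record(loaded, [34.8, 94.4, 0.12])
    print(record)
```

With this layout the scoring code never spells out a variable name; it only depends on the saved list staying in the same order as the values it receives.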

rmyneni commented 1 year ago

I agree that we could make writing score code simpler by using variables rather than the actual input names, similar to what we support in train code. It requires some underlying code conversion before the code is sent to Model Manager (where the actual scoring happens). Hopefully we can get to adding this capability soon; currently we are focused on rewriting our mid-tier and UI to reduce footprint for better performance. Thanks for bringing this up. I will pass it along to product management for prioritization.

ryanma9629 commented 1 year ago

Thanks, Radhikha! I also have a question: why is the scoring function designed to score only one record at a time? I'm not sure if there is a special reason for this. Most of our customers' Python models are used for batch scoring, and performance is very poor when records are scored one by one. Could the input be a complete pandas DataFrame, and the output a DataFrame or a list of predictions for multiple records?
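To illustrate the request, here is a rough sketch of a batch scoring function that takes a pandas DataFrame and returns one row of predictions per input row, with the single-record case as a thin wrapper around it. The `ToyLogisticModel` class, its coefficients, and the output column names `P_BAD0` and `P_BAD1` are made up for the example and stand in for a real trained model; none of this reflects an existing SAS interface.

```python
import numpy as np
import pandas as pd


class ToyLogisticModel:
    """Stand-in for a trained binary classifier; coefficients are invented."""

    def __init__(self, input_names, coefs, intercept):
        self.input_names = list(input_names)
        self.coefs = np.asarray(coefs, dtype=float)
        self.intercept = float(intercept)

    def predict_proba(self, df):
        # One matrix product scores every row at once.
        x = df[self.input_names].to_numpy(dtype=float)
        z = x @ self.coefs + self.intercept
        p1 = 1.0 / (1.0 + np.exp(-z))
        return np.column_stack([1.0 - p1, p1])


def score_batch(model, df):
    """Score all rows of df in one call; returns a DataFrame aligned to df."""
    proba = model.predict_proba(df)
    return pd.DataFrame(
        {"P_BAD0": proba[:, 0], "P_BAD1": proba[:, 1]}, index=df.index
    )


def score_record(model, **inputs):
    """Single-record scoring expressed as a one-row batch."""
    out = score_batch(model, pd.DataFrame([inputs]))
    return float(out["P_BAD0"].iloc[0]), float(out["P_BAD1"].iloc[0])


if __name__ == "__main__":
    model = ToyLogisticModel(["IMP_DEBTINC", "IMP_CLAGE"], [0.05, -0.01], -1.0)
    batch = pd.DataFrame(
        {"IMP_DEBTINC": [30.0, 45.0, 10.0], "IMP_CLAGE": [100.0, 50.0, 200.0]}
    )
    print(score_batch(model, batch))
    print(score_record(model, IMP_DEBTINC=30.0, IMP_CLAGE=100.0))
```

The batch version avoids the per-call overhead of building a one-row DataFrame and invoking the model once per record, which is typically where row-by-row scoring loses time.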

rmyneni commented 1 year ago

Sorry for the delayed response; I had to reach out to the SAS Model Manager team for this. I totally understand the need to speed up scoring by using batches, but that kind of capability requires support from both the scoring function and the underlying runtime environment. I was told that we will be able to support this for Python models in SCR (SAS Container Runtime) later this year.