Thanks, yes, this is a problem currently. (Linking #551 as well) And thank you for providing details, that will help us troubleshoot!
Yeah, how is it that we don't hit this in our pipeline?
@meta can you share more of your test code so we can repro the error?
sure, here's the repro notebook: https://github.com/meta/notebooks/blob/main/hummingbird_xgboost.ipynb
OK, so apparently the problem is pandas being used for training. If you replace `xg_reg.fit(X_train, y_train)` with `xg_reg.fit(X_train.to_numpy(), y_train.to_numpy())`, it should work. For predict you can still use pandas, so you can do `xg_torch.predict(X_test)`.
Still, we will try to fix the pandas problem for training, because forcing users to use numpy to train their xgboost models is not convenient.
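Putting the workaround together, a minimal sketch (toy data standing in for the notebook's; not tested end to end):

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from hummingbird.ml import convert

# Toy stand-in for the notebook's data
X = pd.DataFrame({"feat_a": np.random.rand(100), "feat_b": np.random.rand(100)})
y = pd.Series(np.random.rand(100))
X_train, X_test, y_train = X.iloc[:80], X.iloc[80:], y.iloc[:80]

xg_reg = xgb.XGBRegressor(n_estimators=10)

# Workaround: train on numpy arrays, not on the DataFrame
xg_reg.fit(X_train.to_numpy(), y_train.to_numpy())

# Conversion to torch should now succeed
xg_torch = convert(xg_reg, "torch")

# For prediction, a pandas DataFrame is still fine
preds = xg_torch.predict(X_test)
```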
I'm still receiving the same error on v0.4.9 with string feature names in the trained XGBClassifier model when calling `convert(model, "torch")`. Using `to_numpy()` in `model.fit()` works, though. Am I missing something here?
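For reference, a minimal sketch of what triggers it for me (toy data; the first column name matches the one in the error below):

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from hummingbird.ml import convert

X = pd.DataFrame({
    "liquidation_time_since_last_liquidated": np.random.rand(40),
    "another_scaled_float": np.random.rand(40),
})
y = np.random.randint(0, 2, size=40)

model = XGBClassifier(n_estimators=5, max_depth=2)
model.fit(X, y)  # DataFrame: string feature names end up in the booster

model_torch = convert(model, "torch")  # raises the ValueError below
```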
Complete stack trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/asad/Desktop/v2_final_model.ipynb Cell 12 line 2
1 # convert XGB model to torch using hummingbird
----> 2 model_torch = convert(model, 'torch')
4 print(model_torch)
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/convert.py:444, in convert(model, backend, test_input, device, extra_config)
409 """
410 This function converts the specified input *model* into an implementation targeting *backend*.
411 *Convert* supports [Sklearn], [LightGBM], [XGBoost], [ONNX], and [SparkML] models.
(...)
441 A model implemented in *backend*, which is equivalent to the input model
442 """
443 assert constants.REMAINDER_SIZE not in extra_config
--> 444 return _convert_common(model, backend, test_input, device, extra_config)
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/convert.py:394, in _convert_common(model, backend, test_input, device, extra_config)
391 _supported_backend_check_config(model, backend_formatted, extra_config)
393 if type(model) in xgb_operator_list:
--> 394 return _convert_xgboost(model, backend_formatted, test_input, device, extra_config)
396 if type(model) in lgbm_operator_list:
397 return _convert_lightgbm(model, backend_formatted, test_input, device, extra_config)
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/convert.py:153, in _convert_xgboost(model, backend, test_input, device, extra_config)
148 else:
149 raise RuntimeError(
150 "XGBoost converter is not able to infer the number of input features.\
151 Please pass some test_input to the converter."
152 )
--> 153 return _convert_sklearn(model, backend, test_input, device, extra_config)
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/convert.py:111, in _convert_sklearn(model, backend, test_input, device, extra_config)
108 topology = parse_sklearn_api_model(model, extra_config)
110 # Convert the Topology object into a PyTorch model.
--> 111 hb_model = topology_converter(topology, backend, test_input, device, extra_config=extra_config)
112 return hb_model
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/_topology.py:222, in convert(topology, backend, test_input, device, extra_config)
215 if parse(torch.__version__) <= Version("1.4"):
216 # Raise en error and warn user that the torch version is not supported with onnx backend
217 raise Exception(
218 f"The current torch version {torch.__version__} is not supported with {backend} backend. "
219 "Please use a torch version > 1.4 or change the backend."
220 )
--> 222 operator_map[operator.full_name] = converter(operator, device, extra_config)
224 # Set the parameters for the model / container
225 n_threads = None if constants.N_THREADS not in extra_config else extra_config[constants.N_THREADS]
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/operator_converters/xgb.py:107, in convert_sklearn_xgb_classifier(operator, device, extra_config)
104 tree_infos = operator.raw_operator.get_booster().get_dump()
105 n_classes = operator.raw_operator.n_classes_
--> 107 return convert_gbdt_classifier_common(
108 operator, tree_infos, _get_tree_parameters, n_features, n_classes, decision_cond="<", extra_config=extra_config
109 )
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/operator_converters/_gbdt_commons.py:63, in convert_gbdt_classifier_common(operator, tree_infos, get_tree_parameters, n_features, n_classes, classes, extra_config, decision_cond)
60 if reorder_trees and n_classes > 1:
61 tree_infos = [tree_infos[i * n_classes + j] for j in range(n_classes) for i in range(len(tree_infos) // n_classes)]
---> 63 return convert_gbdt_common(
64 operator, tree_infos, get_tree_parameters, n_features, classes, extra_config=extra_config, decision_cond=decision_cond
65 )
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/operator_converters/_gbdt_commons.py:89, in convert_gbdt_common(operator, tree_infos, get_tree_parameters, n_features, classes, extra_config, decision_cond)
86 assert get_tree_parameters is not None
87 assert n_features is not None
---> 89 tree_parameters, max_depth, tree_type = get_tree_params_and_type(tree_infos, get_tree_parameters, extra_config)
91 # Apply learning rate directly on the values rather then at runtime.
92 if constants.LEARNING_RATE in extra_config:
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/operator_converters/_tree_commons.py:223, in get_tree_params_and_type(tree_infos, get_tree_parameters, extra_config)
210 def get_tree_params_and_type(tree_infos, get_tree_parameters, extra_config):
211 """
212 Populate the parameters from the trees and pick the tree implementation strategy.
213
(...)
221 The tree parameters, the maximum tree-depth and the tre implementation to use
222 """
--> 223 tree_parameters = [get_tree_parameters(tree_info, extra_config) for tree_info in tree_infos]
224 max_depth = max(1, _find_max_depth(tree_parameters))
225 tree_type = get_tree_implementation_by_config_or_depth(extra_config, max_depth)
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/operator_converters/_tree_commons.py:223, in <listcomp>(.0)
210 def get_tree_params_and_type(tree_infos, get_tree_parameters, extra_config):
211 """
212 Populate the parameters from the trees and pick the tree implementation strategy.
213
(...)
221 The tree parameters, the maximum tree-depth and the tre implementation to use
222 """
--> 223 tree_parameters = [get_tree_parameters(tree_info, extra_config) for tree_info in tree_infos]
224 max_depth = max(1, _find_max_depth(tree_parameters))
225 tree_type = get_tree_implementation_by_config_or_depth(extra_config, max_depth)
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/operator_converters/xgb.py:77, in _get_tree_parameters(tree_info, extra_config)
75 for f_id, f_name in enumerate(feature_names):
76 tree_info = tree_info.replace(f_name, str(f_id))
---> 77 _tree_traversal(
78 tree_info.replace("[f", "").replace("[", "").replace("]", "").split(), lefts, rights, features, thresholds, values
79 )
81 return TreeParameters(lefts, rights, features, thresholds, values)
File ~/anaconda3/envs/zkml-ezkl/lib/python3.10/site-packages/hummingbird/ml/operator_converters/xgb.py:33, in _tree_traversal(tree_info, lefts, rights, features, thresholds, values)
31 count += 1
32 else:
---> 33 features.append(int(tree_info[count].split(":")[1].split("<")[0].replace("[f", "")))
34 thresholds.append(float(tree_info[count].split(":")[1].split("<")[1].replace("]", "")))
35 values.append([-1])
ValueError: invalid literal for int() with base 10: 'liquidation_time_since_last_liquidated'
It looks like this is happening because you have a categorical feature, which we don't support yet.
Thanks for the quick reply. However, all feature values are scaled floats.
In the xgboost 1.5.0 release, a breaking change was introduced to the model format that causes this error when trying to use `convert`. The error happens because of this line: https://github.com/microsoft/hummingbird/blob/main/hummingbird/ml/operator_converters/xgb.py#L33
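To see what that line chokes on, here is its parsing logic applied to single dump nodes (logic lifted from the traceback above; the node strings are made-up examples):

```python
# Mirrors the parse in xgb.py's _tree_traversal for one internal node
# (the surrounding code has already stripped "[", "]", and "[f" from the dump)
def parse_node(token):
    feature = int(token.split(":")[1].split("<")[0].replace("[f", ""))
    threshold = float(token.split(":")[1].split("<")[1].replace("]", ""))
    return feature, threshold

print(parse_node("0:12<0.5"))  # (12, 0.5): integer-indexed features parse fine

parse_node("0:liquidation_time_since_last_liquidated<0.5")
# ValueError: invalid literal for int() with base 10:
#   'liquidation_time_since_last_liquidated'
```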
Previously, XGBoost seemed to store each feature as an integer index, but now it stores the feature's actual name. I tried changing the conversion to parse the name in another base, say base 36, to get a one-to-one integer mapping for the feature name, but that causes another issue further down in the indexing. It seems like the library may need to build its own index over the features.
Dataset used: `sklearn.datasets.load_boston`
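The format change is visible directly in the booster's text dump. A quick sketch (random data; `CRIM`/`ZN` are just illustrative column names from the Boston dataset):

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

X = pd.DataFrame({"CRIM": np.random.rand(50), "ZN": np.random.rand(50)})
y = np.random.rand(50)

# Trained on raw arrays: dump nodes use integer indices, e.g. "[f0<0.47]"
m_np = XGBRegressor(n_estimators=1, max_depth=2).fit(X.to_numpy(), y)
print(m_np.get_booster().get_dump()[0])

# Trained on a DataFrame: dump nodes use the column names, e.g. "[CRIM<0.47]",
# which the int() parse above cannot handle
m_df = XGBRegressor(n_estimators=1, max_depth=2).fit(X, y)
print(m_df.get_booster().get_dump()[0])
```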