Describe the bug
When the EMBER classifier predicts on a malware sample, it raises LightGBMError: The number of features in data (73802) is not the same as it was in training data (2381).
I searched for this error and found reports that it is caused by a mismatched lief version. However, lief==0.9.0 cannot be installed on Python 3.9.
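One observation from the traceback further down: x is built with np.frombuffer(code, dtype=np.uint8), so it has one column per byte of the file. The sketch below uses plain NumPy instead of secml's CArray, and the helper names (raw_bytes_as_row, shape_matches) are made up for illustration. It only shows that a 73802-byte input yields 73802 columns, which would match the first number in the error message. Whether this, rather than the lief version, is the actual cause is not confirmed here.

```python
import numpy as np

# Feature count the trained model expects, taken from the error message.
EXPECTED_FEATURES = 2381


def raw_bytes_as_row(code: bytes) -> np.ndarray:
    """Mirror the cell in the traceback: one uint8 column per byte of the file."""
    return np.atleast_2d(np.frombuffer(code, dtype=np.uint8))


def shape_matches(x: np.ndarray, expected: int = EXPECTED_FEATURES) -> bool:
    """Return True when x is 2-D and has the number of columns the model expects."""
    return x.ndim == 2 and x.shape[1] == expected


# Stand-in for a file's contents; 73802 is the size reported in the error.
code = bytes(73802)
x = raw_bytes_as_row(code)
print(x.shape, shape_matches(x))
```

A check like this, run just before net.predict, would surface the mismatch without going through the LightGBM C API.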
To Reproduce
Expected behavior
EMBER should classify the malware sample without raising an error.
Library info
lief == 0.12.1
python == 3.9.12
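For anyone trying to reproduce this, a minimal sketch for collecting the versions above from the active environment. It relies only on the standard library; the package names passed in are assumed to be the PyPI distribution names, and installed_versions is a hypothetical helper, not part of secml_malware.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names assumed to match their PyPI distribution names.
report = installed_versions(["lief", "lightgbm", "secml"])
for name, version in report.items():
    print(f"{name} == {version}")
```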
System info
OS: Windows
Version: 0.2.4
Additional context
The full traceback follows:
LightGBMError Traceback (most recent call last)
Input In [18], in <cell line: 6>()
18 max_length = max(max_length, len(code))
19 x = CArray(np.frombuffer(code, dtype=np.uint8)).atleast_2d()
---> 20 print(net.predict(x, return_decision_function=False))
File ~\anaconda3\envs\secml_malware_env\lib\site-packages\secml\ml\classifiers\c_classifier.py:293, in CClassifier.predict(self, x, return_decision_function)
266 def predict(self, x, return_decision_function=False):
267 """Perform classification of each pattern in x.
268
269 If preprocess has been specified,
(...)
291
292 """
--> 293 scores = self.decision_function(x, y=None)
295 # The classification label is the label of the class
296 # associated with the highest score
297 labels = scores.argmax(axis=1).ravel()
File ~\anaconda3\envs\secml_malware_env\lib\site-packages\secml\ml\classifiers\c_classifier.py:222, in CClassifier.decision_function(self, x, y)
194 def decision_function(self, x, y=None):
195 """Computes the decision function for each pattern in x.
196
197 If a preprocess has been specified, input is normalized
(...)
220
221 """
--> 222 scores = self.forward(x, caching=False)
223 return scores if y is None else scores[:, y].ravel()
File ~\anaconda3\envs\secml_malware_env\lib\site-packages\secml\ml\c_module.py:204, in CModule.forward(self, x, caching)
202 # Transform data using inner preprocess, if defined
203 x = self._forward_preprocess(x=x, caching=caching)
--> 204 return self._forward(x)
File ~\Desktop\secml_malware\secml_malware\models\c_classifier_ember.py:63, in CClassifierEmber._forward(self, x)
61 def _forward(self, x):
62 x = x.atleast_2d()
---> 63 scores = self._lightgbm_model.predict(x.tondarray())
64 confidence = [[1 - c, c] for c in scores]
65 confidence = CArray(confidence)
File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:3538, in Booster.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape, **kwargs)
3536 else:
3537 num_iteration = -1
-> 3538 return predictor.predict(data, start_iteration, num_iteration,
3539 raw_score, pred_leaf, pred_contrib,
3540 data_has_header, is_reshape)
File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:848, in _InnerPredictor.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape)
846 preds, nrow = self.__pred_for_csc(data, start_iteration, num_iteration, predict_type)
847 elif isinstance(data, np.ndarray):
--> 848 preds, nrow = self.__pred_for_np2d(data, start_iteration, num_iteration, predict_type)
849 elif isinstance(data, list):
850 try:
File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:938, in _InnerPredictor.__pred_for_np2d(self, mat, start_iteration, num_iteration, predict_type)
936 return preds, nrow
937 else:
--> 938 return inner_predict(mat, start_iteration, num_iteration, predict_type)
File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:908, in _InnerPredictor.__pred_for_np2d.<locals>.inner_predict(mat, start_iteration, num_iteration, predict_type, preds)
906 raise ValueError("Wrong length of pre-allocated predict array")
907 out_num_preds = ctypes.c_int64(0)
--> 908 _safe_call(_LIB.LGBM_BoosterPredictForMat(
909 self.handle,
910 ptr_data,
911 ctypes.c_int(type_ptr_data),
912 ctypes.c_int32(mat.shape[0]),
913 ctypes.c_int32(mat.shape[1]),
914 ctypes.c_int(C_API_IS_ROW_MAJOR),
915 ctypes.c_int(predict_type),
916 ctypes.c_int(start_iteration),
917 ctypes.c_int(num_iteration),
918 c_str(self.pred_parameter),
919 ctypes.byref(out_num_preds),
920 preds.ctypes.data_as(ctypes.POINTER(ctypes.c_double))))
921 if n_preds != out_num_preds.value:
922 raise ValueError("Wrong length for predict results")
File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:125, in _safe_call(ret)
117 """Check the return value from C API call.
118
119 Parameters
(...)
122 The return value from C API calls.
123 """
124 if ret != 0:
--> 125 raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
LightGBMError: The number of features in data (73802) is not the same as it was in training data (2381).
You can set predict_disable_shape_check=true to discard this error, but please be aware what you are doing.