pralab / secml_malware

Create adversarial attacks against machine learning Windows malware detectors
https://secml-malware.readthedocs.io/
GNU General Public License v3.0

can't attack EMBER model #33

Status: Closed (cease2e closed this issue 2 years ago)

cease2e commented 2 years ago

Describe the bug: When I ask the EMBER model to classify a malware sample, prediction fails with LightGBMError: The number of features in data (73802) is not the same as it was in training data (2381).

I searched for this error and found reports that it is caused by a mismatched lief version, but lief==0.9.0 cannot be installed on Python 3.9.

To Reproduce: [image]

Expected behavior: EMBER classifies the malware sample without raising an error.

Library info: lief == 0.12.1, python == 3.9.12


Additional context

The following is the detailed error reporting information

LightGBMError                             Traceback (most recent call last)
Input In [18], in <cell line: 6>()
     18 max_length = max(max_length, len(code))
     19 x = CArray(np.frombuffer(code, dtype=np.uint8)).atleast_2d()
---> 20 print(net.predict(x, return_decision_function=False))

File ~\anaconda3\envs\secml_malware_env\lib\site-packages\secml\ml\classifiers\c_classifier.py:293, in CClassifier.predict(self, x, return_decision_function)
    266 def predict(self, x, return_decision_function=False):
    267     """Perform classification of each pattern in x.
    268
    269     If preprocess has been specified,
   (...)
    291
    292     """
--> 293     scores = self.decision_function(x, y=None)
    295     # The classification label is the label of the class
    296     # associated with the highest score
    297     labels = scores.argmax(axis=1).ravel()

File ~\anaconda3\envs\secml_malware_env\lib\site-packages\secml\ml\classifiers\c_classifier.py:222, in CClassifier.decision_function(self, x, y)
    194 def decision_function(self, x, y=None):
    195     """Computes the decision function for each pattern in x.
    196
    197     If a preprocess has been specified, input is normalized
   (...)
    220
    221     """
--> 222     scores = self.forward(x, caching=False)
    223     return scores if y is None else scores[:, y].ravel()

File ~\anaconda3\envs\secml_malware_env\lib\site-packages\secml\ml\c_module.py:204, in CModule.forward(self, x, caching)
    202 # Transform data using inner preprocess, if defined
    203 x = self._forward_preprocess(x=x, caching=caching)
--> 204 return self._forward(x)

File ~\Desktop\secml_malware\secml_malware\models\c_classifier_ember.py:63, in CClassifierEmber._forward(self, x)
     61 def _forward(self, x):
     62     x = x.atleast_2d()
---> 63     scores = self._lightgbm_model.predict(x.tondarray())
     64     confidence = [[1 - c, c] for c in scores]
     65     confidence = CArray(confidence)

File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:3538, in Booster.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape, **kwargs)
   3536 else:
   3537     num_iteration = -1
-> 3538 return predictor.predict(data, start_iteration, num_iteration,
   3539                          raw_score, pred_leaf, pred_contrib,
   3540                          data_has_header, is_reshape)

File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:848, in _InnerPredictor.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape)
    846     preds, nrow = self.__pred_for_csc(data, start_iteration, num_iteration, predict_type)
    847 elif isinstance(data, np.ndarray):
--> 848     preds, nrow = self.__pred_for_np2d(data, start_iteration, num_iteration, predict_type)
    849 elif isinstance(data, list):
    850     try:

File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:938, in _InnerPredictor.__pred_for_np2d(self, mat, start_iteration, num_iteration, predict_type)
    936     return preds, nrow
    937 else:
--> 938     return inner_predict(mat, start_iteration, num_iteration, predict_type)

File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:908, in _InnerPredictor.__pred_for_np2d.<locals>.inner_predict(mat, start_iteration, num_iteration, predict_type, preds)
    906     raise ValueError("Wrong length of pre-allocated predict array")
    907 out_num_preds = ctypes.c_int64(0)
--> 908 _safe_call(_LIB.LGBM_BoosterPredictForMat(
    909     self.handle,
    910     ptr_data,
    911     ctypes.c_int(type_ptr_data),
    912     ctypes.c_int32(mat.shape[0]),
    913     ctypes.c_int32(mat.shape[1]),
    914     ctypes.c_int(C_API_IS_ROW_MAJOR),
    915     ctypes.c_int(predict_type),
    916     ctypes.c_int(start_iteration),
    917     ctypes.c_int(num_iteration),
    918     c_str(self.pred_parameter),
    919     ctypes.byref(out_num_preds),
    920     preds.ctypes.data_as(ctypes.POINTER(ctypes.c_double))))
    921 if n_preds != out_num_preds.value:
    922     raise ValueError("Wrong length for predict results")

File ~\anaconda3\envs\secml_malware_env\lib\site-packages\lightgbm\basic.py:125, in _safe_call(ret)
    117 """Check the return value from C API call.
    118
    119 Parameters
   (...)
    122     The return value from C API calls.
    123 """
    124 if ret != 0:
--> 125     raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))

LightGBMError: The number of features in data (73802) is not the same as it was in training data (2381). You can set predict_disable_shape_check=true to discard this error, but please be aware what you are doing.

zangobot commented 2 years ago

Hello! EMBER has a feature extraction phase. You should convert malware into a feature vector before feeding it to the model.
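
To make the mismatch concrete: in the traceback, x is built with np.frombuffer(code, dtype=np.uint8), so every byte of the file becomes one column, and 73802 is simply the file size in bytes. The model wants the 2381 extracted features instead. Below is a minimal sketch of that "extract first, then check the length" step. The helper to_feature_vector and the two stand-in extractors are hypothetical names made up for illustration, not part of secml_malware. The extractor is passed in as a parameter so the snippet runs without ember installed; with ember available, I would assume something like ember.PEFeatureExtractor(feature_version=2).feature_vector can be passed instead, but please check that against the version you have.

```python
from typing import Callable, List, Sequence

# 2381 is the feature count named in the LightGBMError above.
EXPECTED_DIM = 2381


def to_feature_vector(
    raw_bytes: bytes,
    extract: Callable[[bytes], Sequence[float]],
    expected_dim: int = EXPECTED_DIM,
) -> List[float]:
    """Run a feature extractor on raw PE bytes and check the vector length.

    `extract` is injected so this sketch does not depend on ember.
    Assumption: with ember installed, a real extractor could be
        ember.PEFeatureExtractor(feature_version=2).feature_vector
    """
    features = [float(v) for v in extract(raw_bytes)]
    if len(features) != expected_dim:
        raise ValueError(
            f"got {len(features)} features, model expects {expected_dim}; "
            "was the raw byte array passed instead of extracted features?"
        )
    return features


def fake_extractor(raw_bytes: bytes) -> List[float]:
    # Hypothetical stand-in for a real extractor: fixed-length output.
    return [0.0] * EXPECTED_DIM


def identity_extractor(raw_bytes: bytes) -> List[float]:
    # Mimics the bug in the report: every byte is treated as one feature.
    return [float(b) for b in raw_bytes]


sample = bytes(73802)  # same length as the file in the report

vector = to_feature_vector(sample, fake_extractor)
print(len(vector))

try:
    to_feature_vector(sample, identity_extractor)
except ValueError as err:
    print(err)
```

The length check fails early with a readable message, rather than letting LightGBM reject the input deep inside predict. Only the vector returned by the extractor should be handed to the model, not the raw bytes.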