nokitoino / DecompilerAI

Converting Assembly back to C Code using Transformers.
GNU General Public License v3.0

MSP T5 Fine Tuning #6

Open pathquester opened 8 months ago

pathquester commented 8 months ago

Projects like CodeT5 use masked span prediction for better context understanding. Do you think this will be necessary?
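For context, masked span prediction (span corruption) in the T5 family replaces contiguous spans of the input with sentinel tokens and trains the decoder to reconstruct them. A minimal illustration of that input/target formatting on a C snippet (the function and mask positions are hypothetical, chosen only for illustration):

```python
# Hypothetical example of T5-style masked span prediction (span corruption).
# Spans in the encoder input are replaced by sentinel tokens; the decoder
# target lists each sentinel followed by the span it hides.

source = "int add(int a, int b) { return a + b; }"

# Corrupted encoder input: two spans replaced by sentinels.
masked_input = "int add(int a, <extra_id_0>) { return <extra_id_1>; }"

# Decoder target reconstructing the masked spans.
target = "<extra_id_0> int b <extra_id_1> a + b"
```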

nokitoino commented 8 months ago

I have been in contact with a malware analysis group that had trouble obtaining cleanly labeled data. They work with few-shot learning, so pretraining with masking on unlabeled data matters a lot for them. In our project we can get huge amounts of labeled data (codeparrot/github-code), but I still assume that pretraining on C and assembly respectively will improve accuracy. We need to run experiments to answer your question.

Currently the LongT5 does its job in pure seq2seq translation, and the results are explainable so far. When it fails on certain code, it is not because it lacks understanding of the semantics, but because of the information we withhold. For example, we hide exact offset values (so it performs badly on array arithmetic), we do not provide the memory sections where global variables, strings, structs, etc. are stored, we give no information about the other functions it calls (so it cannot know their argument types or return types), and there are many more fundamental problems. The model performs moderately on control flow. We are now trying to fix these problems step by step. In the end one can test different models, do hyperparameter search, try pretraining, and so on, but the core problems take priority.
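A minimal sketch of the pure seq2seq setup described above, assuming the HuggingFace transformers library and a public LongT5 checkpoint (the checkpoint name and the asm/C strings are placeholders, not the project's actual training data or model):

```python
# Sketch: fine-tuning a LongT5 model on assembly -> C pairs as plain seq2seq.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "google/long-t5-tglobal-base"  # assumption: any LongT5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

asm = "push rbp\nmov rbp, rsp\nmov eax, edi\nadd eax, esi\npop rbp\nret"
c_code = "int add(int a, int b) { return a + b; }"

# The encoder sees the disassembled function; the decoder is trained to emit C.
inputs = tokenizer(asm, return_tensors="pt", truncation=True)
labels = tokenizer(text_target=c_code, return_tensors="pt", truncation=True).input_ids

loss = model(**inputs, labels=labels).loss
loss.backward()  # one training step's backward pass (optimizer omitted)
```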