microsoft / CodeBERT

CodeBERT
MIT License

App clone detection #289

Open emreaydogan opened 1 year ago

emreaydogan commented 1 year ago

Hi,

1. I am currently undertaking a project in which I intend to determine whether two applications (not just code snippets or functions) are similar. In light of this, I am considering using GraphCodeBERT or UniXcoder. Could you kindly advise which of the two would be more suitable for my purposes?

2. Currently, I am using GraphCodeBERT and following this issue, https://github.com/microsoft/CodeBERT/issues/53, because my problem is multiclass rather than binary. I changed the tokenizer_name and model_name_or_path from microsoft/codebert-base to microsoft/graphcodebert-base in train.sh and inference.sh.
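For reference, the swap described above amounts to changing the pretrained checkpoint name passed to the run script. A sketch of the relevant flags (flag names follow the clone-detection run scripts in this repo; the remaining flags are elided and unchanged):

```shell
python run.py \
    --tokenizer_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    ...  # remaining flags as in the original train.sh
```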

I have a dataset of Android applications collected from different Android markets; I do not use any benchmark dataset. I extracted Java functions from the source code and converted them into train.jsonl, valid.jsonl, and test.jsonl files. The actual training dataset can be accessed here: train.txt. (Please change the file extension from .txt to .jsonl before use.)

The dataset format is as follows:

```json
{"code": "func1 for app0", "label": 0}
{"code": "func2 for app0", "label": 0}
{"code": "func1 for app1", "label": 1}
{"code": "func2 for app1", "label": 1}
{"code": "func1 for app2", "label": 2}
{"code": "func2 for app2", "label": 2}
{"code": "func3 for app2", "label": 2}
```
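The conversion step can be sketched as follows: one JSON object per line, where the label is the app the function was extracted from. This is a minimal illustration of the format above, not my actual extraction pipeline; the in-memory layout of `apps` is hypothetical.

```python
import json

# Hypothetical mapping from an app's integer label to its extracted
# Java function bodies (placeholder strings stand in for real code).
apps = {
    0: ["func1 for app0", "func2 for app0"],
    1: ["func1 for app1", "func2 for app1"],
    2: ["func1 for app2", "func2 for app2", "func3 for app2"],
}

# Write one {"code", "label"} object per line (jsonl format).
with open("train.jsonl", "w") as f:
    for label, funcs in apps.items():
        for code in funcs:
            f.write(json.dumps({"code": code, "label": label}) + "\n")
```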

I successfully ran train.sh, but in the eval step (eval_acc) the model reaches an accuracy of only about 0.3. When I tried my trained model on the test dataset, it reached an accuracy of about 0.7.

I am unsure whether my approach is correct or whether improvements are needed. I would deeply appreciate any feedback or suggestions you could provide.