hvdthong / DeepJIT_updated

Train and test pkl files seem to be missing the added and removed code #1

Open Manas-Embold opened 3 years ago

Manas-Embold commented 3 years ago

import pickle

data = pickle.load(open('/content/openstack_train.pkl', 'rb'))
ids, labels, msgs, codes = data

If we look at codes, it doesn't contain the actual added and removed code, only placeholder strings:

[['added code removed code', 'added code removed code', 'added code removed code', 'added code removed code', 'added code removed code', 'added code removed code', 'added code removed code'], ['added code removed code', 'added code removed code', 'added code removed code', 'added code removed code', 'added code removed code', 'added code removed code'], ['added code removed code', 'added code removed code'],
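To check whether this holds for the whole file rather than just the first few commits, the placeholder entries can be counted. The following is a minimal sketch, assuming `codes` is a list of commits, each a list of strings, and that the placeholder is exactly the string 'added code removed code' seen in the dump above. The helper name `count_placeholder_lines` is hypothetical and not part of DeepJIT.

```python
# Assumption: this is the exact placeholder string seen in the dump.
PLACEHOLDER = "added code removed code"


def count_placeholder_lines(codes, placeholder=PLACEHOLDER):
    """Return (placeholder_count, total_count) over all code lines of all commits."""
    total = 0
    placeholders = 0
    for commit in codes:          # one list of code lines per commit
        for line in commit:
            total += 1
            if line == placeholder:
                placeholders += 1
    return placeholders, total


# Tiny in-memory stand-in for the `codes` list unpacked from the pickle.
sample_codes = [
    ["added code removed code", "added code removed code"],
    ["added code removed code"],
]
n_placeholder, n_total = count_placeholder_lines(sample_codes)
print(f"{n_placeholder} of {n_total} code lines are placeholders")
```

Passing the real `codes` list from the pickle in place of `sample_codes` would show whether any commit carries actual code; if the two counts match, the file contains none.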

Manas-Embold commented 3 years ago

Prediction still runs and produces an AUC score of about 0.75:

100% 21/21 [00:01<00:00, 11.19it/s]
Test data -- AUC score: 0.7486763593579291

How is this possible without the actual code data?

Manas-Embold commented 3 years ago

If I try to run the same code on the pkl from the JIT task of cc2vec, which does contain the added and removed code, I get the following error.
Link to the data for the JIT task of cc2vec: https://zenodo.org/record/3965149#.X2VeP5MzY1J
pkl file: data/jit/openstack_train.pkl

Error:
Traceback (most recent call last):
  File "main.py", line 58, in <module>
    pad_code = padding_data(data=codes, dictionary=dict_code, params=params, type='code')
  File "/content/padding.py", line 36, in padding_data
    pad_code = padding_commit_code(data=data, max_line=params.code_line, max_length=params.code_length)
  File "/content/padding.py", line 61, in padding_commit_code
    padding_length = padding_commit_code_length(data=data, max_length=max_length)
  File "/content/padding.py", line 66, in padding_commit_code_length
    return [padding_multiple_length(lines=commit, max_length=max_length) for commit in data]
  File "/content/padding.py", line 66, in <listcomp>
    return [padding_multiple_length(lines=commit, max_length=max_length) for commit in data]
  File "/content/padding.py", line 18, in padding_multiple_length
    return [padding_length(line=l, max_length=max_length) for l in lines]
  File "/content/padding.py", line 18, in <listcomp>
    return [padding_length(line=l, max_length=max_length) for l in lines]
  File "/content/padding.py", line 21, in padding_length
    line_length = len(line.split())
AttributeError: 'dict' object has no attribute 'split'
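The traceback suggests that, in the cc2vec pickle, the items reaching `padding_length` are dicts rather than strings, so `line.split()` fails. One possible workaround is to flatten each commit into a plain list of strings before padding. This is only a sketch: it assumes, without having been verified against the dataset, that each dict maps keys such as 'added_code' and 'removed_code' to lists of code-line strings. `flatten_commit` is a hypothetical helper and not part of DeepJIT or cc2vec.

```python
def flatten_commit(commit):
    """Turn one commit's code entries into a flat list of strings.

    Strings are kept as-is. Dicts are assumed (hypothetically) to map
    names like 'added_code' / 'removed_code' to lists of line strings.
    """
    lines = []
    for entry in commit:
        if isinstance(entry, str):
            lines.append(entry)
        elif isinstance(entry, dict):
            # Sort keys so the output order is deterministic.
            for key in sorted(entry):
                value = entry[key]
                if isinstance(value, str):
                    lines.append(value)
                else:
                    lines.extend(str(v) for v in value)
        else:
            raise TypeError(f"unexpected code entry type: {type(entry).__name__}")
    return lines


# Made-up sample in the assumed format, not real dataset content.
sample_commit = [
    {"added_code": ["a = 1", "b = 2"], "removed_code": ["a = 0"]},
    "c = 3",
]
flat = flatten_commit(sample_commit)
print(flat)
```

Before relying on this, it is worth printing `type(codes[0][0])` and, if it is a dict, its keys, to confirm the real structure. Whether a model trained on data flattened this way reproduces the reported results is not established here.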

hvdthong commented 3 years ago

Many thanks for your comments. I think I uploaded the wrong dataset. Let me try to find the correct one and I will let you know. Thank you and have a great day.

Manas-Embold commented 3 years ago

That would be super helpful. Many Thanks.

Manas-Embold commented 3 years ago

Hi Thong,

Did you get a chance to find the correct dataset? Or could you guide me on how to convert openstack_train.pkl to openstack_train_DExtended.pkl?

I would be grateful for your help.

hvdthong commented 3 years ago

Hi Manas,

I'm a bit busy right now as I have some deadlines coming up. Moreover, because of COVID-19, I have to work from home. I will find time to get the data. At the moment, I don't know how to convert openstack_train.pkl to openstack_train_DExtended.pkl, as their formats are different. Thank you and have a great day.

jiojio718 commented 2 months ago

Is the correct dataset available?