Retriever output of the task code2text missing

z666pr commented 2 years ago

Dear authors,

Thanks for your great work! I have recently been working on the code-to-text task, and I noticed that the data you provide on Google Drive doesn't contain the retriever output for code to text. I can only find _python_csnet_code_text_retrieval_dedup_valid30.json and _retriever_output_codexglue_csnet_text_tocode.zip. Could you provide _retriever_output_codexglue_csnet_code_totext.zip, containing the retriever output for the code2text task on the Java and Python datasets in CSNET?

Thanks!

rizwan09 commented 2 years ago

Hi, thanks so much for your interest in REDCODER. However, I did not plan to release all the files, as together they were very large. I have also recently graduated :), so I am not sure I still have them. Did you try running the code?

z666pr commented 2 years ago

Sorry for the late reply. I ran the code to retrieve similar comments for the CSNET validation set, using retrieval_database/retrieval_database/codexglue_csnet/deduplicated.summaries.txt from the Google Drive as the ctx_file argument in step 3. But I found that deduplicated.summaries.txt includes comments from the CSNET validation set. So for this task, do I need to collect the CodeXGLUE-CSNET (train sets) + C_summarization dataset myself, or is there a file on the Google Drive that already does this? Thank you :)

rizwan09 commented 2 years ago

Hello! deduplicated.summaries.txt is the collection of CodeXGLUE-CSNET (train sets) + [C_summarization dataset].

z666pr commented 2 years ago

Hi~ I downloaded the CSNET dataset again following this and ran the code below to check whether CodeXGLUE-CSNET (valid set) is included.

import json

# Space-joined docstrings of the CodeXGLUE-CSNET Java validation set.
with open("/CSNET/dataset/java/valid.jsonl") as f:
    vjl_doc = [" ".join(json.loads(line)["docstring_tokens"]) for line in f]

# Summaries in the retrieval database, one per line (trailing newline stripped).
with open("/REDCODER/retrieval_database/codexglue_csnet/deduplicated.summaries.txt") as f:
    dsum = {line.rstrip("\n") for line in f}

# Count validation docstrings that appear verbatim in the retrieval database.
cnt = sum(doc in dsum for doc in vjl_doc)

print(cnt)

The final cnt is 5183, which means every docstring in CodeXGLUE-CSNET (valid set) can be found in deduplicated.summaries.txt. I'm confused by that.

rizwan09 commented 2 years ago

Oh, I get your point. Did you try this for the test set or other languages?

Anyway, even if the data is there, we manually filter it during data preprocessing, e.g., with or without the target summary when retrieving, before it is fed into SCODE-G, as was needed for REDCODER; see, e.g., https://github.com/rizwan09/REDCODER/blob/main/SCODE-G/text_to_code/process.py#L71

So even if it does contain them, perhaps you can do another round of preprocessing.
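A minimal sketch of what such an extra preprocessing round might look like, assuming the retrieved candidates for one example are available as a list of strings ordered by rank. This is not the logic in process.py; the function name drop_target, the whitespace normalization, and the example strings are hypothetical and only illustrate removing the target summary from the retrieved candidates before they are passed on.

```python
def drop_target(candidates, target, top_k=10):
    """Return up to top_k retrieved candidates that differ from the target.

    Hypothetical helper: compares strings after collapsing whitespace,
    and keeps the original ranking order of the remaining candidates.
    """
    def norm(s):
        return " ".join(s.split())

    target_norm = norm(target)
    kept = [c for c in candidates if norm(c) != target_norm]
    return kept[:top_k]


# Illustrative data only (not taken from the dataset).
retrieved = [
    "Returns the sum of two ints .",
    "Adds two numbers .",
    "Closes the stream .",
]
print(drop_target(retrieved, "Returns the sum of two ints .", top_k=2))
# → ['Adds two numbers .', 'Closes the stream .']
```

Exact matching after whitespace normalization is a deliberate simplification; near-duplicates of the target would still pass through.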

z666pr commented 2 years ago

Hi~ I have now filtered deduplicated.summaries.txt so that no comments from the valid/test sets remain; the result is filtered.summaries.txt. I then followed steps 2 and 3 of SCODE-R to retrieve similar comments for CodeXGLUE-CSNET (valid and test sets). But the retrieval result doesn't match the paper: Table 4 reports SCODE-R's BLEU on CodeXGLUE-CSNET-Java as 15.87, while my results are about 11.66 and 11.80. Here are the bash scripts I used for steps 2 and 3; I wonder if I did something wrong:

- Step2

DEVICES=3
NUM_DEVICES=1
CHECKPOINT=REDCODER/SCODE-R/checkpoints/SCODE_R_CODE_TEXT_JAVA.cp
CANDIDATE_FILE=REDCODER/retrieval_database/codexglue_csnet/filtered.summaries.txt
ENCODDING_CANDIDATE_PREFIX=REDCODER/SCODE-R/embedding/fst_emb
PRETRAINED_MODEL_PATH=pretrained_models/graphcodebert-base/

CUDA_VISIBLE_DEVICES=${DEVICES} python -m torch.distributed.launch \
    --nproc_per_node=${NUM_DEVICES} generate_dense_embeddings.py \
    --model_file ${CHECKPOINT} \
    --encoder_model_type hf_roberta \
    --pretrained_model_cfg ${PRETRAINED_MODEL_PATH} \
    --batch_size 512 \
    --ctx_file ${CANDIDATE_FILE} \
    --shard_id 0 \
    --num_shards 1 \
    --out_file ${ENCODDING_CANDIDATE_PREFIX} \
    --code_to_text

- Step3 (for test set)

DEVICES=3
TOP_K=10
RETRIEVAL_RESULT_FILE=REDCODER/SCODE-R/data/test_r${TOP_K}.json
CHECKPOINT=REDCODER/SCODE-R/checkpoints/SCODE_R_CODE_TEXT_JAVA.cp
CANDIDATE_FILE=REDCODER/retrieval_database/codexglue_csnet/filtered.summaries.txt
ENCODDING_CANDIDATE_PREFIX=REDCODER/SCODE-R/embedding/fst_emb_0.pkl
PRETRAINED_MODEL_PATH=pretrained_models/graphcodebert-base/
FILE_FOR_WHICH_TO_RETIRVE=/data/csnetg_retrieval_bleu_jsonl/java/test.jsonl

CUDA_VISIBLE_DEVICES=${DEVICES} python dense_retriever.py \
    --model_file ${CHECKPOINT} \
    --ctx_file ${CANDIDATE_FILE} \
    --qa_file ${FILE_FOR_WHICH_TO_RETIRVE} \
    --encoded_ctx_file ${ENCODDING_CANDIDATE_PREFIX} \
    --pretrained_model_cfg ${PRETRAINED_MODEL_PATH} \
    --out_file ${RETRIEVAL_RESULT_FILE} \
    --n_docs ${TOP_K} \
    --sequence_length 256 \
    --code_to_text \
    --save_or_load_index


Thank you!
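For reference, the filtering that produced filtered.summaries.txt can be sketched roughly as follows. It assumes the same exact-match criterion as the earlier check (space-joined docstring_tokens compared against each line of the summaries file); the helper names load_docstrings, filter_summaries, and write_filtered are hypothetical, and the paths in the usage comment are only examples based on those mentioned in this thread.

```python
import json


def load_docstrings(jsonl_path):
    """Collect space-joined docstrings from a CodeSearchNet-style .jsonl file."""
    with open(jsonl_path, encoding="utf-8") as f:
        return {
            " ".join(json.loads(line)["docstring_tokens"])
            for line in f
            if line.strip()
        }


def filter_summaries(summaries, held_out):
    """Keep summary lines that do not exactly match any held-out docstring."""
    return [s for s in summaries if s.rstrip("\n") not in held_out]


def write_filtered(src_path, dst_path, held_out_paths):
    """Write the filtered summaries to dst_path and return how many were kept."""
    held_out = set()
    for path in held_out_paths:
        held_out |= load_docstrings(path)
    with open(src_path, encoding="utf-8") as src:
        kept = filter_summaries(src, held_out)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(kept)
    return len(kept)


# Example call (paths are assumptions, adjust to the local layout):
# write_filtered(
#     "REDCODER/retrieval_database/codexglue_csnet/deduplicated.summaries.txt",
#     "REDCODER/retrieval_database/codexglue_csnet/filtered.summaries.txt",
#     ["/CSNET/dataset/java/valid.jsonl", "/CSNET/dataset/java/test.jsonl"],
# )
```

Note that dropping lines shifts the line positions of the remaining summaries, so any file that is aligned line by line with the summaries would need the same rows removed.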

rizwan09 commented 2 years ago

Closing due to inactivity.
