chenzhongwu opened 6 months ago
I am very curious about how you generated the files in the provided dataset at https://drive.google.com/file/d/1CzNlo8-e4XqrgAME5zHEWEKIQMPga0xl/view?usp=sharing. What methods did you use to process them, and from what raw datasets? Thanks!
What is the difference between text1 and text2 in train_retriever_sup_unsup.json?
In the unsupervised setting, text1 and text2 are the same. In the supervised setting, text2 is the natural language intent from CoNaLa and text1 is the description of the function that fulfills the intent.
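For concreteness, two hypothetical entries might look like this (the field names follow train_retriever_sup_unsup.json; the values are invented for illustration):

# unsupervised: the two texts are identical
unsup_example = {
    "text1": "Return a new sorted list from the items in iterable.",
    "text2": "Return a new sorted list from the items in iterable.",
}

# supervised: text1 is a function description, text2 an NL intent from CoNaLa
sup_example = {
    "text1": "pandas.read_csv: Read a comma-separated values (csv) file into a DataFrame.",
    "text2": "read a csv file into a dataframe",
}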
What methods did you use to process them from what raw datasets?
CoNaLa provides NL-code pairs. We use heuristics to extract the functions from the code and find their corresponding documents. Please see Appendix B of the paper for more detailed descriptions.
Let me know if you want to use similar pipelines to generate more NL-doc-code tuples. I can provide a more straightforward approach to generate the data.
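Not the paper's exact pipeline (Appendix B describes that), but a minimal sketch of the idea, assuming the code snippets are parseable Python: walk the AST for function calls and use each callee's docstring as its document. All function names below are mine, not from the repo.

import ast
import pydoc

def extract_call_names(code):
    """Collect the dotted names of functions called in a Python snippet."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return []
    return [ast.unparse(node.func) for node in ast.walk(tree)
            if isinstance(node, ast.Call)]

def make_nl_doc_code_tuples(intent, code):
    """Pair the NL intent and code with the docs of the functions the code calls."""
    tuples = []
    for name in extract_call_names(code):
        obj = pydoc.locate(name)  # resolves importable names, e.g. "os.path.join"
        doc = pydoc.getdoc(obj) if obj else ""
        if doc:
            tuples.append((intent, doc, code))
    return tuples

Note that pydoc.locate only resolves fully importable names, so aliased calls like np.array would be missed unless you also track the snippet's import statements.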
Thanks a lot! I have some other questions that confuse me: 1. Could you explain more about the evaluation metric character-level BLEU? 2. What is the metric for retrieval performance in Table 4? How do you evaluate whether the retrieved docs are right? Thanks again!
Character-level BLEU is calculated in this way:
from sacrebleu.metrics import BLEU

# tokenize hypotheses and references into characters
bleu = BLEU(tokenize='char')
# corpus_score takes the hypotheses and a list of reference streams
bleu_score = bleu.corpus_score(pred_list, [src_list]).score
metric_list['bleu_char'] = bleu_score
where pred_list and src_list are list[str].
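For example, with made-up inputs:

pred_list = ["df = pd.read_csv('a.csv')"]  # system outputs
src_list = ["df = pd.read_csv('a.csv')"]   # references; identical strings score 100.0

Wrapping src_list in a list is needed because corpus_score expects one list of hypotheses plus a list of reference streams.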
The code to calculate recall is here. Basically, we select the top-k candidates from the retriever and see if the ground truth is inside the top-k.
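In case it helps, a minimal sketch of that recall@k check (the function and variable names are mine, not from the repo):

def recall_at_k(ranked_lists, gold_ids, k):
    """Fraction of queries whose ground-truth doc id appears in the top-k retrieved ids."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_lists, gold_ids))
    return hits / len(gold_ids)

# e.g. recall_at_k([["d3", "d1", "d7"]], ["d1"], k=2) -> 1.0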
Could you share the code for generating the training data for SimCSE? What is the difference between text1 and text2 in train_retriever_sup_unsup.json?