microsoft / CodeT

Random subsets for reproduction #25

Closed: stovecat closed this issue 5 months ago

stovecat commented 8 months ago

Thank you for your wonderful work!

I just ran run_pipeline.py and got missing file errors:

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/random-api-completion.test.jsonl'

As far as I understand, the random- prefix denotes split subsets for evaluation, as referred to in Section 3 of the original paper:

Eventually, a total of 1600 test samples are generated for the line completion dataset.

and

From these candidates, we then randomly select 200 non-repetitive API invocations from each repository, resulting in a total of 1600 test samples for the API invocation completion dataset.
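To make the quoted procedure concrete, here is a minimal sketch of per-repository sampling. It is not the authors' script: the function name, the seed, and the dict-of-lists input layout are all assumptions for illustration. The only facts taken from the paper are 200 samples per repository and 1600 in total, which implies 8 repositories.

```python
import random


def sample_per_repo(candidates_by_repo, per_repo=200, seed=0):
    """Pick `per_repo` distinct candidates from each repository.

    `candidates_by_repo` is a hypothetical layout: repo name -> list of
    hashable candidate identifiers (e.g. (file, line) tuples).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    selected = []
    for repo in sorted(candidates_by_repo):  # stable iteration order
        # Deduplicate first so the sample is "non-repetitive".
        unique = sorted(set(candidates_by_repo[repo]))
        if len(unique) < per_repo:
            raise ValueError(f"{repo}: only {len(unique)} unique candidates")
        selected.extend((repo, c) for c in rng.sample(unique, per_repo))
    return selected


if __name__ == "__main__":
    # Toy input: 8 repositories with 300 candidates each.
    toy = {f"repo{i}": list(range(300)) for i in range(8)}
    print(len(sample_per_repo(toy)))
```

Without the original seed and candidate lists, a sketch like this cannot regenerate the exact subsets, which is why I am asking for the files themselves.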

For the purpose of reproduction, I would like to ask about the following four subsets referenced in utils.py:

class FilePathBuilder:
    api_completion_benchmark = 'datasets/random-api-completion.test.jsonl'
    random_line_completion_benchmark = 'datasets/random-line-completion.test.jsonl'
    # short version for codegen
    short_api_completion_benchmark = 'datasets/random-api-completion-short-version.test.jsonl'
    short_random_line_completion_benchmark = 'datasets/random-line-completion-short-version.test.jsonl'
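
For anyone hitting the same error, the snippet below is a small sketch that reports which of these benchmark files are absent before running the pipeline. The class here mirrors only the four attributes quoted above, and `missing_benchmarks` is a hypothetical helper, not part of the repository.

```python
import os


class FilePathBuilder:
    # Copy of the four paths quoted from utils.py.
    api_completion_benchmark = 'datasets/random-api-completion.test.jsonl'
    random_line_completion_benchmark = 'datasets/random-line-completion.test.jsonl'
    short_api_completion_benchmark = 'datasets/random-api-completion-short-version.test.jsonl'
    short_random_line_completion_benchmark = 'datasets/random-line-completion-short-version.test.jsonl'


def missing_benchmarks(builder=FilePathBuilder, root='.'):
    """Return the names of benchmark attributes whose files do not exist under `root`."""
    paths = {
        name: value
        for name, value in vars(builder).items()
        if name.endswith('_benchmark')
    }
    return sorted(
        name for name, path in paths.items()
        if not os.path.exists(os.path.join(root, path))
    )


if __name__ == "__main__":
    for name in missing_benchmarks():
        print(f"missing: {name} -> {getattr(FilePathBuilder, name)}")
```

Run from the repository root, it lists every benchmark attribute whose file is not present in datasets/.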

zfj1998 commented 5 months ago

Please refer to the pull request. For permission reasons, I cannot merge the PR. https://github.com/microsoft/CodeT/pull/20