github / CodeSearchNet

Datasets, tools, and benchmarks for representation learning of code.
https://arxiv.org/abs/1909.09436
MIT License
2.18k stars, 385 forks

How to deconstruct code into tokens to extract functions and comments? #234

Closed. skye95git closed this issue 3 years ago

skye95git commented 3 years ago

I want to build a code search corpus. I have collected a lot of GitHub repositories, and now I need to break the code into tokens so that I can extract functions and their comments. In the paper "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" you write: "We then tokenize all Go, Java, JavaScript, Python, PHP and Ruby functions (or methods) using TreeSitter — GitHub’s universal parser — and, where available, their respective documentation text using a heuristic regular expression."

I can already extract functions from Python files, but the result does not include their comments. How do you extract functions together with their comments? Could you share your code?

mallamanis commented 3 years ago

You can find all our parsing code here.

skye95git commented 3 years ago

> You can find all our parsing code here.

Thank you for your reply! I have tried the function parser in the CodeSearchNet/function_parser/ folder, but I ran into some problems:

  1. What is the input? In the example, the input is the library keras-team/keras. Does that refer to https://github.com/keras-team/keras? That is a whole repository, so does each input correspond to one repository?

  2. What is the output? The output in the examples is

{
  'nwo': 'keras-team/keras',
  'sha': '0fc33feb5f4efe3bb823c57a8390f52932a966ab',
  'path': 'keras/layers/core.py',
  'language': 'python',
  'identifier': 'Activation.__init__',
  'parameters': '(self, activation, **kwargs)',
  'argument_list': '',
  'return_statement': '',
  'docstring': '',
  'function': 'def __init__(self, activation, **kwargs):\n super(Activation, self).__init__(**kwargs)\n self.supports_masking = True\n self.activation = activations.get(activation)',
  'url': 'https://github.com/keras-team/keras/blob/0fc33feb5f4efe3bb823c57a8390f52932a966ab/keras/layers/core.py#L294-L297'
}

The path points to a single file, core.py, inside the library keras-team/keras. How do I choose which file gets parsed? Is it done by setting dependee = "keras-team/keras"?
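
To make the relationship between one repository and many per-function records concrete, here is a rough sketch of how records shaped like the example above could be produced from a local checkout. It is not the function_parser implementation: it assumes the repository has already been cloned, that the caller supplies `nwo` (owner/name) and the commit `sha`, and it again uses Python's `ast` module rather than TreeSitter. The names `build_records` and `parse_checkout` are hypothetical, only a subset of the fields is filled in, and `identifier` is the bare function name rather than a qualified one like `Activation.__init__`.

```python
import ast
from pathlib import Path


def build_records(nwo: str, sha: str, path: str, source: str):
    """Build one dict per function found in a single Python file (sketch)."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "nwo": nwo,
                "sha": sha,
                "path": path,
                "language": "python",
                "identifier": node.name,  # unqualified, unlike the example
                "docstring": ast.get_docstring(node) or "",
                "function": ast.get_source_segment(source, node),
                # Same URL layout as the example record: blob/<sha>/<path>#L<start>-L<end>
                "url": f"https://github.com/{nwo}/blob/{sha}/{path}"
                       f"#L{node.lineno}-L{node.end_lineno}",
            })
    return records


def parse_checkout(repo_dir: str, nwo: str, sha: str):
    """Walk every .py file under repo_dir and collect records for all of them."""
    root = Path(repo_dir)
    records = []
    for file in sorted(root.rglob("*.py")):
        try:
            source = file.read_text(encoding="utf-8")
            rel_path = file.relative_to(root).as_posix()
            records.extend(build_records(nwo, sha, rel_path, source))
        except (SyntaxError, UnicodeDecodeError, ValueError):
            # Skip files that cannot be decoded or parsed.
            continue
    return records
```

Under these assumptions the input is one repository, every matching file in it is parsed, and `path` simply records which file a given function came from.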