src-d / ml-core

source{d} MLonCode foundation - core algorithms and models.

How to extract UAST's path contexts from code? #35

Closed alonsopg closed 5 years ago

alonsopg commented 5 years ago

Given a python piece of code:

print("hello")

How can I extract its associated UAST context paths? I explored the sourced library to see whether it has a function for extracting the paths of a piece of code:

In:

from sourced.ml.core.extractors import bags_extractor
bags_extractor.Extractor?

Out:

Init signature: bags_extractor.Extractor(log_level=20)
Docstring:     
Converts a single UAST via `algorithm` to anything you need.
It is a wrapper to use in `Uast2Features` Transformer in a pipeline.
Init docstring:
Class constructor
:param log_level: logging level.
File:           ~/anaconda3/envs/sourced/lib/python3.6/site-packages/sourced_ml_core-0.0.3-py3.6.egg/sourced/ml/core/extractors/bags_extractor.py
Type:           type
Subclasses:     BagsExtractor, RoleIdsExtractor

and

In:

from sourced.ml.core.extractors import Extractor
Extractor.extract?

Out:

Signature: Extractor.extract(self, uast:bblfsh.node.Node)
Docstring: <no docstring>
File:      ~/anaconda3/envs/sourced/lib/python3.6/site-packages/sourced_ml_core-0.0.3-py3.6.egg/sourced/ml/core/extractors/bags_extractor.py
Type:      function

Also, I tried to:

from sourced.ml.core.utils import bblfsh
bblfsh.BblfshClient.parse(filename='/home/user/Downloads/script.py')

But I got:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-9dbc87e2a353> in <module>
      1 code = '''print("hi")'''
----> 2 bblfsh.BblfshClient.parse(filename='/home/user/Downloads/script.py')

TypeError: parse() missing 1 required positional argument: 'self'

After checking the available modules, I did not find a function for extracting the UAST path contexts. Is it possible to do this with sourced? Also, I did not find any documentation (maybe someone can point me to the docs?).

EgorBu commented 5 years ago

Hello @alonsopg,

Long story short - you need to instantiate BblfshClient:

from sourced.ml.core.utils import bblfsh
bblfsh.BblfshClient("0.0.0.0:9432").parse(filename='/home/user/Downloads/script.py')

You just need to make sure that the bblfsh server is running. If something goes wrong, please follow these steps:

1) Launch the bblfsh server - you may find a description here.
2) Install drivers for languages: docker exec -it bblfshd bblfshctl driver install --recommended (or install only the languages you need).
3) Install the bblfsh client if it's not installed yet.
4) Parse the code as I mentioned above.
5) Enjoy :smile:

Please let us know if you still have problems.
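For reference, the TypeError in the question is ordinary Python behaviour: parse is an instance method, so calling it on the class leaves self unfilled. The sketch below reproduces this with a hypothetical FakeClient class (an assumption for illustration only; it is not the real BblfshClient and does not talk to a server):

```python
class FakeClient:
    """Hypothetical stand-in for a client class; NOT the real BblfshClient."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def parse(self, filename):
        # A real client would send the file to a server; this one only echoes.
        return f"parsed {filename} via {self.endpoint}"


def call_on_class():
    # Calling the method on the class itself leaves `self` unfilled.
    try:
        FakeClient.parse(filename="script.py")
    except TypeError as err:
        return str(err)
    return None


def call_on_instance():
    # Instantiating first binds `self`, so only `filename` has to be passed.
    return FakeClient("0.0.0.0:9432").parse(filename="script.py")
```

call_on_class() returns the "missing 1 required positional argument: 'self'" message, while call_on_instance() goes through, which mirrors the fix of creating the client with an endpoint before calling parse.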

alonsopg commented 5 years ago

Thanks for the help, and sorry for the long story! However, when I tried that, my Jupyter kernel died with the message: The kernel appears to have died. It will restart automatically. Is that normal? I also tried it in the terminal, and I am getting this output, which is just a dictionary with the UAST tags. Is there any way of getting flat sequences of context paths?

EgorBu commented 5 years ago

Can you explain what you mean by context paths? Can you give an example of what you want to get?

About "The kernel appears to have died. It will restart automatically": can you open an issue at https://github.com/bblfsh/python-client/issues with details about your system? I can launch the bblfsh client from Jupyter Notebook without any problems.

alonsopg commented 5 years ago

Thanks for the response! By context paths I mean the node-to-node walks across the tree. For example, graphically like this:

[Image: context path]

For example, some of the context paths for the above tree would be:

"print", "expr", "call", "expr", "binop", "num", "4"
"print", "expr", "call", "expr", "binop", "mult"
"print", "expr", "call", "expr", "binop", "num", "3"

Is there any way of extracting these paths with ml-core?

EgorBu commented 5 years ago

As I understand, it's from the code2vec article - is that correct?

You may find some related work here - it has a function get_paths that should do this.
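To make the idea concrete, here is a minimal sketch of the root-to-leaf walks from the example above. It is not the get_paths function from the linked work and it does not operate on a real bblfsh UAST: the Node class, the root_to_leaf_paths helper, and the hand-built tree are all assumptions for illustration. As far as I understand, code2vec itself uses leaf-to-leaf paths, which this sketch does not cover.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node; a real UAST node carries more information."""
    label: str
    children: List["Node"] = field(default_factory=list)


def root_to_leaf_paths(node: Node, prefix: Tuple[str, ...] = ()) -> Iterator[List[str]]:
    """Yield the labels along every walk from the root down to a leaf."""
    path = prefix + (node.label,)
    if not node.children:
        # A leaf closes the current walk.
        yield list(path)
        return
    for child in node.children:
        yield from root_to_leaf_paths(child, path)


# Hand-built tree mirroring the example paths listed in the question.
binop = Node("binop", [
    Node("num", [Node("4")]),
    Node("mult"),
    Node("num", [Node("3")]),
])
tree = Node("print", [Node("expr", [Node("call", [Node("expr", [binop])])])])

paths = list(root_to_leaf_paths(tree))
for p in paths:
    print(p)
```

With a real UAST, the same depth-first traversal would apply, but the way node labels and children are read depends on the bblfsh client version in use.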

alonsopg commented 5 years ago

Yes, I am trying to use the code2vec model. I tried that; however, I thought sourced implemented the same models or some wrapper. Should I ask questions related to code2vec and the paths function here or in the code2vec repository?

EgorBu commented 5 years ago

TBH, code2vec is out of scope for the ml-core repository. It's better to ask in https://github.com/src-d/code2vec, though it's more dead than alive. Can you tell us the scope/area of interest of your experiments? Reproducing code2vec? There is an official repo, btw.

alonsopg commented 5 years ago

I am using it for code obfuscation. Thanks for the help! I will check the other repo!