src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

[idea] Generic code tokenizer #44

Open EgorBu opened 5 years ago

EgorBu commented 5 years ago

In many tasks, feature extraction for source code relies heavily on tokenizing the code and on its structural information. If we want to use the suggestion feature at GitHub, we must work with tokenized code. Tokenization is very important for everybody in the MLonCode area, yet it is still quite complicated to do. Proposal: extend the bblfsh client or create a new module that many different projects could use.

TLDR: information required by feature extractor

Many researchers in this area could use such a module. Related issue: https://github.com/bblfsh/bblfshd/issues/231
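To make the proposal concrete, here is a minimal sketch of the kind of output such a tokenizer might produce: every token carries its kind, its text, and its character offsets, and the token stream is lossless, so the original file can be rebuilt from it. This is only an illustration under assumptions. `Token` and `tokenize` are hypothetical names, not part of the bblfsh client, and the regex-based lexer is a stand-in that knows nothing about any real language grammar or UAST structure.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical token record; not a bblfsh type."""
    kind: str   # whitespace, identifier, number, string or operator
    text: str   # exact source text of the token
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


# Alternatives are tried in order, so string literals are tested
# before the single-character "operator" fallback.
_TOKEN_RE = re.compile(r"""
    (?P<whitespace>\s+)
  | (?P<identifier>[^\W\d]\w*)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<operator>[^\w\s])
""", re.VERBOSE)


def tokenize(code: str) -> Iterator[Token]:
    """Yield tokens covering every character of `code`, in order."""
    for match in _TOKEN_RE.finditer(code):
        yield Token(match.lastgroup, match.group(),
                    match.start(), match.end())


if __name__ == "__main__":
    for token in tokenize('x = foo(42, "a b")'):
        if token.kind != "whitespace":
            print(token)
```

Keeping whitespace tokens and offsets means a feature extractor can map any prediction back to an exact position in the file. A real implementation would presumably also attach structural information (for example, the enclosing syntax-tree node) to each token, which this sketch does not attempt.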

zurk commented 5 years ago

Right now we have a tokenizer for JS in style-analyzer: https://github.com/src-d/style-analyzer/blob/a0eaafd5b371433e3c2e3dc9d113710814912f99/lookout/style/format/feature_extractor.py#L655-L656

If we start this project, this code should be considered as a starting point.

m09 commented 5 years ago

Should this be transferred to src-d/feature-idea? @vmarkovtsev @EgorBu

EgorBu commented 5 years ago

Good idea, @m09

m09 commented 5 years ago

Somehow I cannot transfer it: GitHub does not find the feature-idea repo :confused:

Edit: it seems we need someone who is an admin of both ml-backlog and feature-idea to transfer the issue.

m09 commented 5 years ago

Calling @smola to the rescue to transfer the issue to feature-idea :)