Slides are online.
OSS tools covered:
Machine Learning on Source Code (MLonCode) is an emerging research domain at the intersection of the deep learning, natural language processing, software engineering, and programming language communities.
During this 3h30 workshop, we will review recent Software Engineering tasks that benefit from applying Machine Learning, with a focus on hands-on experience in:
- extracting data from real source code
- developing multiple Machine Learning models for one particular task: source code summarization (function name suggestion).
By the end of the workshop, participants will have built two working models on a real dataset, producing near state-of-the-art results. They will acquire practical skills in extracting information from source code and in modelling different aspects of it.
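A minimal sketch of what "extracting information from source code" can look like for identifiers, assuming a regular expression is an acceptable stand-in for a real parser. The workshop itself uses bblfsh and gitbase for this step; the helper names `extract_identifiers` and `split_identifier`, the partial keyword list, and the Java snippet are illustrative assumptions, not workshop code.

```python
import re

# Assumption: a deliberately partial Java keyword list, enough for the snippet below.
JAVA_KEYWORDS = frozenset({
    "public", "private", "static", "int", "void", "for", "if", "else",
    "return", "new", "class", "while",
})

IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Matches acronym runs ("HTTP"), capitalized or lowercase words, and digit runs.
SUBTOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

JAVA_SNIPPET = """
public int countOpenFiles(List<File> fileList) {
    int openCount = 0;
    for (File f : fileList) { if (f.isOpen()) openCount++; }
    return openCount;
}
"""


def extract_identifiers(source):
    """Return identifier-like tokens in order, skipping keywords.

    Unlike a real parser, this also matches words inside strings and comments.
    """
    return [tok for tok in IDENTIFIER_RE.findall(source) if tok not in JAVA_KEYWORDS]


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    return [part.lower() for part in SUBTOKEN_RE.findall(identifier)]


identifiers = extract_identifiers(JAVA_SNIPPET)
subtokens = [sub for ident in identifiers for sub in split_identifier(ident)]
print(identifiers)
print(subtokens)
```

Splitting identifiers into subtokens keeps the vocabulary small, which matters for both models built later in the workshop.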
Prerequisites: familiarity with the basics of Deep Learning and a laptop with Docker installed.
Import Docker images (works offline):
docker load -i images/jupyter.tgz
docker load -i images/gitbase.tgz
docker load -i images/bblfshd-with-drivers.tgz
docker images
Run bblfsh:
docker run \
--detach \
--rm \
--name amld_bblfshd \
--privileged \
--publish 9432:9432 \
bblfsh/bblfshd:v2.15.0-drivers \
--log-level DEBUG
Run gitbase:
docker run \
--detach \
--rm \
--name amld_gitbase \
--publish 3306:3306 \
--link amld_bblfshd:amld_bblfshd \
--env BBLFSH_ENDPOINT=amld_bblfshd:9432 \
--env MAX_MEMORY=1024 \
--volume "$(pwd)/repos/git-data:/opt/repos" \
srcd/gitbase:v0.24.0-rc2
Run the jupyter image:
docker run \
--rm \
--name amld_jupyter \
--publish 8888:8888 \
--link amld_bblfshd:amld_bblfshd \
--link amld_gitbase:amld_gitbase \
--volume "$(pwd)/notebooks:/amld/notebooks" \
--volume "$(pwd)/repos:/amld/repos" \
mloncode/amld
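Once the three containers are started, it can be handy to confirm that the published ports are reachable from the host. The sketch below is one possible check, assuming the port numbers from the `--publish` flags above and that the containers are exposed on `127.0.0.1`; the names `SERVICES`, `is_port_open`, and `check_services` are hypothetical helpers, not part of the workshop material.

```python
import socket

# Ports taken from the --publish flags of the docker run commands above.
SERVICES = {"bblfshd": 9432, "gitbase": 3306, "jupyter": 8888}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="127.0.0.1", services=SERVICES):
    """Map each service name to whether its port accepts TCP connections."""
    return {name: is_port_open(host, port) for name, port in services.items()}


for name, reachable in check_services().items():
    print(f"{name}: {'reachable' if reachable else 'NOT reachable'}")
```

An open port only shows that something is listening; it does not prove that the service behind it is healthy.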
We are going to use the top 50 repositories from the Apache Software Foundation throughout this workshop.
Notebook 1: data collection pipeline (example)
Build a vector model for projects and developers using Topic Modelling of code identifiers.
Notebook 2: project and developer similarities (example)
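The idea of comparing projects through vectors built from their identifiers can be sketched without a topic model. The example below uses plain bag-of-identifier counts and cosine similarity, a deliberately simpler technique than the topic modelling used in the notebook. The project names and token lists are made-up toy data, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math
from collections import Counter

# Toy data (assumption): identifier subtokens per project.
PROJECT_TOKENS = {
    "http_server": ["request", "response", "http", "handler", "socket", "request"],
    "http_client": ["request", "response", "http", "url", "socket"],
    "matrix_lib": ["matrix", "vector", "dot", "transpose", "matrix"],
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Rank all other projects by similarity to `name`, most similar first."""
    scores = [
        (other, cosine_similarity(vectors[name], vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


vectors = {name: Counter(tokens) for name, tokens in PROJECT_TOKENS.items()}
for other, score in most_similar("http_server", vectors):
    print(f"{other}: {score:.2f}")
```

A topic model replaces the raw counts with a much shorter vector of topic weights per project or developer, but the similarity lookup on top of those vectors can work the same way.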
Train an NMT (neural machine translation) seq2seq model for predicting method names from the identifiers in method bodies.
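To make the task concrete, here is a sketch of how a single training pair for such a model might be formed: the source sequence comes from the subtokens of identifiers in the method body, and the target sequence from the subtokens of the method name. This is an assumption about the data layout, and the notebook's actual preprocessing may differ; `make_training_pair` and the example method are hypothetical.

```python
import re

# Matches acronym runs, capitalized or lowercase words, and digit runs.
SUBTOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def subtokens(identifier):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    return [part.lower() for part in SUBTOKEN_RE.findall(identifier)]


def make_training_pair(method_name, body_identifiers):
    """Build (source, target) token lists for one method.

    The method's own name is dropped from the body (e.g. recursive calls),
    so the target does not leak into the source.
    """
    source = [
        sub
        for ident in body_identifiers
        if ident != method_name
        for sub in subtokens(ident)
    ]
    target = subtokens(method_name)
    return source, target


source, target = make_training_pair(
    "countOpenFiles",
    ["fileList", "openCount", "f", "isOpen", "countOpenFiles"],
)
print(" ".join(source))
print(" ".join(target))
```

Writing each side as a space-separated line gives the parallel "source sentence / target sentence" format that translation-style models are commonly trained on.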