kiteco / kiteco-public

Primary Kite repo — private bits replaced with XXXXXXX
BSD 3-Clause "New" or "Revised" License

This is a public version of the main Kite repo

The main Kite repo (originally kiteco/kiteco) was intended for private use. It has been lightly adapted for publication here by replacing private information with XXXXXXX. As a result many components here may not work out of the box.

Intro to parts of the codebase

Q: How did we analyze all the code on Github?

We used a variety of infrastructure, on a mix of cloud platforms depending on what was the most economical, though it was mostly on AWS.

We mostly used map-reduce to manage workflows that needed to run over large datasets. You can see a list of some of our map-reduce jobs here (local-pipelines) and here (emr-pipelines). I believe tasks in local-pipelines are intended to be run on single instances, whereas the emr-pipelines tasks run on EMR, AWS's map-reduce infrastructure.

Here are some example tasks, with a particular focus on Python analysis:

Several return type sources are unified in this command.

Much of this pipeline seems to be orchestrated through this Makefile. It is loosely documented here.

This pipeline results in a number of files per package::version, with the following elements:

I'm attaching (see the readme_assets folder) the final resource build for numpy here as "resource-manager-numpy.zip". You can download the 800MB zip file with all the Python open-source packages here.

The bullet list of resources above is from the code here. You can "find references" to see how these files get loaded from disk. In the Kite client the resource manager's main entry point is here. Note this class includes code for dynamically loading and unloading packages' data into memory to conserve end-user memory.

By the way, we are happy to share any of our open-source-derived data. Our GitHub crawl is about 20 TB, but for the most part the intermediate and final pipeline outputs are reasonably sized. Please let me know soon if you want anything, though, because we will likely end up archiving all of this.

To reiterate, we invested a few million dollars into our Python tech, so you should find it to be pretty robust and high quality. That is why I'm doing some moonlight work to try to keep it from getting lost.

Q: Is this infrastructure incremental?

Generally, no. Fortunately it didn't really need to be. I can't recall how long it took to run the full Python analysis end to end; it was more than a day but, I think, less than a week.

Q: How often did you re-run data collection and analysis of GitHub code?

We ran several GitHub crawls throughout our time. I think there were about 4 successive crawls over a roughly 7-year period. Things do change, but not super frequently. The other Python package exploration is much cheaper to run, so we ran it more often.

Q: How do you deploy your ML models?

Here are some highlights:

Q: How did you measure the quality of your models?

I'm not sure I can shed much light here, but here's a rough pass:

In terms of the infrastructure and code:

Btw we also trained a simple model to mix lexical/GPT-2 and other completions. (short product spec attached as "Product Spec_ Multi-provider Completions.pdf")

(Bonus: I'm attaching (see the readme_assets folder) our product spec for lexical completions here as "Product Spec_ Lexical Completions.pdf")

Q: Did you implement your own parsers or reuse existing ones?

We implemented our own Python parser in Golang. It is robust to syntax errors, e.g. it can parse partial function calls. It can be found here.

We also did some parser / formatter work with JavaScript, but did not finish it. We ended up using Tree-sitter for some things after it came out.

Q: Could you do code linting and refactorings, given that the data about API usages you collect is never complete?

We did not try to do this very much. We did some experimentation with linting, but to your point, having a noisy linter can be worse than having no linter at all. I think it's harder to use ML for linting than for completions or other use cases for this reason.

Q: Did you try to pivot to other usages of ML code analysis like automatic code reviews, security checks, etc?

Yes, we did some experimentation on a number of different ideas in late 2020 / early 2021.

Synthesizing status summaries: From an ML perspective, the idea is to use GitHub PR titles to train a model that can generate "PR titles" from code changes, making it easier for developers to share descriptions of the work they've been doing.


[Originally for Kite employees] Getting started with the codebase

Our codebase is primarily located at github.com/kiteco/kiteco (http://github.com/kiteco/kiteco). There are a few auxiliary repositories that host very experimental code, but the goal is to make the “kiteco” repository the single source of truth for all of our services.

Summary (TL;DR)

Git LFS

We use Git LFS to store our various bindata.go files. You will need to install the command line tool to get the contents of those files when you pull the repository. Installation instructions are on the Git LFS website, but on macOS you can install it by running the following from inside the kiteco repository:

brew update
brew install git-lfs
git lfs install

Then do a git pull to get the bindata.go files. If they do not download from LFS, try running git lfs pull (you should only need to do this once - subsequent git pulls should update the bindata correctly).

Optional: Improving Performance

git lfs install installs a smudge filter that automatically downloads and replaces newly checked out "pointer files" with their content. By default, smudge filters operate on checked-out blobs in sequence, so they cannot download in batch as git lfs pull typically does. Git checkouts also block on downloading the new LFS files by default, which can be annoying. You might prefer to disable the smudge filter (this can be run even if you've already run the regular git lfs install):

git lfs install --skip-smudge
git lfs pull

Then, when building after a new checkout, you may see an error of the form "expected package got ident." This occurs because the Go compiler reads some Go files and sees the Git LFS pointers instead of the actual data. At this point, you can download the latest files with git lfs pull, and rebuilding should work.
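If you are unsure whether a given file is still an unresolved pointer, you can inspect its first line. The sketch below assumes the standard Git LFS pointer format, in which the first line starts with `version https://git-lfs.github.com/spec/v1`. The helper name `is_lfs_pointer` and the path in the usage comment are made up for illustration and are not part of the repo.

```shell
# is_lfs_pointer FILE: succeed if FILE looks like a Git LFS pointer
# rather than real content. Hypothetical helper, not part of the repo.
is_lfs_pointer() {
    # Pointer files are small text files whose first line names the LFS spec.
    head -n 1 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Example usage (the exact path is an assumption):
# if is_lfs_pointer kite-go/client/datadeps/bindata.go; then
#     git lfs pull
# fi
```

A missing or unreadable file is treated as "not a pointer", so the helper is safe to call before the data has been generated at all.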

Nothing needs to be done when pushing LFS blobs. That will still happen automatically.

Go

The bulk of our code is currently in Go. This can be found at github.com/kiteco/kiteco/kite-go (http://github.com/kiteco/kiteco/kite-go). To get started working in this part of the codebase, first make sure you have your Go environment set up correctly (i.e., Go is installed, $GOPATH is set, etc.).

Locally, you will need to install Go 1.15.3. The following steps will get you going.

Set $GOPATH in your .profile / .bashrc / .bash_profile / .zshrc, e.g.:

export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin

Make sure to create these directories as well:

# -p creates parent directories and does not fail if they already exist
mkdir -p "$HOME/go/src" "$HOME/go/bin" "$HOME/go/pkg"

If you are on a Mac and set the above in either .bashrc or .zshrc, make sure to load it in either your .profile or .bash_profile. See this for an explanation.

It would be useful to become familiar with how Go code is organized. Check out https://golang.org/doc/code.html for more on this topic.

Navigate to where the kiteco repo will live in your GOPATH, and clone the repo.

# Create kiteco directory within GOPATH, and clone the repo there
mkdir -p ~/go/src/github.com/kiteco
cd ~/go/src/github.com/kiteco
git clone git@github.com:kiteco/kiteco

To install the latest version of Go that's compatible with our codebase, run:

cd ~/go/src/github.com/kiteco/kiteco
cd devops/scripts
./install-golang.sh
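
After the script finishes, it can be worth confirming that the toolchain on your PATH is the expected one. The following is a minimal sketch rather than something taken from the repo: `check_go_version` is a hypothetical helper that compares the output of `go version` with the 1.15.3 release mentioned earlier.

```shell
# check_go_version WANT: succeed if `go version` reports exactly WANT.
# Hypothetical helper for illustration; not part of the kiteco repo.
check_go_version() {
    want="$1"
    # `go version` prints e.g. "go version go1.15.3 darwin/amd64";
    # the third field is the release tag.
    have=$(go version | awk '{print $3}')
    if [ "$have" = "go$want" ]; then
        echo "ok: found $have"
    else
        echo "mismatch: want go$want, found ${have:-nothing}" >&2
        return 1
    fi
}

# Example usage:
# check_go_version 1.15.3
```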

From here, just run make install-deps from the root of the kiteco repo to get basic utilities installed.

# Install dependencies
make install-deps

Use ./scripts/update-golang-version.sh if you'd like to make Kite require a newer version of Golang.

TensorFlow

For development builds (see below), you may need to have TensorFlow installed globally on your system.

make install-libtensorflow

Building Kite

You're now ready to build Kite! First, build the sidebar for your platform:

./osx/build_electron.sh force
# ./linux/build_electron.sh force
# ./windows/build_electron.sh force

This build is separate from the Kite daemon build, so you must manually rebuild the sidebar as needed.

Now build and run Kite:

make run-standalone

Note that this is not a full Kite build, but is the recommended approach for development, as it is much faster. Some functionality is disabled in the development build (depending on the platform):

Development

You should be able to develop, build, and test Kite entirely on your local machine. However, we do have cloud instances and VMs available for running larger jobs and for testing our cloud services.

Dependency Management with Go Modules

We use the Go Modules system for dependency management.

General tips:

To add or update a dependency, all you need to do is go get it, which will automatically update the go.mod and go.sum files. To remove a dependency, remove references to it in the code and run go mod tidy. In general, run go mod tidy before committing any dependency changes so that all new dependencies have been added and unused ones have been removed.
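
The workflow described here can be wrapped in a small helper. This is only a sketch: `update_dep` is a hypothetical name, the module path in the usage comment is a placeholder, and the final `go mod verify` step is an extra check added for illustration rather than something this guide prescribes.

```shell
# update_dep MODULE[@VERSION]: add or update one dependency, then tidy.
# Hypothetical helper for illustration; not part of the kiteco repo.
# Run it from the directory that contains go.mod.
update_dep() {
    go get "$1" &&      # updates go.mod and go.sum
    go mod tidy &&      # adds missing and drops unused requirements
    go mod verify       # extra: confirms cached modules are unmodified
}

# Example usage (placeholder module path):
# update_dep example.com/some/module@v1.2.3
```

Because the steps are chained with `&&`, a failed `go get` stops the helper before `go mod tidy` can touch go.mod.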

The process for updating a dependency is:

The process for adding a dependency is:

HTTPS Auth

godep may attempt to clone private repositories via HTTPS, requiring manual authentication. Instead, you can add the following section to your ~/.gitconfig in order to force SSH authentication:

[url "git@github.com:"]
    insteadOf = https://github.com/

Datasets, Datadeps

We bundle a lot of pre-computed datasets & machine learning models into the Kite app through the use of a custom filemap & encoding on top of go-bindata. The data, located in kite-go/client/datadeps, is kept in Git-LFS.

All needed data files are first stored on S3. There are pointers at various places in our codebase to S3 URIs. After updating references to these datasets, the datadeps file must be manually rebuilt:

./scripts/build_datadeps.sh

This will bundle all data that is loaded at Kite initialization time. You must ensure the needed data is loaded at initialization, otherwise it will not be included!

Logs

Some logs are displayed in Xcode, but most are written to a log file:

tail -F ~/.kite/logs/client.log

Testing and Continuous Integration

Your Go code must pass several quality checks before it is allowed into the master branch. Travis CI (https://travis-ci.org/) acts as the gatekeeper between pull requests and merging. To speed up the process, you can test your code before pushing to a pull request by navigating to the kite-go directory and running the make targets directly (any of make (fmt|lint|vet|bin-check|build|test)).
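
To run all of these checks in one pass before pushing, a loop like the one below can help. It is a sketch that assumes the make targets listed here exist in the kite-go Makefile and that this order is acceptable; `run_checks` is a hypothetical name, not a script in the repo.

```shell
# run_checks: run each quality target in turn, stopping at the first failure.
# Hypothetical helper; run it from the kite-go directory.
run_checks() {
    for target in fmt lint vet bin-check build test; do
        echo "==> make $target"
        make "$target" || {
            echo "make $target failed" >&2
            return 1
        }
    done
    echo "all checks passed"
}
```

Stopping at the first failure keeps the output short, since later targets such as build and test tend to be the slowest.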

VPN Access

You will need access to our VPN to connect to our backend hosts.

SSH Access

Kite's Dropbox has SSH credentials for all the machines on AWS and Azure under Shared > Engineering > keys > kite-dev.pem and Shared > Engineering > keys > kite-dev-azure. Place both of these in your .ssh directory, e.g. ~/.ssh/kite-dev.pem. As a convenience, you should add the following to your ~/.ssh/config:

Host *.kite.com
    ForwardAgent yes
    IdentityFile ~/.ssh/kite-dev.pem
    User ubuntu

# Test instances are on Azure
Host test-*.kite.com
    User ubuntu
    IdentityFile ~/.ssh/kite-dev-azure

Don't forget to set restrictive permissions on the credential files (e.g. 600, so that only your user can read them).
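
A short sketch of the permission step, assuming both key files were copied to ~/.ssh under the names used in this section. OpenSSH refuses private keys that other users can access, and mode 600 (owner read/write only) is the conventional choice; `secure_keys` is a hypothetical helper name.

```shell
# secure_keys FILE...: make each credential file readable and writable
# by its owner only. Hypothetical helper for illustration.
secure_keys() {
    for key in "$@"; do
        chmod 600 "$key" || return 1
    done
}

# Example usage, with the file names from this section:
# secure_keys ~/.ssh/kite-dev.pem ~/.ssh/kite-dev-azure
```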