kamahen / pykythe

Generate Python source code cross-reference facts in Kythe format

Python indexer (for Kythe)

The pykythe package provides code for preprocessing Python source code into a representation for easy entity cross-referencing, using entities in the style of the Kythe indexing schema.

License

Apache 2.0

Warning, Avis, Achtung, ご注意

This code is pre-alpha and is intended for collaboration with other people, to create an industrial-strength indexer for Python. The author intends to make significant changes and improvements to the code, so if you want to work on it, please contact peter.ludemann@gmail.com first.

Debugging note

Sometimes pykythe.qlf can output an "unexpectedly failed" error message without enough context to figure out where the failure happened. To track this down, run the command again, but with /tmp/pykythe_test/pykythe.qlf replaced by swipl -l pykythe/pykythe.pl; then run these two commands:

debug.
pykythe:pykythe_main2.

Browsing

Experimental source browser

The browser directory contains code for browsing the source with hyperlinks. (See "Browsing with the Kythe browser" below.)

The code in ast_color.py creates "color" facts that have colorization information for the source (e.g., token, string, whitespace, comment). The Kythe schema is augmented by /pykythe/color/... facts.

A simple server loads the Kythe facts and makes them available to the front-end using JavaScript fetch.

You can process the test files and run the browser with:

make SRC_BROWSER_PORT=9999 test make-json run-src-browser

Browsing with underhood

Pykythe works with underhood to allow source code browsing. A rough outline of how to set up the servers is in the Makefile rule run-underhood-all (after running add-index-pykythe or make-tables to set up the serving tables).

Browsing with the Kythe browser

The Kythe browser is obsolete and unsupported by the Kythe team. The "underhood" browser uses the Kythe server (which is still supported) with a new front-end. Alternatively, an experimental source browser is included in pykythe.

Competition

Pytype has an experimental indexer, which runs at least 10x slower and crashes on some inputs. (Pykythe does about 100-200 files per minute on a 4 CPU workstation and can process the entire Python3 library without crashing.) Pytype also requires a system such as Bazel with associated BUILD files, or ninja, to run in parallel.

Run demo

Installation

There is no installation script, because this code is pre-alpha. If you run scripts/demo.sh, then most of the prerequisites are loaded (with the exception of the Kythe code).

The following probably has some wrong information about git and submodules. Finding accurate up-to-date documentation is nearly impossible and I really don't feel like becoming a git guru.

Coding conventions

Code formatting

All the Python code is formatted using yapf, configured with .style.yapf. You can install it either from GitHub or with pip. The Makefile has a rule pyformat that formats everything.

Prolog code is formatted according to the recommendations in plcoding.pdf with an extension for EDCGs that shows the accumulators.

If you use Emacs, I recommend using prolog.el from https://bruda.ca/emacs/prolog_mode_for_emacs (https://bruda.ca/_media/emacs/prolog.el), customized with prolog-align-small-comments-flag set to nil (a copy is in the emacs directory). See also https://www.swi-prolog.org/FAQ/GnuEmacs.html

Type declarations

The code is processed with mypy (using the Makefile rule mypy) and pylint. It is intended to also be processed by pytype.

Implementation notes

Grammar

Pykythe depends on the deprecated lib2to3 parser and on the details of its Grammar.txt file.

This requires Python version 3.11.2.

If you can't install it, you can build it into $HOME/.local/bin/python3.11 as follows:

git clone --depth=1 --branch v3.11.2 git@github.com:python/cpython.git
cd cpython
./configure --prefix=$HOME/.local --enable-optimizations
make -j8
make -j8 test
make install

Processing a single source file

A source file is processed in several steps, which are described in the sections below.

Symtab

A fully qualified name is the absolute path to an abstract entity, using '.'s to separate the path items.
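
For example, in a hypothetical module foo/bar.py (the scheme shown for locals and other corner cases is simplified):

# foo/bar.py -- the FQNs assigned to its entities:
sep = ','            # foo.bar.sep

class C:             # foo.bar.C
    def f(self):     # foo.bar.C.f
        pass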

The symbol table ("symtab") is a mapping of fully qualified names (FQNs) to their "types". (The word "type" is used a bit loosely; essentially it is a term such as class_type(FQN, Bases) or module_type(module_and_token(Module, Path, Token)).) This is different from the usual definition of a symtab, which maps a single ID to a type and keeps a "stack" of the IDs in the various scopes. (The FQN resolution is done in two passes in the Python program that parses the source and outputs an AST.)

For example, in file foo/bar.py:

import os
sep = os.sep

the symtab would contain entries like the following (note that os results in both a definition in the namespace of module foo.bar and a reference to the imported module in typeshed).

'foo.bar.os': module_type(module_alone(
    '${FQN_TYPESHED}.os',
    '${FQN_TYPESHED}/stdlib/os/__init__.pyi'))
'foo.bar.sep': class_type('${FQN_TYPESHED}.stdlib.builtins.str', [])
'${FQN_TYPESHED}.os': module_type(module_alone(
    '${FQN_TYPESHED}.os',
    '${FQN_TYPESHED}/stdlib/os/__init__.pyi'))
'${FQN_TYPESHED}.os.sep': class_type('${FQN_TYPESHED}.stdlib.builtins.str', [])

A symtab is self-contained: there is no need to import the symtabs of the modules that it imports. However, the imports' symtabs need to be checked (recursively) to ensure that they are up to date; if any of them is out of date, the entire chain of symtabs needs to be recomputed.

When an "import " is processed, the symtab entries that start with the module name are added to the importing module's symtab (e.g., if foo.py has `from bar import , then all symtab entries that start withpath.to.barare added to the symtab, with thepath.to.bar replaced bypath.to.foo`).

Builtins are added to the "scope" of each program, as if there were a from builtins import * added to each program. This is not how Python actually resolves builtin name lookup, but the effect is essentially the same (with a few corner cases that probably don't matter for real programs; after all, pykythe can't figure out everything anyway, because of Python's dynamic nature). In this way, we avoid having a stack of symtabs.
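
A minimal sketch of this merging, assuming the symtab is a Python dict from FQN strings to type terms (the actual implementation is in Prolog, and these names are hypothetical):

def merge_import_star(symtab, from_fqn, into_fqn):
    # Copy every entry under from_fqn (e.g. 'path.to.bar') into the
    # importing module's namespace (e.g. 'path.to.foo'), rewriting the
    # FQN prefix.  Builtins get the same treatment, as if each module
    # started with 'from builtins import *'.
    prefix = from_fqn + '.'
    for fqn, type_term in list(symtab.items()):
        if fqn.startswith(prefix):
            symtab[into_fqn + fqn[len(from_fqn):]] = type_term

# e.g. merge_import_star(symtab, 'path.to.bar', 'path.to.foo')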

Imports, cache, and batches

When a source file is processed, all of its imports must be processed first. Pykythe doesn't depend on dependency information from a build system; such information typically isn't available for Python, in contrast to C++ and similar languages, where the build system or the compiler (e.g., gcc -M) provides it, although importlab could be used. Additionally, there can be circular imports, which prevent a strict ordering of dependencies. (Python doesn't directly allow circular imports, but they can be simulated by putting import statements inside functions, so that they are evaluated at run time rather than when the module is first compiled.)

Pykythe processes each import statement as it is encountered. Circular imports are handled by keeping track of all imports that are "in process" and skipping them when they are encountered a second time (more details on this are in the Symtab section).
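
The cycle-breaking idea, as a sketch (hypothetical Python; the real logic lives in the Prolog code):

def process_module(fqn, graph, in_process, done):
    """graph maps a module FQN to the FQNs of the modules it imports."""
    if fqn in done or fqn in in_process:
        return                 # already processed, or a circular import: skip
    in_process.add(fqn)
    for imported in graph.get(fqn, ()):
        process_module(imported, graph, in_process, done)
    in_process.discard(fqn)
    done.add(fqn)              # the real code would build fqn's symtab here

# e.g., mutually importing modules don't recurse forever:
process_module('a', {'a': ['b'], 'b': ['a']}, set(), set())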

When processing a codebase, reprocessing the imports can easily lead to O(N³) behavior (where N is the number of lines of code), so a cache is used to avoid this. For each .py or .pyi file, a corresponding .kythe.json file (and .kythe.entries) is created, containing all the Kythe facts; the pykythe symtab and a hash of the source file go into a .pykythe.symtab file. When an import is encountered, the cache file is used if possible.

If any of the imported cache files is not usable, all source files that depend on it must be reprocessed to generate new cache files. That is, if a.py imports b.py, b.py imports c.py, and c.py imports d.py, and d.pykythe.symtab is usable but c.pykythe.symtab is not (because c.py has changed since c.pykythe.symtab was created), then c.py, b.py, and a.py must be reprocessed (d.py's cache can be reused as-is).
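
A sketch of that recursive up-to-date check, with the recorded and current source hashes and the import graph as plain dicts (hypothetical names; the recorded hash actually lives in the .pykythe.symtab file):

def cache_ok(path, recorded, current, deps, memo=None):
    """True iff path's cache is usable: its recorded source hash matches the
    current one, and (recursively) all of its imports' caches are usable."""
    memo = {} if memo is None else memo
    if path not in memo:
        memo[path] = False     # pessimistic default; also breaks import cycles
        memo[path] = (recorded.get(path) == current.get(path) and
                      all(cache_ok(d, recorded, current, deps, memo)
                          for d in deps.get(path, ())))
    return memo[path]

deps = {'a.py': ['b.py'], 'b.py': ['c.py'], 'c.py': ['d.py']}
recorded = {'a.py': 1, 'b.py': 2, 'c.py': 3, 'd.py': 4}
current = {'a.py': 1, 'b.py': 2, 'c.py': 99, 'd.py': 4}   # c.py has changed
# cache_ok('d.py', ...) is True; 'c.py', 'b.py', 'a.py' are all False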

This kind of caching reduces the algorithm to O(M²) (where M is the number of files), which is still a problem. A further refinement reduces the cost to O(N). Assuming that we can snapshot the code base (the snapshot is identified by a "batch id", given with the --batch_suffix command-line option, typically a nanosecond timestamp plus a random number), a single pass over all the source files suffices.
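
One way to picture the batch refinement (again hypothetical): the memo from the sketch above simply lives for the whole batch, keyed by the batch id, so each file's validity is computed at most once per snapshot:

batch_memos = {}               # batch id -> memo dict, kept for the whole run

def cache_ok_in_batch(batch_id, path, recorded, current, deps):
    return cache_ok(path, recorded, current, deps,
                    batch_memos.setdefault(batch_id, {}))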

To sum up, pykythe uses two kinds of caches: the per-file cache files (.kythe.json, .kythe.entries, .pykythe.symtab) described above, and the batch-level cache identified by the --batch_suffix id.

Multiple pykythe processes can run at the same time; all output is "atomic", as long as it's all on the same file system. (That is, if the same file is being processed by two processes at the same time, their outputs won't interfere with each other: they should both have the same content, and they are written atomically, so doing the same thing multiple times doesn't matter; the actions are "idempotent".)
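
The "atomic output" behavior is the usual write-to-a-temporary-file-then-rename idiom; a sketch of what that looks like in Python (not the actual pykythe code):

import os, tempfile

def atomic_write(path, data):
    # The temporary file must be on the same file system as the target,
    # so that os.replace is an atomic rename; a reader then sees either
    # the old complete file or the new complete file, never a mixture.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise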

(A detail: to allow for read-only source trees, the cache files are in a separate directory, as specified by --kytheout. This could in future be generalized by specifying patterns for transforming an input file to an output file, e.g. kytheout_pattern='s!/stuff/to/remove/(.*/)([^/]*)\..*!/path/to/dir/\\1\\2.kythe.json!' ... this probably should allow multiple patterns, to accommodate files in multiple repositories.)
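
In Python terms, that generalization would amount to a sed-style substitution on each input path (both the pattern and the paths here are made-up examples):

import re

def cache_path(src_path):
    # s!/stuff/to/remove/(.*/)([^/]*)\..*!/path/to/dir/\1\2.kythe.json!
    return re.sub(r'/stuff/to/remove/(.*/)([^/]*)\..*',
                  r'/path/to/dir/\1\2.kythe.json', src_path)

# cache_path('/stuff/to/remove/pkg/mod.py') == '/path/to/dir/pkg/mod.kythe.json'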

To give an idea of the performance improvement when processing the 7071 files in the Python library (this is total CPU time; wall time was roughly 1/3 of that on a 4-CPU machine, with SSD, so the effects of the cache are smaller than with HDD, running up to 8 pykythe processes in parallel):

Base64

For the JSON version of the Kythe facts, base64 encoding is used, to handle some data being pure binary rather than UTF-8 (e.g., a /kythe/text fact whose content is in Latin-1 encoding, which is not legal UTF-8). Unfortunately, Prolog's base64 support isn't especially fast, nor is its UTF-8 encoding/decoding (the latter is needed because base64 works on ASCII only, so encoding/decoding from Unicode is required).

For performance reasons, the Kythe facts are all created as regular strings (or atoms) and converted to base64 only on output. Each pass over the AST creates Kythe facts, so there is a significant saving in doing the base64 translation only once. (Some micro-optimizations are possible by distinguishing between strings that can only be ASCII and those that can be any Unicode; these probably aren't worth doing, and things are simpler when just assuming UTF-8 everywhere.)
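
The trade-off, expressed in Python for concreteness (the real code is Prolog, and the JSON field names here are simplified):

import base64, json

def fact_to_json(source, fact_name, value):
    # The fact value is kept as an ordinary string through all the AST
    # passes; only at output time is it UTF-8 encoded and then base64'd,
    # so the expensive conversions happen once per fact, not once per pass.
    return json.dumps({
        'source': source,
        'fact_name': fact_name,   # e.g. '/kythe/text'
        'fact_value': base64.b64encode(value.encode('utf-8')).decode('ascii'),
    })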

Known issues