Code clone detection; clone-related bug detection; semantic clone analysis
This is a release package of Deckard -- a tree-based, scalable, and accurate code clone detection tool. It is also capable of reporting clone-related bugs.


In bash shell or cygwin, go into the folder:


and run the build script:


For convenience, you can add "/path/to/src/main" to $PATH.
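
As a minimal sketch of that step in a bash-like shell, the snippet below prepends the directory to PATH for the current session only. "/path/to/src/main" is the placeholder used above, and DECKARD_BIN is a hypothetical variable name introduced just for this example.

```shell
# Prepend Deckard's executables directory to PATH for the current shell session.
# Replace /path/to/src/main with the real location of your Deckard build.
DECKARD_BIN="/path/to/src/main"   # hypothetical variable name, not part of Deckard
case ":$PATH:" in
    *":$DECKARD_BIN:"*) ;;                          # already on PATH, nothing to do
    *) PATH="$DECKARD_BIN:$PATH"; export PATH ;;    # otherwise put it in front
esac
```

To make the change permanent, the same lines could go into your shell startup file (for example ~/.bashrc).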

NOTE: Deckard's built-in parser previously could not handle features introduced in Java 1.5 or later. It has been upgraded to support Java 7 syntax, so it should now be able to generate vectors for Java files that use Java 6 and 7 features.

NOTE: On Windows Vista/7, the compiled executables may fail to run (showing "Permission Denied") because of false alarms from UAC rules (based on the file path/hash of a .exe). A simple, though possibly undesirable, workaround is to run the cygwin shell with elevated privileges before invoking the above scripts. Also, Deckard may run tens of times slower in cygwin than on Linux due to slow I/O operations.

To uninstall, go into the folder:


and simply run:



  1. For clone detection (suppose the source code of your application is in /path/to/app/src):

    • Specify the location of your source code, say /path/to/app/src.

    • Create a "config" file in /path/to/app/, following the sample "config" in samples/ or the template "config-sample" in scripts/clonedetect/. Make sure all paths are valid and the programming language is specified correctly.

    • (Optional) Create three other directories in /path/to/app/ for storing outputs (see what's in samples/). These directories may be created automatically if they are specified in 'config'.

    • Batch mode run of clone detection (no bug detection by default):


    An optional parameter to the script is 'clean', 'clean_all', or 'overwrite'

    • Instead of running '', you may also run the scripts called in '' step-by-step by yourself:

    -- Vector generation: from where "config" is, run


    An optional parameter to the script is 'clean', 'clean_all', or 'overwrite'

    -- Vector clustering (i.e., clone detection): from where "config" is, run


    An optional parameter to the script is 'clean', 'clean_all', or 'overwrite'

  2. Vector generation for parts of a file:

    • Identify the source file name, say /path/to/src/, and the range [s, e] of line numbers for which you'd like a vector generated.

    • Run "src/main/jvecgen [options] /path/to/src/ --start-line-number s --end-line-number e". Run "jvecgen -h" for more options. Note that a different vecgen (cvecgen, jvecgen, or phpvecgen) should be used for each source language.

This vecgen command will generate a vector representing the code between Line 's' and 'e' in the source file, and store the vector in "" by default.
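
Since each language needs its own vecgen, a small wrapper can choose the generator from the file name. The sketch below is an illustration only: the function names pick_vecgen and vecgen_range are hypothetical, the extension-to-tool mapping is an assumption, and the relative path "src/main/" assumes the command is run from the top of the Deckard package. Only the two line-number options shown above are used.

```shell
# Pick a Deckard vector generator from a file's extension.
# The extension-to-tool mapping below is an assumption made for this sketch.
pick_vecgen() {
    case "$1" in
        *.java)   echo jvecgen ;;
        *.c|*.h)  echo cvecgen ;;
        *.php)    echo phpvecgen ;;
        *)        echo "unsupported file type: $1" >&2; return 1 ;;
    esac
}

# Generate a vector for lines s..e of a file.
# Arguments: source file, start line s, end line e.
vecgen_range() {
    gen=$(pick_vecgen "$1") || return 1
    "src/main/$gen" "$1" --start-line-number "$2" --end-line-number "$3"
}
```

With these definitions, "vecgen_range Foo.java 10 42" would invoke jvecgen on lines 10 to 42 of the hypothetical file Foo.java.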

  3. Detection of clone-related bugs:

    • Invoke 'bugfiltering' on a clone report file with a specified language, e.g.,

    /path/to/scripts/bugdetect/bugfiltering cluster_result c > bug_result

    • Optionally transform 'bug_result' into an HTML file so that the reported, potentially buggy clones are easier to inspect in a web browser:

    /path/to/src/main/out2html bug_result > bug_result.html

    • See '' for how to run it in a batch mode (not enabled by default).
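
The two manual commands above can be chained in a small helper. This is a sketch under stated assumptions, not part of Deckard: the function name deckard_bug_report and the DECKARD_HOME variable (pointing at the top of the Deckard package) are hypothetical, while the two tool paths and their arguments are the ones shown above.

```shell
# Run bug filtering on a clone report, then render the result as HTML.
# Arguments: clone report file, language (e.g. c), optional output name.
# DECKARD_HOME is a hypothetical variable pointing at the Deckard package root.
deckard_bug_report() (
    cluster_result="$1"
    lang="$2"
    out="${3:-bug_result}"
    home="${DECKARD_HOME:?set DECKARD_HOME to the Deckard package directory}"

    # Step 1: filter the clone report for potential clone-related bugs.
    "$home/scripts/bugdetect/bugfiltering" "$cluster_result" "$lang" > "$out" || exit 1

    # Step 2: convert the bug report to HTML for inspection in a browser.
    "$home/src/main/out2html" "$out" > "$out.html"
)
```

For example, "deckard_bug_report cluster_result c" would write bug_result and bug_result.html in the current directory. The function body runs in a subshell, so its variables do not leak into the calling shell.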

  1. Organization

The whole package is organized according to the several components in Deckard:

  1. Details about the clone/bug detection algorithms can be found in these two papers:

    • DECKARD: Scalable and Accurate Tree-based Detection of Code Clones, by Lingxiao JIANG, Ghassan MISHERGHI, Zhendong SU, and Stephane GLONDU. In the proceedings of 29th International Conference on Software Engineering (ICSE '07), Minneapolis, Minnesota, USA, 2007.

    • Context-Based Detection of Clone-Related Bugs, by Lingxiao JIANG, Zhendong SU, and Edwin CHIU. In the proceedings of the 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE'07), Dubrovnik, Croatia, 2007.


  1. How to get the subtree representing each clone?

Each clone in the reports has a TBID and a TEID, in addition to the file name and line numbers. The TBID and TEID are the IDs of the first and last tokens of the clone in the original file (which may contain parsing errors). To keep the ID counting consistent, leave the calculation of the IDs to "yyparse()" and Deckard's TokenCounter (see TraGenMain::run() for implementation details).

The following are the main steps for getting the subtree for a clone (please refer to "src/vgen/treeTra/token-tree-map.h" for more implementation details):

  1. How to get the vector for a line or a sequence of lines from a file?

    • Option 1: See above: Use "vector generation for parts of a file" with your scripts.

    • Option 2: Given the parse tree for a file (produced by TokenTreeMap::parseFile() and yyparse()) and the starting and ending line numbers, do the following:

    -- (If not done before,) Call Deckard's vector generator on the parse tree through TraGenMain::run, same as above. Please refer to src/main/, TraGenMain::run(int startln, int endln), and VecGenerator::traverse(Tree root, Tree init).

    -- Call the following function (cf. src/include/ptree.h, src/main/) to return the smallest tree enclosing all elements from these lines:

    Tree* ParseTree::line2Tree(int startln, int endln)

    -- Then retrieve the vector (the actual vector generation is done beforehand):

    TreeVector* tv = TreeAccessor::get_node_vector(tree_node_pointer)

Enjoy and Feedback :=) @Deckard : Am I a clone?