jjmccollum / open-cbgm

Fast, compact, open-source, TEI-compliant C++ implementation of the Coherence-Based Genealogical Method
MIT License

open-cbgm

Version 1.7.0, MIT License

About This Project

Introduction

The Coherence-Based Genealogical Method (CBGM) is a novel approach to textual criticism, popularized by the Institut für Neutestamentliche Textforschung (INTF) for its use in the production of the Editio Critica Maior (ECM) of the New Testament. It is a meta-method, combining methodology-dependent philological decisions from the user with efficient computer-based calculations to highlight genealogical relationships between different stages of the text. To establish genealogical relationships in the presence of contamination (understood to be a problem in the textual tradition of the New Testament), the CBGM makes a number of philosophical and methodological innovations, such as the abstracting of texts away from the manuscripts that preserve them (and the resulting rejection of hypothetical ancestors as used in traditional stemmata), the encoding of the textual critic's decisions in local stemmata of variants, and the use of coherence in textual flow to evaluate hypotheses about the priority of variant readings.

To learn more about the CBGM, see Tommy Wasserman and Peter J. Gurry, A New Approach to Textual Criticism: An Introduction to the Coherence-Based Genealogical Method, RBS 80 (Atlanta: SBL Press, 2017); Peter J. Gurry, A Critical Examination of the Coherence-Based Genealogical Method in the New Testament, NTTSD 55 (Leiden: Brill, 2017); Andrew Charles Edmondson, "An Analysis of the Coherence-Based Genealogical Method Using Phylogenetics" (PhD diss., University of Birmingham, 2019); and Gerd Mink, "Problems of a Highly Contaminated Tradition: The New Testament: Stemmata of Variants as a Source of Genealogy for Witnesses," in Pieter van Reenen, August den Hollander, and Margot van Mulken, eds., Studies in Stemmatology II (Philadelphia, PA: Benjamins, 2004), 13–85. The author of this library has also written an illustrated crash course on the CBGM, which is available at https://acu-au.academia.edu/JoeyMcCollum.

Design Philosophy

This is not the first software implementation of the CBGM. To our knowledge, the earliest software designed for this purpose was the Genealogical Queries tool developed by the INTF (http://intf.uni-muenster.de/cbgm2/GenQ.html). The underlying data was restricted to the Catholic Epistles, and both the collation data and the source code were inaccessible to the user. More recently, the INTF and the Cologne Center for eHumanities (CCeH) made significant updates in a version used for Acts (https://github.com/cceh/ntg, web app at https://ntg.cceh.uni-koeln.de/acts/ph4/). While this updated version is much more transparent with its source code and underlying collation data, it is not obvious how to perform some of the more basic tasks (e.g., working with new collation data, reorienting local stemmata). Most recently, Andrew Edmondson developed an open-source Python implementation of the CBGM (https://github.com/edmondac/CBGM), but according to his own description, "It was designed for testing and changing the various algorithms and is not (therefore) the fastest user-facing package."

In light of the strengths and shortcomings of its predecessors, the open-cbgm library was developed with the following desired features in mind:

  1. open-source: The code should be publicly available and free to be used and modified by anyone.
  2. compact: The code should have a minimal memory footprint and minimal dependencies on external libraries. The input and output of the software should also be as concise as possible.
  3. fast: The code should be optimized with performance and scalability in mind.
  4. compliant: The input and output of the software should adhere as closely as possible to established standards in the field, so as to facilitate its incorporation into established workflows.

To achieve the goal of feature (1), we have made our entire codebase available under the permissive MIT License. None of the functionality "under the hood" is obfuscated, and any user with access to the input file can customize local stemmata and connectivity parameters. (Incidentally, this fulfills a desideratum for customizability expressed by Wasserman and Gurry, A New Approach to Textual Criticism, 119–120.)

With respect to feature (2), the core library makes use of only three external libraries: the cxxopts command-line argument parsing library (https://github.com/jarro2783/cxxopts), which within the core library is used only for test-running purposes; the pugixml library (https://github.com/zeux/pugixml), which allows for fast parsing of potentially large XML input files; and the CRoaring compressed bitmap library (https://github.com/RoaringBitmap/CRoaring), to encode pregenealogical and genealogical relationships succinctly and in a way that facilitates fast computation. The first two libraries were deliberately chosen for their lightweight nature, and the third was an easy choice for its excellent performance benchmarks. The only input source for the software is an XML file encoding collation data.
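
To illustrate why bitmaps suit this kind of data, here is a minimal sketch; it is not the library's actual code. It uses std::bitset from the C++ standard library as a stand-in for a CRoaring compressed bitmap, and the names UnitSet, kMaxUnits, extant, agreements, and agreement_ratio are hypothetical. Each witness is assumed to be given as one reading ID per variation unit, with an empty string marking a lacuna.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical fixed capacity for this sketch; a compressed bitmap
// library would not need a compile-time bound.
constexpr std::size_t kMaxUnits = 256;
using UnitSet = std::bitset<kMaxUnits>;

// Set of variation-unit indices where the witness has a reading
// (an empty string is assumed to mark a lacuna).
UnitSet extant(const std::vector<std::string>& readings) {
    assert(readings.size() <= kMaxUnits);
    UnitSet bits;
    for (std::size_t i = 0; i < readings.size(); ++i) {
        if (!readings[i].empty()) bits.set(i);
    }
    return bits;
}

// Set of variation-unit indices where both witnesses are extant
// and share the same reading.
UnitSet agreements(const std::vector<std::string>& a,
                   const std::vector<std::string>& b) {
    assert(a.size() == b.size() && a.size() <= kMaxUnits);
    UnitSet bits;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (!a[i].empty() && a[i] == b[i]) bits.set(i);
    }
    return bits;
}

// Fraction of the units where both witnesses are extant at which
// they agree; returns 0.0 if they share no extant units.
double agreement_ratio(const std::vector<std::string>& a,
                       const std::vector<std::string>& b) {
    const UnitSet both = extant(a) & extant(b);
    if (both.none()) return 0.0;
    return static_cast<double>(agreements(a, b).count()) /
           static_cast<double>(both.count());
}
```

Once the sets are built, comparing two witnesses reduces to bitwise intersections and population counts, which is the kind of operation a bitmap representation makes cheap.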

For feature (3), we made performance-minded decisions at all levels of design. In addition to our deliberate choices of the external libraries mentioned above, we designed our own code with the same goal in mind. We stored data keyed by witnesses in hash tables for efficient average-case access. Due to the combinatorial nature of the problem of substemma optimization, we implemented common heuristics known to be effective at solving similar types of problems in practice. As a result, the open-cbgm library can parse the ECM collation of 3 John (consisting of 137 witnesses and 116 variation units), calculate the pregenealogical and genealogical relationships between its witnesses, and write this data to the cache in just over two minutes, and it can generate a preliminary global stemma for the entire book in under ten seconds on a desktop computer.
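
The hash-table layout mentioned above can be sketched as follows. This is an illustrative assumption about how witness-keyed data might be organized, not the library's real data model: the WitnessTable alias and the count_agreements and rank_relatives functions are hypothetical names, and the ranking uses raw agreement counts only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical layout: witness ID -> reading ID at each variation unit
// (an empty string is assumed to mark a lacuna).
using WitnessTable = std::unordered_map<std::string, std::vector<std::string>>;
using Ranked = std::pair<std::string, std::size_t>;

// Number of variation units where both witnesses are extant and agree.
std::size_t count_agreements(const std::vector<std::string>& a,
                             const std::vector<std::string>& b) {
    assert(a.size() == b.size());
    std::size_t n = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (!a[i].empty() && a[i] == b[i]) ++n;
    }
    return n;
}

// List every other witness with its agreement count against `id`,
// sorted by descending count and then by ascending witness ID.
std::vector<Ranked> rank_relatives(const WitnessTable& table,
                                   const std::string& id) {
    // Average constant-time lookup; at() throws std::out_of_range
    // if the witness ID is not in the table.
    const std::vector<std::string>& target = table.at(id);
    std::vector<Ranked> ranked;
    for (const auto& entry : table) {
        if (entry.first == id) continue;
        ranked.emplace_back(entry.first, count_agreements(target, entry.second));
    }
    std::sort(ranked.begin(), ranked.end(),
              [](const Ranked& x, const Ranked& y) {
                  if (x.second != y.second) return x.second > y.second;
                  return x.first < y.first;
              });
    return ranked;
}
```

Keying by witness ID means that retrieving any one witness's data does not require scanning the whole collation, which matters once the number of witnesses grows into the hundreds.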

Finally, for feature (4), we ensured compliance with the XML standard of the Text Encoding Initiative (TEI) by designing the software to expect an input in the TEI XML format used by the INTF and the International Greek New Testament Project (IGNTP) in their transcriptions and collations. (For specific guidelines, see TEI Consortium, eds. [2019], TEI P5: Guidelines for Electronic Text Encoding and Interchange (version 3.6.0), manual, TEI Consortium, http://www.tei-c.org/Guidelines/P5/ [accessed 1 January 2020], and Houghton, H.A.G. [2016], IGNTP Guidelines for XML Transcriptions of New Testament Manuscripts (version 1.5), manual, International Greek New Testament Project [unpublished].) We made this possible by encoding common features of the CBGM as additional TEI XML elements in the input collation file (e.g., the feature set <fs/> element for a variation unit's connectivity and the <graph/> element for its local stemma). Sample collation files demonstrating this encoding can be found in the examples directory.

The local stemma, textual flow, and global stemma classes are equipped with methods to serialize the contents of their graphs in .dot format. This format is the standard used by the graphviz visualization software (https://www.graphviz.org/), which in turn is the standard used by existing CBGM software.
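
As a rough illustration of what such serialization involves, the sketch below writes a small directed graph in .dot syntax. The LocalStemma struct and the to_dot function are hypothetical and much simpler than the library's own classes; the sketch also assumes that labels and reading IDs contain no double-quote characters, since it does not escape them.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal local stemma: a label, the reading IDs of one
// variation unit, and directed edges from prior to posterior readings.
struct LocalStemma {
    std::string label;
    std::vector<std::string> readings;
    std::vector<std::pair<std::string, std::string>> edges;
};

// Serialize the stemma as a graphviz digraph. Quoting every ID keeps
// the output valid even for IDs that are not plain identifiers.
std::string to_dot(const LocalStemma& ls) {
    // This sketch does not escape embedded quotes, so reject them.
    assert(ls.label.find('"') == std::string::npos);
    std::ostringstream out;
    out << "digraph local_stemma {\n";
    out << "\tlabel=\"" << ls.label << "\";\n";
    for (const std::string& r : ls.readings) {
        out << "\t\"" << r << "\";\n";
    }
    for (const auto& e : ls.edges) {
        out << "\t\"" << e.first << "\" -> \"" << e.second << "\";\n";
    }
    out << "}\n";
    return out.str();
}
```

If the returned text is saved to a file such as stemma.dot (a placeholder name), graphviz can render it with a command like dot -Tpng stemma.dot -o stemma.png.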

Installation and Dependencies

To install the library on your system, you can clone this Git repository or download its contents in a .zip archive (click the "Code" tab near the top of this page). If you have Git installed (https://git-scm.com/), you can clone this repository from the command line using the command

git clone https://github.com/jjmccollum/open-cbgm.git

This will copy the latest version of the repository to an open-cbgm subdirectory of your current directory. From the command line, enter the new directory with the command

cd open-cbgm

As mentioned above, the open-cbgm library uses three external libraries (cxxopts, pugixml, and roaring). All three libraries are included as Git submodules of this repository; if you do not have the submodules initialized, then you must do so with the command

git submodule update --init --recursive

With respect to graph outputs, the open-cbgm library does not generate image files directly; instead, for the sake of flexibility, it is designed to generate textual graph description files in .dot format, which can subsequently be converted to image files in a variety of formats using graphviz. Platform-specific instructions for installing graphviz and directions on how to get image files from .dot outputs using graphviz can be found at https://www.graphviz.org.

Once you have all dependencies installed, you need to build the project. The precise details of how to do this will depend on your operating system, but in all cases, you will need to have the CMake toolkit installed. The platform-specific installation instructions at https://github.com/jjmccollum/open-cbgm-standalone can be used for this repository.

In the interest of modularity, and to facilitate the incorporation of this library into other applications and APIs, this repository contains only the core classes of the library, without a user-facing interface. A standalone command-line interface that uses a SQLite database as a genealogical cache is available at https://github.com/jjmccollum/open-cbgm-standalone. The core library is included in that interface's git repository as a submodule and can be installed within it using the installation instructions on that page. A web-facing API is also in the works; updates will be posted in the issues for this repository.

Building

If you wish to incorporate the open-cbgm library as a dependency for your own libraries or executables, you can build it by itself either as a static library or as a shared library. For a static library, invoking cmake and pointing to the directory containing the root-level CMakeLists.txt file will generate all of the appropriate Makefiles or Visual Studio project files to build the library statically. For a shared library, adding the -DBUILD_SHARED_LIBS=ON argument after cmake will generate the appropriate files to build the library dynamically. If you want to generate the unit tests for the library, add the -DBUILD_TESTS=ON argument; the test suite will be generated in the autotest executable.

Citation

To cite this software, please use the information associated with its DOI page: DOI.