Highlight text in documents
txtmarker highlights text in documents. It takes a list of (name, text) pairs, scans an input document, and creates a modified version with the highlights embedded.
Current file formats supported:
The easiest way to install is via pip and PyPI:
pip install txtmarker
You can also install txtmarker directly from GitHub. Using a Python Virtual Environment is recommended.
pip install git+https://github.com/neuml/txtmarker
Python 3.8+ is supported.
The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.
Notebook | Description
---|---
Introducing txtmarker | Overview of the functionality provided by txtmarker
Highlighting with Transformers | AI-driven highlighting with Transformers
The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
extension: string
Type of highlighter to create (i.e. pdf)
formatter: callable
Callable used to format both queries and input text. Helps clean up files with lots of symbols and other extraneous content.
chunks: int
Splits queries into multiple chunks. This is designed for very long text matches.
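As a rough illustration of the formatter option described above, the sketch below defines a small cleanup function. The function name `clean` and its rules are hypothetical, and it assumes the formatter receives a string and returns a string. Passing it with the keyword `formatter=` is also an assumption based on the parameter names listed above; the import is guarded so the snippet still runs without txtmarker installed.

```python
import re


def clean(text):
    """Hypothetical formatter: replace symbols with spaces and collapse whitespace."""
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r"\s+", " ", text).strip()


# Try the formatter on a symbol-heavy string
print(clean("Results (n=42):   p < 0.05 *"))

try:
    from txtmarker.factory import Factory

    # Assumption: formatter is accepted as a keyword argument
    highlighter = Factory.create("pdf", formatter=clean)
except ImportError:
    # txtmarker is not installed; the formatter can still be tested on its own
    highlighter = None
```

Because the same function is applied to both queries and document text, a formatter like this only helps if it normalizes both sides in the same way.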
highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])
infile: string
Full path to input file
outfile: string
Full path to output file, i.e. the highlighted file
highlights: list of (string, string|regex)
List of highlight elements. Each pair has a name (which can be None) and a text value. The text can be either a string or a regular expression.
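The highlights argument described above could be assembled as in the following sketch. The sample text and the entries are made up for illustration. The regular expression is written as a pattern string here; whether txtmarker expects a pattern string or a compiled pattern is not stated above, so treat that choice as an assumption. The dry run uses only Python's `re` module and does not call txtmarker.

```python
import re

# Hypothetical text standing in for content extracted from a document
sample = "Patients received 50 mg daily. The median age was 64 years."

# (name, text) pairs: the name may be None, the text may be a string or a regex
highlights = [
    ("dose", "50 mg daily"),              # plain string
    (None, r"median age was \d+ years"),  # regular expression, unnamed
]

# Dry run: keep only the entries whose text is found in the sample
matched = [(name, text) for name, text in highlights if re.search(text, sample)]

for name, text in matched:
    print(name, "->", text)
```

A list built this way is what would be passed as the third argument to `highlighter.highlight`, after the input and output paths.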