impresso / federal-gazette


Mini-report about the federal-gazette canonization #9

Closed aflueckiger closed 3 years ago

aflueckiger commented 4 years ago

The procedure and make instructions may not be self-contained and readily understandable, as there are a few hacks and workarounds. This is largely caused by anomalies (inconsistencies) in the original data. A brief report is needed that documents the most important parts.

aflueckiger commented 4 years ago

This documentation is outdated; please refer to the README.

Mini-Report on Transforming the Federal Gazette Archive into Canonical Data

This brief report describes the transformation of the raw data of the Federal Gazette into a canonical format for ingestion into the Impresso platform. The collection and processing of the data are organized per language (DE, FR, IT). The directories have the following structure: data_XX/RESOURCE/YEAR/MONTH/DAY/EDITION.
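For illustration only, a path under this layout could look like the following (the concrete date and edition label are made up):

```
data_de/FedGazDe/1901/06/04/a/
```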

The processing pipeline builds on GNU make and follows a recursive structure. The recursive call of the main Makefile is necessary because not all files can be tracked over the entire process, due to unknown outputs from some of the invoked Python scripts.

To start the entire pipeline for a single language, run the following command: make -f Makefile run-pipeline-% (TODO: write the final pipeline)

The wildcard % has to be replaced with the canonical name of the resource corresponding to the respective language: FedGazDe, FedGazFr, FedGazIt. In the case of the German FedGazDe, for example, the recipe recursively runs the following sub-recipes:

make -f Makefile download-index-all (download the entire index for all languages)

make -f Makefile download-FedGazDe.bash (get PDFs and metadata)

make -f Makefile extract-tif-FedGazDe.bash (extract TIF images from the PDF files)
make -f Makefile rename-tif-FedGazDe.bash (rename and decompress the extracted TIF images)
make -f Makefile jp2-FedGazDe-target (convert to canonical JPEG2000)

make -f Makefile tetml-word-FedGazDe-target (create TETML from the PDF files)
make -f Makefile data-ingest-FedGazDe (ingest into the Impresso platform)

Other functions that are not part of the core pipeline:

A script to produce random image samples to evaluate the visual alignment between the coordinates of textual elements in the canonical JSON files and the canonical images: make -f Makefile eval_coordinates_FedGazDe
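As a rough illustration of such a visual check (not the actual evaluation script), the following sketch overlays word boxes from a canonical JSON page onto the corresponding image using Pillow; the field name and box format are assumptions.

```python
# Hypothetical sketch: overlay word bounding boxes from a canonical JSON page
# onto the corresponding image to eyeball the alignment. The field name
# "boxes" and the [x, y, w, h] format are assumptions, not the exact schema.
import json
from PIL import Image, ImageDraw

def overlay_boxes(image_path: str, page_json_path: str, out_path: str) -> None:
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    with open(page_json_path, encoding="utf-8") as f:
        page = json.load(f)
    for x, y, w, h in page.get("boxes", []):
        draw.rectangle([x, y, x + w, y + h], outline=(255, 0, 0), width=2)
    img.save(out_path)

# overlay_boxes("page.tif", "page.json", "page_overlay.png")
```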

Download and Description of the Data

Firstly, the metadata is retrieved from the official website and stored in article-info-FedGazDe.tsv. The crawled links recorded in this file are subsequently used to download the corresponding PDFs. Together, the metadata file and the PDFs are the only data used by the canonization procedure described here. A single PDF corresponds to a single article and has a unique identifier as its filename (e.g. 10115466.pdf). In the case of an in-page article boundary, the last page, which comprises the boundary, is redundantly assigned to both the former and the latter article (i.e., PDF). For the sake of consistency, such pages need additional processing during the transformation into canonical data.
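For illustration, a minimal sketch of the download step could look as follows; the column names (article_docid, pdf_link) are assumptions, as the exact schema of article-info-FedGazDe.tsv is not documented in this report.

```python
# Hypothetical sketch: fetch one PDF per metadata row. The column names
# "article_docid" and "pdf_link" are assumptions for illustration.
import csv
import urllib.request
from pathlib import Path

def download_pdfs(metadata_tsv: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(metadata_tsv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            target = out / f"{row['article_docid']}.pdf"  # e.g. 10115466.pdf
            if not target.exists():
                urllib.request.urlretrieve(row["pdf_link"], target)

# download_pdfs("article-info-FedGazDe.tsv", "pdf_de")
```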

Canonical Textual Content

The textual content of the OCR-processed documents is rendered as an invisible text layer and is extracted using the TET library. The extraction into TETML works on the level of words while preserving all details about individual glyphs, which are subsequently used to compute the coordinates of the bounding boxes of words and paragraphs, respectively. Moreover, the hyphenation of words is kept and reconstructed at a later stage. The TET extraction routine ignores pages that only contain metadata by defining the page range comprising the actual content. The resulting TETML may be directly fed into the generic tetml-importer provided as part of our impresso-text-acquisition package. The tetml-importer parses the TETML issue-wise and transforms it into the canonical JSON format used by the Impresso platform. Specifically, the procedure yields a JSON file with an issue overview per year, along with the JSON files containing the actual content. The coordinates are computed relative to the size of the TIF image (the actual scan), whereby a specific text element gets the coordinates of the bounding box surrounding all lower-level elements (region > paragraph > line > word).

Yet, to allow for a logical rather than a physical article segmentation and to avoid duplicated content (see above), the tetml-importer-fedgaz has to be used. This custom importer inherits most of its class methods from the generic importer and accounts for the specificities of the FedGaz data. It may well be used for other resources that need a post-hoc logical article segmentation.
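As an illustration of the coordinate computation (not the importer's actual code), a bounding box of a higher-level element can be derived from the word boxes roughly as follows, assuming an (x, y, width, height) convention in pixels of the TIF scan:

```python
# Hypothetical sketch: merge word-level boxes into the bounding box of the
# enclosing element (line, paragraph, region), assuming (x, y, width, height)
# in pixels of the TIF scan.
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height

def merge_boxes(boxes: List[Box]) -> Box:
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[0] + b[2] for b in boxes)
    y2 = max(b[1] + b[3] for b in boxes)
    return (x1, y1, x2 - x1, y2 - y1)

# Word boxes of one line -> line box; line boxes -> paragraph box; and so on.
words = [(100, 200, 60, 20), (170, 200, 45, 22), (220, 198, 80, 24)]
print(merge_boxes(words))  # (100, 198, 200, 24)
```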

Logical Article Segmentation

Since the page numbers in the metadata may not be in line with the physical page numbering printed on the respective pages, the procedure also requires some additional data. For example, some articles belonging to the appendix (Beilagen) may have a length of one page and, according to the official metadata, overlap with other articles at the end of the issue. To account for this incorrect information, pdfinfo collects the actual length of the articles. Based on the original metadata and the physical length, a script computes the actual article span and indicates potential overlaps with multiple heuristics. The extended metadata are saved (e.g. article-info2-FedGazDe.tsv) and used in the process of the logical article segmentation.

As a requirement of the tetml-importer-fedgaz, the most recent metadata file is copied to the top-level directory of the TETML folder and named metadata.tsv. The importer uses this resource to look up additional information and to narrow down the article candidates with potential in-page boundaries. Specifically, the metadata needs to include the following columns: article_docid (used as lookup key), article_title, volume_language, pruned, canonical_page_first, and canonical_page_last (indicating the last page of an article covering a full page and, thus, excluding a potential remainder before an in-page boundary).

This non-generic importer implements a logical article segmentation to determine the actual boundary and reassign the content to the respective article. To detect the boundary, a fuzzy-matching procedure tries to find the title of the subsequent article within its first page. The fuzziness is defined as a function of the length of the shortened title and works only on reasonably well-restored OCR texts. A too relaxed threshold would lead to a severe loss of performance and to false positives. In cases where the boundary cannot be found, the remainder of an article gets assigned to the next article, ensuring non-redundant content.
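A rough sketch of such a fuzzy title search, using Python's standard difflib; the actual importer may use a different matcher, and the length-dependent threshold shown here is an assumption:

```python
# Hypothetical sketch: locate the (shortened) title of the subsequent article
# on the candidate page via fuzzy matching. The threshold formula is an
# assumption; the actual importer may scale the fuzziness differently.
from difflib import SequenceMatcher
from typing import List, Optional

def find_boundary(page_words: List[str], next_title: str, max_tokens: int = 6) -> Optional[int]:
    """Return the word index where the next article likely starts, or None."""
    title_tokens = next_title.lower().split()[:max_tokens]
    window = len(title_tokens)
    needle = " ".join(title_tokens)
    # Fuzziness as a function of the shortened title's length: longer titles
    # tolerate a lower similarity ratio.
    threshold = max(0.7, 1.0 - 0.05 * window)
    best_idx, best_score = None, 0.0
    for i in range(len(page_words) - window + 1):
        candidate = " ".join(w.lower() for w in page_words[i:i + window])
        score = SequenceMatcher(None, needle, candidate).ratio()
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx if best_score >= threshold else None
```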

Canonical Images

Due to limited conversion functionalities, the canonical images have to be transformed in several steps. Firstly, TET extracts TIF files from all the PDFs containing scanned documents. As in the transformation of the textual data, the images are only extracted up to the last full page in order to avoid redundant images caused by in-page article boundaries. As some of the TIF files are compressed, jbig2dec is used to decode the images. Subsequently, the black-and-white images are converted into 8-bit grayscale using ImageMagick before they are compressed by OpenJPEG as JPEG2000 with a compression ratio of 15 for publication on the Impresso platform.
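The grayscale and JPEG2000 steps could be scripted roughly as follows; the exact flags used by the project's Makefile are not documented in this report, so the invocations below are plausible assumptions rather than the actual recipes (the jbig2dec decoding step is omitted):

```python
# Hypothetical sketch of the grayscale and JPEG2000 steps; the flags below are
# plausible defaults, not necessarily those used in the project's Makefile.
import subprocess
from pathlib import Path

def to_canonical_jp2(tif_path: str) -> Path:
    src = Path(tif_path)
    gray = src.parent / (src.stem + "_gray.tif")
    jp2 = src.parent / (src.stem + ".jp2")
    # Convert the bitonal scan to 8-bit grayscale with ImageMagick.
    subprocess.run(
        ["convert", str(src), "-colorspace", "Gray", "-depth", "8", str(gray)],
        check=True,
    )
    # Compress to JPEG2000 with OpenJPEG at a compression ratio of 15.
    subprocess.run(
        ["opj_compress", "-i", str(gray), "-o", str(jp2), "-r", "15"],
        check=True,
    )
    return jp2
```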

TODO: upload JPEG2000 to Impresso

aflueckiger commented 3 years ago

integrate this report into readme with