Python module with bits of code (objects, functions) highly reusable within impresso.
http://impresso-pycommons.rtfd.io/
GNU Affero General Public License v3.0
impresso_pycommons

Python module with bits of code (objects, functions) highly reusable within the impresso project.

Please refer to the documentation for further information on this library.

Installation

With pip:

pip install impresso-commons

Notes

The library supports configuration of s3 credentials via project-specific local .env files.
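
As a rough sketch of what such a file might contain and how it could be read, the snippet below parses KEY=VALUE lines using only the standard library. The variable names S3_ACCESS_KEY and S3_SECRET_KEY and the helper load_env_file are assumptions made for illustration; the names the library actually expects are given in its documentation.

```python
import os
import tempfile


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo with a throwaway file; in a project this would be the local .env file.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# S3 credentials (hypothetical variable names)\n")
        handle.write("S3_ACCESS_KEY=my-access-key\n")
        handle.write('S3_SECRET_KEY="my-secret-key"\n')
    credentials = load_env_file(env_path)

print(sorted(credentials))
```

Keeping credentials in a project-local file like this avoids hard-coding them in scripts; the file itself should stay out of version control.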

License

The second project 'impresso - Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio' is funded by the Swiss National Science Foundation (SNSF) under grant number CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.

Aiming to develop and consolidate tools to process and explore large-scale collections of historical newspapers and radio archives, and to study the impact of this tooling on historical research practices, Impresso II builds upon the first project – 'impresso - Media Monitoring of the Past' (grant number CRSII5_173719, Sinergia program). More information at https://impresso-project.ch.

Copyright (C) 2024 The impresso team (contributors to this program: Matteo Romanello, Maud Ehrmann, Alex Flückinger, Edoardo Tarek Hölzl, Pauline Conti).

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Affero General Public License for more details.

Data Versioning

Motivation

The versioning package of impresso_commons contains several modules and scripts for versioning Impresso's data at various stages of the processing pipeline. The main goal of this approach is to version the data and track information at every stage to:

  1. Ensure data consistency and ease of debugging: Data elements should be consistent across stages, and inconsistencies/differences should be justifiable through the identification of data leakage points.
  2. Allow partial updates: It should be possible to (re)run all or part of the processes on subsets of the data, knowing which version of the data was used at each step. This can be necessary when new media collections arrive, or when an existing collection has been patched.
  3. Ensure transparency: Citation of the various data stages and datasets should be straightforward; users should know when using the interface exactly what versions they are using, and should be able to consult the precise statistics related to them.

Data Stages

Impresso's data processing pipeline is organised in three main data "meta-stages", mirroring the main processing steps. During each of those meta-stages, different formats of data are created as output of processes and in turn used as inputs to other downstream tasks.

  1. [Data Preparation]: Conversion of the original media collections to unified base formats which will serve as input to the various data enrichment tasks and processes. Produces prepared data.
    • Includes the data stages: canonical, rebuilt, evenized-rebuilt and passim (rebuilt format adapted to the passim algorithm).
  2. [Data Enrichment]: All processes and tasks performing text and media mining on the prepared data, through which media collections are enriched with various annotations at different levels, and turned into vector representations.
    • Includes the data stages: entities, langident, text-reuse, topics, ocrqa, embeddings, (and lingproc).
  3. [Data Indexation]: All processes of data ingestion of the prepared and enriched data into the backend servers: Solr and MySQL.
    • Includes the data stages: solr-ingestion-text, solr-ingestion-entities, solr-ingestion-emb, mysql-ingestion.
  4. [Data Releases]: Packages of Impresso released data, composed of the datasets of all previously mentioned data stages, along with their corresponding versioned manifests, to be cited on the interface.
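
The grouping above can be sketched as a small lookup structure. This is only an illustration: MetaStage, STAGES_BY_META_STAGE and meta_stage_of are hypothetical names, not the library's own DataStage definition, and the stage names are copied from the list above (which is itself marked as not final).

```python
from enum import Enum


class MetaStage(Enum):
    """Hypothetical enum for the three processing meta-stages."""

    PREPARATION = "data-preparation"
    ENRICHMENT = "data-enrichment"
    INDEXATION = "data-indexation"


# Stage names copied from the list above; the mapping itself is illustrative.
STAGES_BY_META_STAGE = {
    MetaStage.PREPARATION: ["canonical", "rebuilt", "evenized-rebuilt", "passim"],
    MetaStage.ENRICHMENT: [
        "entities", "langident", "text-reuse", "topics",
        "ocrqa", "embeddings", "lingproc",
    ],
    MetaStage.INDEXATION: [
        "solr-ingestion-text", "solr-ingestion-entities",
        "solr-ingestion-emb", "mysql-ingestion",
    ],
}


def meta_stage_of(stage):
    """Return the meta-stage a data stage belongs to, or raise ValueError."""
    for meta_stage, stages in STAGES_BY_META_STAGE.items():
        if stage in stages:
            return meta_stage
    raise ValueError(f"Unknown data stage: {stage!r}")


print(meta_stage_of("passim").value)
```

Data releases are left out of the mapping because they package the outputs of all other stages rather than adding a processing step of their own.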

TODO: Update/finalize the exact list of stages once every stage has been included.

Data Manifests

Versioning, which documents the data at each step through versions and statistics, is implemented through manifest files in JSON format that follow a specific schema. (TODO update JSON schema with yearly modif date.)

After each processing step, a manifest should be created documenting the changes made to the data by that processing. It can also be created on the fly during processing, or between processing runs to count and sanity-check the contents of a given S3 bucket. Once created, the manifest file is automatically uploaded to the S3 bucket corresponding to the data it was computed on, and optionally pushed to the impresso-data-release repository to keep track of all changes made throughout the versions.

Computing a manifest - compute_manifest.py script

The script compute_manifest.py allows one to compute a manifest on the data present within a specific S3 bucket. The CLI for this script is the following:

python compute_manifest.py --config-file=<cf> --log-file=<lf> [--scheduler=<sch> --nworkers=<nw> --verbose]

The file passed as --config-file should be a simple JSON file with specific arguments, all described here.

Computing a manifest on the fly during a process

It's also possible to compute a manifest on the fly during a process. This method is better suited when the output of the process is not stored on S3, e.g. for data indexation. To do so, some simple modifications should be made to the process's code:

  1. Instantiation of a DataManifest object: The DataManifest class holds all methods and attributes necessary to generate a manifest. It takes a relatively large number of input arguments (most of which are optional), which allow precise specification and configuration and ease all other interactions with the instantiated manifest object. All of them are also described in the manifest configuration:

    • Example instantiation:
    manifest = DataManifest(
        data_stage="passim", # DataStage.PASSIM also accepted
        s3_output_bucket="32-passim-rebuilt-final/passim", # includes partition within bucket
        s3_input_bucket="22-rebuilt-final", # includes partition within bucket
        git_repo="/local/path/to/impresso-pycommons",
        temp_dir="/local/path/to/git_temp_folder",
        staging=False, # If True, will be pushed to 'staging' branch of impresso-data-release, else 'master'
        is_patch=True,
        patched_fields=["series", "id"], # example of modified fields in the passim-rebuilt schema
        previous_mft_path=None, # a manifest already exists on S3 inside "32-passim-rebuilt-final/passim"
        only_counting=False,
        notes="Patching some information in the passim-rebuilt",
        push_to_git=True,
    )
  2. Addition of data and counts: Once the manifest is instantiated, the main interaction with it is through the add_by_title_year or add_by_ci_id methods (two variants with "replace" instead of "add" also exist, as well as add_count_list_by_title_year, all described in the documentation), which take as input:

    • The media title and year to which the provided counts correspond
    • The counts dict which maps string keys to integer values. Each data stage has its own set of keys to instantiate, which can be obtained through the get_count_keys method or the NewspaperStatistics class. The values corresponding to each key can be computed by the user "by hand" or by using/adapting functions like counts_for_canonical_issue (or counts_for_rebuilt) to the given situation. All such functions can be found in the versioning helpers.py.
      • Note that the count keys will always include at least "content_items_out" and "issues".
    • Example:
    # for all title-year pairs or content-items processed within the task
    
    counts = ... # compute counts for a given title and year of data or content-item
    # e.g. rebuilt counts could be: {"issues": 45, "content_items_out": 9110, "ft_tokens": 1545906}
    
    # add the counts to the manifest
    manifest.add_by_title_year("title_x", "year_y", counts)
    # OR
    manifest.add_by_ci_id("content-item-id_z", counts)
    • Note that it can be useful to only add counts for items or title-year pairs for which it's certain that the processing was successful. For instance, if the resulting output is written to files and uploaded to S3, it would be preferable to add the counts corresponding to each file only once the upload has completed without any exceptions or issues. This ensures the manifest's counts actually reflect the result of the processing.
  3. Computation, validation and export of the manifest: Finally, after all counts have been added to the manifest, its lazy computation can be triggered. This corresponds to a series of processing steps that:

    • compare the provided counts to the ones of previous versions,
    • compute title and corpus-level statistics,
    • serialize the generated manifest to JSON and
    • upload it to S3 (optionally Git).
    • This computation is triggered as follows:
    [...] # instantiate the manifest, and add all counts for processed objects
    
    # To compute the manifest, upload to S3 AND push to GitHub
    manifest.compute(export_to_git_and_s3=True) 
    
    # OR
    
    # To compute the manifest, without exporting it directly
    manifest.compute(export_to_git_and_s3=False)
    # Then one can explore/verify the generated manifest with
    print(manifest.manifest_data)
    # To export it to S3, and optionally push it to Git, once it has ALREADY BEEN GENERATED
    manifest.validate_and_export_manifest(push_to_git=True)  # or push_to_git=False
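
The note in step 2 about adding counts only after a successful upload can be sketched as follows. This is a minimal illustration rather than the library's own code: StubManifest is a stand-in exposing only add_by_title_year, upload_to_s3 is a hypothetical upload function that fails for one file on purpose, and the filenames and counts are made up.

```python
from collections import defaultdict


class StubManifest:
    """Stand-in for DataManifest, recording counts per title and year."""

    def __init__(self):
        self.counts = defaultdict(dict)

    def add_by_title_year(self, title, year, counts):
        self.counts[title][year] = counts


def upload_to_s3(filename):
    """Hypothetical upload step; fails for one file to show the pattern."""
    if "broken" in filename:
        raise OSError(f"upload failed for {filename}")


# (title, year, output file, counts computed for that file); made-up values.
processed = [
    ("title_x", "1900", "title_x-1900.jsonl.bz2",
     {"issues": 45, "content_items_out": 9110}),
    ("title_x", "1901", "title_x-1901-broken.jsonl.bz2",
     {"issues": 50, "content_items_out": 9800}),
]

manifest = StubManifest()
failed = []
for title, year, filename, counts in processed:
    try:
        upload_to_s3(filename)
    except OSError:
        # Upload failed: leave these counts out so the manifest matches S3.
        failed.append(filename)
        continue
    # Only reached when the upload raised no exception.
    manifest.add_by_title_year(title, year, counts)

print(dict(manifest.counts), failed)
```

With a real DataManifest, the loop would be followed by the manifest.compute(...) call shown in step 3, and the list of failed files could be logged or retried separately.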

Versions and version increments

The manifests use semantic versioning, where increments are automatically deduced based on the changes made to the data during a given processing or since the last manifest computation on a bucket. There are two main "modes" in which the manifest computation can be configured:

When the computing of a manifest is launched, the following will take place to determine the version to give to the resulting manifest:

Based on the information that was updated, the version increment varies: