Python module with bits of code (objects, functions) that are highly reusable within the impresso project.
Please refer to the documentation for further information on this library.
With pip:

```shell
pip install impresso-commons
```
The library supports configuration of s3 credentials via project-specific local .env files.
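For instance, such a local .env file could look as follows (the variable names below are an assumption and should be checked against the library's S3 utilities):

```
# .env: example S3 credentials (hypothetical variable names)
SE_ACCESS_KEY=<your-access-key>
SE_SECRET_KEY=<your-secret-key>
```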
The second project 'impresso - Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio' is funded by the Swiss National Science Foundation (SNSF) under grant number CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Aiming to develop and consolidate tools to process and explore large-scale collections of historical newspapers and radio archives, and to study the impact of this tooling on historical research practices, Impresso II builds upon the first project – 'impresso - Media Monitoring of the Past' (grant number CRSII5_173719, Sinergia program). More information at https://impresso-project.ch.
Copyright (C) 2024 The impresso team (contributors to this program: Matteo Romanello, Maud Ehrmann, Alex Flückinger, Edoardo Tarek Hölzl, Pauline Conti).
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Affero General Public License for more details.
The versioning package of impresso_commons contains several modules and scripts that allow versioning Impresso's data at various stages of the processing pipeline. The main goal of this approach is to version the data and track information at every stage.
Impresso's data processing pipeline is organised in three main data "meta-stages", mirroring the main processing steps. During each of these meta-stages, different formats of data are created as outputs of processes and in turn used as inputs to other downstream tasks.
TODO: Update/finalize the exact list of stages once every stage has been included.
The versioning, which aims to document the data at each step through versions and statistics, is implemented with manifest files: JSON files which follow a specific schema. (TODO: update JSON schema with yearly modification date.)
After each processing step, a manifest should be created documenting the changes made to the data resulting from that processing. It can also be created on the fly during a processing, or in between processing steps to count and sanity-check the contents of a given S3 bucket. Once created, the manifest file is automatically uploaded to the S3 bucket corresponding to the data it was computed on, and optionally pushed to the impresso-data-release repository to keep track of all changes made throughout the versions.
The compute_manifest.py script

The script compute_manifest.py allows one to compute a manifest on the data present within a specific S3 bucket. The CLI for this script is the following:

```shell
python compute_manifest.py --config-file=<cf> --log-file=<lf> [--scheduler=<sch> --nworkers=<nw> --verbose]
```
Here, the config_file should be a simple JSON file with specific arguments, all described here.
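As a purely illustrative sketch (the exact keys must be taken from the configuration description referenced above; the ones below are assumptions mirroring the DataManifest arguments), such a configuration file could resemble:

```json
{
    "data_stage": "rebuilt",
    "output_bucket": "22-rebuilt-final",
    "input_bucket": "21-canonical-final",
    "git_repository": "/local/path/to/impresso-pycommons",
    "is_patch": false,
    "only_counting": false,
    "push_to_git": true,
    "notes": "Re-running the rebuilt on a subset of titles"
}
```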
The nworkers option can be used to specify any desired number of workers, and a specific scheduler can be provided via the scheduler parameter.

It's also possible to compute a manifest on the fly during a process. This method is more suitable in particular when the output of the process is not stored on S3, e.g. for data indexation. To do so, some simple modifications should be made to the process's code:
Instantiation of a DataManifest object: the DataManifest class holds all methods and attributes necessary to generate a manifest. It takes a relatively large number of input arguments (most of which are optional), which allow a precise specification and configuration, and ease all other interactions with the instantiated manifest object. All of them are also described in the manifest configuration:
```python
manifest = DataManifest(
    data_stage="passim",  # DataStage.PASSIM also accepted
    s3_output_bucket="32-passim-rebuilt-final/passim",  # includes partition within bucket
    s3_input_bucket="22-rebuilt-final",  # includes partition within bucket
    git_repo="/local/path/to/impresso-pycommons",
    temp_dir="/local/path/to/git_temp_folder",
    staging=False,  # if True, will be pushed to the 'staging' branch of impresso-data-release, else 'master'
    is_patch=True,
    patched_fields=["series", "id"],  # example of modified fields in the passim-rebuilt schema
    previous_mft_path=None,  # a manifest already exists on S3 inside "32-passim-rebuilt-final/passim"
    only_counting=False,
    notes="Patching some information in the passim-rebuilt",
    push_to_git=True,
)
```
Addition of data and counts: once the manifest is instantiated, the main interaction with the manifest object will be through the add_by_title_year or add_by_ci_id methods (two variants using "replace" instead of "add" also exist, as well as add_count_list_by_title_year, all described in the documentation), which take counts as input. The count keys to use can be obtained with the get_count_keys method or from the NewspaperStatistics class; in particular, they include "content_items_out" and "issues". The values corresponding to each key can be computed by the user "by hand", or by using/adapting functions like counts_for_canonical_issue (or counts_for_rebuilt) to the given situation. All such functions can be found in the versioning helpers.py.

```python
# for all title-year pairs or content-items processed within the task
counts = ...  # compute counts for a given title and year of data, or content-item
# e.g. rebuilt counts could be: {"issues": 45, "content_items_out": 9110, "ft_tokens": 1545906}

# add the counts to the manifest
manifest.add_by_title_year("title_x", "year_y", counts)
# OR
manifest.add_by_ci_id("content-item-id_z", counts)
```
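As an example of computing such counts "by hand", counts can be aggregated per title-year pair from processed records before being added to the manifest. This is a minimal sketch: the record format, the ID parsing, and the sample IDs below are illustrative assumptions, not the library's data model.

```python
from collections import defaultdict

# Illustrative records: content-item IDs assumed to follow a "<title>-<year>-..." pattern
records = [
    {"id": "GDL-1900-01-02-a-i0001", "ft": "first article text"},
    {"id": "GDL-1900-01-02-a-i0002", "ft": "second article text here"},
    {"id": "GDL-1901-03-04-a-i0001", "ft": "another year"},
]

# Aggregate, per (title, year), a subset of the count keys used above
counts = defaultdict(lambda: {"content_items_out": 0, "ft_tokens": 0})
for rec in records:
    title, year = rec["id"].split("-")[:2]
    stats = counts[(title, year)]
    stats["content_items_out"] += 1
    stats["ft_tokens"] += len(rec["ft"].split())

for (title, year), c in sorted(counts.items()):
    print(title, year, c)
    # each pair would then be added with manifest.add_by_title_year(title, year, c)
```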
Computation, validation and export of the manifest: finally, after all counts have been added to the manifest, its lazy computation can be triggered, which corresponds to a series of processing steps.
```python
[...]  # instantiate the manifest, and add all counts for processed objects

# To compute the manifest, upload it to S3 AND push it to GitHub
manifest.compute(export_to_git_and_s3=True)

# OR

# To compute the manifest without exporting it directly
manifest.compute(export_to_git_and_s3=False)
# Then one can explore/verify the generated manifest with
print(manifest.manifest_data)
# To export it to S3, and optionally push it to Git, once it has ALREADY BEEN GENERATED
manifest.validate_and_export_manifest(push_to_git=[True or False])
```
The manifests use semantic versioning, where increments are automatically deduced based on the changes made to the data during a given processing or since the last manifest computation on a bucket. There are two main "modes" in which the manifest computation can be configured:

- Documenting a processing or modification of the data (only_counting=False): the regular mode, where the version and statistics are updated to reflect the modifications made to the data.
- Only counting the contents of a bucket (only_counting=True): a mode meant to count and sanity-check the contents of a bucket without documenting actual modifications (the "last_modification_date" field of the manifest, associated with the statistics, stays the same for any title for which no changes were identified).

When the computation of a manifest is launched, the following takes place to determine the version to give to the resulting manifest:
- If a manifest already exists in the output_bucket provided, this manifest will be read and updated. Its version will be the basis to identify what the version increment should be, based on the type of modifications.
- If no manifest exists in the output_bucket provided, there are two possibilities:
  - previous_mft_s3_path is provided, with the path to a previously computed manifest present in another bucket. This manifest is used as the previous one, as described above, to update the data and compute the next version.
  - previous_mft_s3_path is not provided: this is then the original manifest for the given data stage, and the version in this case is 0.0.1. This is the case for your first manifest.

Based on the information that was updated, the version increment varies:
- Minor version increment when existing data was modified or re-processed; the newspapers parameter can be used to indicate which data (within the media_list) to consider and update.
- Patch version increment when the is_patch or patched_fields parameters are set to True. The processing or ingestion versioned in this case is a patch, and the patched fields will be updated according to the values provided as parameters.
- Patch version increment when the only_counting parameter is set to True.
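The increments themselves follow standard semantic-versioning behaviour, which can be sketched as follows (an illustration of the versioning scheme, not the library's actual implementation):

```python
def increment_version(version: str, level: str) -> str:
    """Increment a 'major.minor.patch' version string at the given level."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown increment level: {level}")

print(increment_version("0.0.1", "patch"))  # → 0.0.2
print(increment_version("0.0.1", "minor"))  # → 0.1.0
print(increment_version("0.1.4", "major"))  # → 1.0.0
```

Note that lower-order components reset to zero on a higher-order increment, so a first manifest at 0.0.1 followed by a partial re-processing would move to 0.1.0.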