Closed piconti closed 1 month ago
The script is written and ready for the NER and NEL tasks. I will update and modify it for the other processing steps as I meet with each person responsible, and we devise together which stats need to be tracked.
Good. I'm working on the distilled language identification and that would be a simple use case for me.
Great, the logic works both as a callable function and as a CLI. There is also already some documentation on the config information I would need from your side (more will come). If anything is unclear, I can clarify it when we look at the stats for the langid (& other processings).
Given that a natural unit of processing is a Newspaper-Year jsonl.bz2 file, a modification timestamp would be needed at this level. With that, manifests become more interesting; otherwise, just collecting the information from S3 is easier.
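To illustrate what I mean by collecting this from S3 directly: a minimal sketch that filters an S3 listing (e.g. as returned by boto3's `list_objects_v2`, which exposes a `LastModified` timestamp per object) down to the Newspaper-Year files. The key names are hypothetical.

```python
from datetime import datetime, timezone

def year_file_timestamps(objects):
    """Given an S3 listing (a list of dicts with 'Key' and 'LastModified',
    as boto3's list_objects_v2 returns), keep only the Newspaper-Year
    .jsonl.bz2 files and map each key to its modification time."""
    return {
        obj["Key"]: obj["LastModified"]
        for obj in objects
        if obj["Key"].endswith(".jsonl.bz2")
    }

# Hypothetical listing for a langident output bucket:
listing = [
    {"Key": "langident/GDL/GDL-1900.jsonl.bz2",
     "LastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"Key": "langident/GDL/README.md",
     "LastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
print(sorted(year_file_timestamps(listing)))
# only the .jsonl.bz2 key is kept
```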
Another issue is that, from the config file, it seems not to be possible to steer the versioning. I ran the script twice and this is the result: once a v0-0-1 and then a v0-1-0. This is magic that I don't understand.
See https://github.com/impresso/s3-manifest-versioning for a repo that could host all s3 to s3 manifest generation data production.
Hi @simon-clematide, thank you for your comments, that's a good idea!
About the modification dates at the title-year level: when `updated_years` is not empty, only the mentioned years are the ones that were updated in the last processing.

About the versioning:
- If a previous manifest is already present in the `output_bucket` provided, this manifest will be read and updated. Its version will be the basis to identify what the version increment should be, based on the type of modifications.
- If no manifest is present in the `output_bucket` provided, then there are two possibilities:
  - `previous_mft_s3_path` is provided, with the path to a previously computed manifest which is present in another bucket. This manifest is used as the previous one, as described above, to update the data and compute the next version.
  - `previous_mft_s3_path` is not provided; then this is the original manifest for a given data stage, and the version in this case is 0.0.1. This is the case for your first manifest.

You can use the `newspapers` parameter to indicate which data (within the `media_list`) to consider and update. A patch version increment is done when:
- the `only_counting` parameter is set to `True`;
- the `is_patch` or `patched_fields` parameters are set to `True`. The processing or ingestion versioned in this case is a patch, and the `patched_fields` will be updated according to the values provided as parameters.
`only_counting` is new (I implemented it last week after we realised with Maud that there was a need for this specific case scenario), which is why there might have been some confusion, I'm sorry. Your first run created the v0-0-1. Your second run used `only_counting=False` (my bad, I should have explained it to you in more detail on Friday). Hence the code found the previous manifest v0-0-1, identified that no new keys existed in the data, and used the minor version increment accordingly, creating the v0-1-0.
You also noticed that all the `update_type` entries changed from 'addition' to 'modification': with this configuration, the manifest code considers that all the data was recomputed/modified between the two manifest computations. With `only_counting=True`, the code would save the update-type information from one manifest to the next and do a patch version increment, so you would have had v0-0-2. I hope I was able to clarify this confusion. I'm sorry that I have not had the time to properly document the manifest yet, I'll work on it asap. In general, we are adapting and updating the logic as we encounter new situations and case scenarios that call for it, like with the modification date at the year level, and are improving upon it iteratively until it matches our needs as best as possible. :)
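To summarize the increment behaviour discussed above as a rough sketch (the function and its parameters are illustrative, not the actual manifest API; in particular, the assumption that newly added keys trigger a major bump is my inference from the "no new keys existed" remark):

```python
def next_version(previous: str, new_keys: bool,
                 only_counting: bool, is_patch: bool) -> str:
    """Toy illustration of the increment rules: patch bump for
    counting-only or patch runs, minor bump when existing data was
    recomputed/modified, major bump when new keys were added."""
    major, minor, patch = (int(p) for p in previous.lstrip("v").split("-"))
    if only_counting or is_patch:
        return f"v{major}-{minor}-{patch + 1}"
    if new_keys:
        return f"v{major + 1}-0-0"
    return f"v{major}-{minor + 1}-0"

# The second run described above: previous manifest v0-0-1,
# only_counting=False, no new keys -> minor increment.
print(next_version("v0-0-1", new_keys=False,
                   only_counting=False, is_patch=False))
# -> v0-1-0
```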
What happens if I set `only_counting=True`, but the data actually changed?
Ok, thanks for the explanations. I added `only_counting=True`, reran everything, and it worked as expected.
BTW, I also updated the Pipfile in the s3-manifest repo. So a first stable version supporting langident is there.
No problem, happy that it worked out :)
> What happens if I set `only_counting=True`, but the data actually changed?
Then, for the specific entries of the newspaper titles that were modified, the information is updated as it normally would be, and the version is increased as it would have been otherwise (hence minor or major, based on the type of change).
> repo that could host all s3 to s3 manifest generation data production.
Just to be sure, you mean all the manifest generation code, right?
For some of the semantic enrichment processing steps, generating the versioning manifest during the processing itself can prove complicated. However, computing it post-processing, based on the contents of the S3 bucket, would be much simpler and more intuitive. This would be done through a relatively generic script that leverages various aggregating functions based on the type of processing output to version. The first step would be to implement it for NER and NEL.
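As a sketch of that post-processing idea, purely illustrative: one such aggregating function could read a Newspaper-Year `.jsonl.bz2` file and tally simple stats. The `lang` field name is an assumption about the output schema, not the actual one.

```python
import bz2
import json
from collections import Counter

def aggregate_year_stats(path: str) -> Counter:
    """Aggregate simple counts from one Newspaper-Year .jsonl.bz2 file,
    e.g. the number of content items and per-language counts for
    langident output. The 'lang' field is a hypothetical schema field."""
    stats = Counter()
    with bz2.open(path, "rt") as f:
        for line in f:
            record = json.loads(line)
            stats["content_items"] += 1
            stats[f"lang_{record.get('lang', 'unk')}"] += 1
    return stats
```

A generic driver would then dispatch to a function like this per processing type and merge the per-year stats into the manifest.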
This issue entails:
Very related to issue #81.