Closed piconti closed 1 month ago
The script is written and ready for the NER and NEL tasks. I will update and modify it for the other processing steps as I meet with each person responsible, and we devise together which stats need to be tracked.
Good. I'm working on the distilled language identification and that would be a simple use case for me.
Great, the logic works both as a callable function and as a CLI. There is also already some documentation on the config information I would need from your side (more will come). If anything is unclear, I can clarify it when we look at the stats for the langid (& other processings).
Given that a natural unit of processing is a Newspaper-Year jsonl.bz2 file, a modification timestamp would be needed at this level. With that, manifests become more interesting; otherwise, just collecting the information from S3 is easier.
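To illustrate what I mean by collecting this from S3 directly: a minimal sketch that filters an S3 listing (e.g. as returned by boto3's `list_objects_v2`, which exposes a `LastModified` timestamp per object) down to the Newspaper-Year files. The key names are hypothetical.

```python
from datetime import datetime, timezone

def year_file_timestamps(objects):
    """Given an S3 listing (a list of dicts with 'Key' and 'LastModified',
    as boto3's list_objects_v2 returns), keep only the Newspaper-Year
    .jsonl.bz2 files and map each key to its modification time."""
    return {
        obj["Key"]: obj["LastModified"]
        for obj in objects
        if obj["Key"].endswith(".jsonl.bz2")
    }

# Hypothetical listing for a langident output bucket:
listing = [
    {"Key": "langident/GDL/GDL-1900.jsonl.bz2",
     "LastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"Key": "langident/GDL/README.md",
     "LastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
print(sorted(year_file_timestamps(listing)))
# only the .jsonl.bz2 key is kept
```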
Another issue is that, from the config file, it seems not to be possible to steer the versioning. I ran the script twice and this is the result: once a v0-0-1 and then a v0-1-0. This is magic that I don't understand.
See https://github.com/impresso/s3-manifest-versioning for a repo that could host all s3 to s3 manifest generation data production.
Hi @simon-clematide, thank you for your comments, that's a good idea!
About the modification dates at the title-year level: when `updated_years` is not empty, only the mentioned years are the ones that were updated in the last processing.

About the versioning:
- If a previous manifest is already present in the `output_bucket` provided, this manifest will be read and updated. Its version will be the basis to identify what the version increment should be, based on the type of modifications.
- If no manifest is present in the `output_bucket` provided, then there are two possibilities:
  - `previous_mft_s3_path` is provided, with the path to a previously computed manifest which is present in another bucket. This manifest is used as the previous one, as described above, to update the data and compute the next version.
  - `previous_mft_s3_path` is not provided; then this is the original manifest for a given data stage, and the version in this case is 0.0.1. This is the case for your first manifest.

You can use the `newspapers` parameter to indicate which data (within the `media_list`) to consider and update. A patch version increment is done when:
- the `only_counting` parameter is set to `True`;
- the `is_patch` or `patched_fields` parameters are set to `True`. The processing or ingestion versioned in this case is a patch, and the `patched_fields` will be updated according to the values provided as parameters.
`only_counting` is new (I implemented it last week after we realised with Maud that there was a need for this specific case scenario), which is why there might have been some confusion, I'm sorry. Your first run created the v0-0-1. Your second run used `only_counting=False` (my bad, I should have explained it to you in more detail on Friday). Hence the code found the previous manifest v0-0-1, identified that no new keys existed in the data, and used the minor version increment accordingly, creating the v0-1-0.
You also noticed that all the `update_type` entries changed from 'addition' to 'modification': with this configuration, the manifest code considers that all the data was recomputed/modified between the two manifest computations. With `only_counting=True`, the code would save the update-type information from one manifest to the next and do a patch version increment, so you would have had v0-0-2. I hope I was able to clarify this confusion. I'm sorry that I have not had the time to properly document the manifest yet, I'll work on it asap. In general, we are adapting and updating the logic as we encounter new situations and case scenarios that call for it, like with the modification date at the year level, and are improving upon it iteratively until it matches our needs as best as possible. :)
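To summarize the increment behaviour discussed above as a rough sketch (the function and its parameters are illustrative, not the actual manifest API; in particular, the assumption that newly added keys trigger a major bump is my inference from the "no new keys existed" remark):

```python
def next_version(previous: str, new_keys: bool,
                 only_counting: bool, is_patch: bool) -> str:
    """Toy illustration of the increment rules: patch bump for
    counting-only or patch runs, minor bump when existing data was
    recomputed/modified, major bump when new keys were added."""
    major, minor, patch = (int(p) for p in previous.lstrip("v").split("-"))
    if only_counting or is_patch:
        return f"v{major}-{minor}-{patch + 1}"
    if new_keys:
        return f"v{major + 1}-0-0"
    return f"v{major}-{minor + 1}-0"

# The second run described above: previous manifest v0-0-1,
# only_counting=False, no new keys -> minor increment.
print(next_version("v0-0-1", new_keys=False,
                   only_counting=False, is_patch=False))
# -> v0-1-0
```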
What happens if I set `only_counting=True`, but the data actually changed?
Ok, thanks for the explanations. I added `only_counting=True`, reran everything, and it worked as expected.
BTW, I also updated the Pipfile in the s3-manifest repo. So a first stable version supporting langident is there.
No problem, happy that it worked out :)
> What happens if I set `only_counting=True`, but the data actually changed?
Then, for the specific entries of the newspaper titles that were modified, the information is updated as it normally would be, and the version is increased as it would have been otherwise (hence minor or major, based on the type of change).
> repo that could host all s3 to s3 manifest generation data production.
Just to be sure, you mean all the manifest generation code, right?
For some of the semantic enrichment processing steps, generating the versioning manifest during the processing itself can prove complicated. However, computing it post-processing, based on the contents of the S3 bucket, would be much simpler and more intuitive. This would be done through a relatively generic script that leverages various aggregating functions based on the type of processing output to version. The first step would be to implement it for NER and NEL.
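As a sketch of that post-processing idea, purely illustrative: one such aggregating function could read a Newspaper-Year `.jsonl.bz2` file and tally simple stats. The `lang` field name is an assumption about the output schema, not the actual one.

```python
import bz2
import json
from collections import Counter

def aggregate_year_stats(path: str) -> Counter:
    """Aggregate simple counts from one Newspaper-Year .jsonl.bz2 file,
    e.g. the number of content items and per-language counts for
    langident output. The 'lang' field is a hypothetical schema field."""
    stats = Counter()
    with bz2.open(path, "rt") as f:
        for line in f:
            record = json.loads(line)
            stats["content_items"] += 1
            stats[f"lang_{record.get('lang', 'unk')}"] += 1
    return stats
```

A generic driver would then dispatch to a function like this per processing type and merge the per-year stats into the manifest.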
This issue entails:
Very related to issue #81.