impresso / impresso-pycommons

Python module with bits of code (objects, functions) highly reusable within impresso.
http://impresso-pycommons.rtfd.io/
GNU Affero General Public License v3.0

Create versioning manifest generation stand-alone script for semantic enrichment steps #83

Closed piconti closed 1 month ago

piconti commented 4 months ago

For some of the semantic enrichment processing steps, generating the versioning manifest during the processing itself can prove complicated. Computing it after processing, based on the contents of the S3 bucket, would be much simpler and more intuitive. This would be done through a relatively generic script that leverages various aggregating functions, chosen according to the type of processing output to version. The first step would be to implement it for NER and NEL.

This issue entails:

Very related to issue #81.
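To make the idea of "aggregating functions based on the type of the processing output" more concrete, here is a minimal sketch. It is not the script itself: the names `AGGREGATORS`, `aggregate_entities` and `compute_stats`, the record shape (a dict with an `entities` list) and the tracked stats are all hypothetical, chosen only for illustration. Reading the records from S3 is left out on purpose.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical record shape: one dict per processed content item,
# with an optional "entities" list. The real outputs may differ.
Record = dict


def aggregate_entities(records: Iterable[Record]) -> Dict[str, int]:
    """Count content items and entity mentions for one title-year."""
    stats = {"content_items": 0, "entity_mentions": 0}
    for record in records:
        stats["content_items"] += 1
        stats["entity_mentions"] += len(record.get("entities", []))
    return stats


# One aggregating function per processing type (assumed mapping).
AGGREGATORS: Dict[str, Callable[[Iterable[Record]], Dict[str, int]]] = {
    "ner": aggregate_entities,
    "nel": aggregate_entities,
}


def compute_stats(
    stage: str, records_by_title_year: Dict[str, List[Record]]
) -> Dict[str, Dict[str, int]]:
    """Apply the aggregator registered for `stage` to every title-year."""
    aggregate = AGGREGATORS[stage]
    return {
        title_year: aggregate(records)
        for title_year, records in records_by_title_year.items()
    }
```

Supporting another processing step would then mostly mean registering one more function in the mapping, once the stats to track for it have been agreed on.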

piconti commented 4 months ago

The script is written and ready for the NER and NEL tasks. I will adapt it for the other processing steps as I meet with each person responsible and we decide together which stats need to be tracked.

simon-clematide commented 4 months ago

Good. I'm working on the distilled language identification and that would be a simple use case for me.

piconti commented 4 months ago

Great, the logic works both as a callable function and as a CLI. There is also already some documentation on the config information I would need from you (more will come). If anything is unclear, I can clarify it when we look at the stats for the langid (and other processing steps).

simon-clematide commented 3 months ago

Given that a natural unit of processing is a Newspaper-Year jsonl.bz2 file, a modification timestamp would be needed at this level. With that, manifests become more interesting. Otherwise, just collecting the information from S3 is easier.
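As an illustration of what a modification timestamp per Newspaper-Year could look like when derived from an S3 listing, here is a rough sketch. It assumes entries shaped like the `Contents` items returned by boto3's `list_objects_v2` (dicts with `Key` and `LastModified`), and it assumes keys end in `<TITLE>-<YEAR>.jsonl.bz2`; the function name and the key layout are hypothetical.

```python
import re
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Assumed key layout: "<any/prefix>/<TITLE>-<YEAR>.jsonl.bz2".
KEY_PATTERN = re.compile(r"(?P<title>[^/]+)-(?P<year>\d{4})\.jsonl\.bz2$")


def last_modified_per_title_year(
    objects: Iterable[dict],
) -> Dict[Tuple[str, int], datetime]:
    """Return the most recent LastModified seen for each (title, year)."""
    latest: Dict[Tuple[str, int], datetime] = {}
    for obj in objects:
        match = KEY_PATTERN.search(obj["Key"])
        if match is None:
            continue  # not a title-year file, skip it
        unit = (match["title"], int(match["year"]))
        modified = obj["LastModified"]
        if unit not in latest or modified > latest[unit]:
            latest[unit] = modified
    return latest
```

The function only needs the listing, not the file contents, so it stays cheap even for large buckets.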

Another issue is that it does not seem possible to steer the versioning from the config file. I ran the script twice and this is the result: first a v0-0-1 and then a v0-1-0. This is magic that I don't understand.

simon-clematide commented 3 months ago

See https://github.com/impresso/s3-manifest-versioning for a repo that could host all s3 to s3 manifest generation data production.

piconti commented 3 months ago

Hi @simon-clematide, thank you for your comments, that's a good idea!

About the modification dates at the title-year level:

About the versioning:

I hope I was able to clarify this confusion. I'm sorry that I have not had the time to properly document the manifest yet, I'll work on it asap. In general, we are adapting and updating the logic as we encounter new situations and use cases that call for it, like with the modification date at the year level, and we are improving it iteratively until it matches our needs as well as possible. :)

simon-clematide commented 3 months ago

What happens if I set only_counting=True, but the data actually changed?

simon-clematide commented 3 months ago

Ok, thanks for the explanations. I added only_counting=True, reran everything and it worked as expected.

BTW, I also updated the Pipfile in the s3-manifest repo. So a first stable version supporting langident is there.

piconti commented 3 months ago

No problem, happy that it worked out :)

What happens if I set only_counting=True, but the data actually changed?

Then, for the specific entries of the newspaper titles that were modified, the information is updated as it normally would be, and the version is increased as it would have been without only_counting (hence a minor or major increment, based on the type of change).
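The behaviour described above can be sketched roughly as follows. This is an illustration, not the actual implementation: the `vMAJOR-MINOR-PATCH` format is taken from the versions seen earlier in this thread (v0-0-1, v0-1-0), but the function names, the `major_change` flag and the assumption that an unchanged rerun with only_counting=True gives a patch increment are all hypothetical.

```python
import re

# Version strings as seen in the thread, e.g. "v0-0-1" or "v0-1-0".
VERSION_PATTERN = re.compile(r"^v(\d+)-(\d+)-(\d+)$")


def bump(version: str, level: str) -> str:
    """Increment one component and reset the lower ones."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        raise ValueError(f"unexpected version format: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if level == "major":
        return f"v{major + 1}-0-0"
    if level == "minor":
        return f"v{major}-{minor + 1}-0"
    if level == "patch":
        return f"v{major}-{minor}-{patch + 1}"
    raise ValueError(f"unknown level: {level!r}")


def next_version(
    previous: str,
    only_counting: bool,
    data_changed: bool,
    major_change: bool = False,  # hypothetical flag for the "type of change"
) -> str:
    """Pick the increment; only_counting never hides a real data change."""
    if only_counting and not data_changed:
        return bump(previous, "patch")  # assumption: a pure recount is a patch
    return bump(previous, "major" if major_change else "minor")
```

Under these assumptions, `next_version("v0-0-1", only_counting=False, data_changed=False)` gives `v0-1-0`, which matches the two runs reported earlier, and setting only_counting=True has no effect on the result once `data_changed` is true.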

repo that could host all s3 to s3 manifest generation data production.

Just to be sure, you mean all the manifest generation code, right?