openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
19 stars 16 forks source link

Fix handling of tag list in Creator.add_metadata (fix #125) #126

Closed IMayBeABitShy closed 7 months ago

IMayBeABitShy commented 8 months ago

Fixes #125

Previously, the type annotations of Creator.config_metadata() allowed "Tags" to be a list of strings, but didn't handle the conversion to a string, which resulted in python-libzim raising an error.

This commit modifies Creator.add_metadata() to accept a list of strings and, if the key of the metadata is "Tags", join them into a single string. A regression test is included. The metadata validation seems have already been written with a tag list in mind (though there is no check that a single tag in such a list should not contain ";", not sure if this is intended).

Note: the join has been implemented in Creator.add_metadata, but the conversion of datetime.datetime for the "Date" metadata key happens in python-libzim. I think splitting the conversion logic between those two libraries is suboptimal, we should consider implementing the tag list conversion in python-libzim instead.

rgaudin commented 8 months ago

Thank you ; it looks good (except maybe the comments). @benoit74 will review but merge might be delayed a bit as he is in the process of changing the setup of this repo so it may be easier for him to merge his stuff first.

benoit74 commented 7 months ago

Note: the join has been implemented in Creator.add_metadata, but the conversion of datetime.datetime for the "Date" metadata key happens in python-libzim. I think splitting the conversion logic between those two libraries is suboptimal, we should consider implementing the tag list conversion in python-libzim instead.

@rgaudin any views on this?

rgaudin commented 7 months ago

Note: the join has been implemented in Creator.add_metadata, but the conversion of datetime.datetime for the "Date" metadata key happens in python-libzim. I think splitting the conversion logic between those two libraries is suboptimal, we should consider implementing the tag list conversion in python-libzim instead.

@rgaudin any views on this?

Sure bothers me as well. I'd prefer we implement the date/datetime conversion here as well and eventually (needs a major release) remove the conversion there as pylibzim is just a wrapper around the C libzim.

benoit74 commented 7 months ago

Sure bothers me as well. I'd prefer we implement the date/datetime conversion here as well and eventually (needs a major release) remove the conversion there as pylibzim is just a wrapper around the C libzim.

Issue is opened in python-libzim: https://github.com/openzim/python-libzim/issues/188

IMayBeABitShy commented 7 months ago

I've removed the comments as requested. I've also added the datetime conversion here so all metadata value conversion happens in python-scraperlib.

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (cfee792) 100.00% compared to head (10ca3fa) 100.00%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #126 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 32 32 Lines 1332 1337 +5 Branches 225 227 +2 ========================================= + Hits 1332 1337 +5 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

benoit74 commented 7 months ago

I rebased your changes and it made me realize there was a typing issue + we can mutualise your new test with the existing one, this is the intent of my last commit.

As mentioned in this last commit, I had to ignore the typing issue, it was hard to solve without TypeGuards (3.10) or would necessitate a significant code change. We can live with it for now.

benoit74 commented 7 months ago

@rgaudin would you mind to review the last commit please, I'm not sure you will agree with my suggestion to ignore the typing issue for now.

rgaudin commented 7 months ago

Why not be safe with

if name == "Tags" and not isinstance(content, str) and isinstance(content, collections.abc.Iterable):
    content = ";".join(str(tag) for tag in content)
benoit74 commented 7 months ago

Thank you very much! I chose to be even safer with

if (
            name == "Tags"
            and not isinstance(content, str)
            and not isinstance(content, bytes)
            and isinstance(content, collections.abc.Iterable)
        ):
            content = ";".join(content)

Just in case some day we begin to accept bytes ... don't see why but who knows.