openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
18 stars 16 forks source link

Introduce the `MetadataDict` #96

Closed FledgeXu closed 1 year ago

FledgeXu commented 1 year ago

The goal this is PR is trying to fix three issues: #95, #87, #94 This PR includes three parts of the change:

  1. Add the MetadataDict class to provide some convenient methods.
  2. Modify the Creator class to use the MetadataDict
  3. Remove update_metadata, and adjust the test case.
rgaudin commented 1 year ago

Busy with something now but I'll send later a suggestion of API design/changes.

FledgeXu commented 1 year ago

Before I fix the set of problems mentioned in the Review, I would like to clarify the requirements.

  1. we don't need to set values for all Mandatory Metadata
  2. we want to throw an Exception if a Mandatory Metadata has a value that is not set
  3. we want to check whether the Date and Language keys are valid

Have I misunderstood?

FledgeXu commented 1 year ago

After refactoring, I currently limit the functionality of this PR to implementing #95. Introducing this PR will cause test_noindexlanguage to fail in existing unit tests. All other unit tests have passed and the unit test coverage is 100% Also should fix all formatting issues and also add to Docstring.

rgaudin commented 1 year ago

Here are my thoughts on what the API should look like. It goes a bit past the scope of those issues but it's a good opportunity

# sample usage
creator = Creator(fpath).config_metadata(Language="eng", ... Supplier="Myself")
creator.start()
# adding other text or non-text metadata after start
creator.add_metadata("cover", mimetype="image/jpeg", content=cover.read())
# using the dev shortcut
with Creator(fpath).config_dev_metadata(Name="test01") as zim:
   ...

The config_metadata signature could be like

def config_metadata(
        self,
        *,
        Language: str,
        Title: str,
        Description: str,
        LongDescription: Optional[str] = None,
        Creator: str,
        Publisher: str,
        Date: Union[datetime.datetime, datetime.date, str],
        Illustration_48x48_at_1: bytes,
        Tags: Optional[Union[Iterable[str], str]] = None,
        Scraper: Optional[str] = None,
        Flavour: Optional[str] = None,
        Source: Optional[str] = None,
        License: Optional[str] = None,
        Relation: Optional[str] = None,
        **extras,  # would be only text metadata
    ) -> Self:
        return self
        ...

def config_debug_metadata(self, **metadata) -> Self:
  # sets all the mandatory metadata to dummy values like "Dev Title" or a blank PNG
  # passes the one requested and any additional
   ...

Let me know what you think. It's clear to me but let me know.

I like that with this, the metadata is self documented with required and standard ones exposed.

FledgeXu commented 1 year ago

Thanks for your feedback, I like this idea! I think that moving the config_metadata to a separate class like CreatorConfig might be a good idea. Under the current design, if we want to create a Creator class with with would be like this:

with Creator(fpath).config(
    Language="en",
    Title="placeholder",
    Description="placeholder",
    Creator="placeholder",
    Publisher="Publisher",
    Date="1970-01-01",
) as zim:
    pass

It looks kind tedious as a with statement. However, if we move the this part to the CreatorConfig, it will look like this:

creator_config = CreatorConfig(fpath).config(
    Language='en',
    Title='placeholder',
    Description='placeholder',
    Creator='placeholder',
    Publisher='Publisher',
    Date='1970-01-01',
    )
with Creator(creator_config) as zim:
    pass

It's much clear, IMO.

And under this design, we can also keep the update_metadata in the configuration stage, like this:

creator_config = CreatorConfig(fpath).config(
    Language='en',
    Title='placeholder',
    Description='placeholder',
    Creator='placeholder',
    Publisher='Publisher',
    Date='1970-01-01',
    )

creator_config.update_metadata("Language", mimetype="text/plain", content="fra")
# Also we can have other functions here:
creator_config.delate_metadata("cover)

with Creator(creator_config) as zim:
    pass
FledgeXu commented 1 year ago

And we may be better off closing this PR and moving the discussion to the issue.

rgaudin commented 1 year ago

@FledgeXu thanks for you comments. Beware that I wrote config_metadata() and not config(). We already have config_*() methods on the Creator (from pylibzim).

As for the tediousness of the API, I think it's similar to what you propose

creator = Creator(fpath).config_metadata(
    Language='en',
    Title='placeholder',
    Description='placeholder',
    Creator='placeholder',
    Publisher='Publisher',
    Date='1970-01-01',
    )
with creator as zim:
    pass

As for the implementation in a class, that might be easier to maintain ; but that should be a mixin for the Creator. You may want to check that it works fine as Creator already inherits from Cython one and I'm not sure how well that would work.

Feel free to close this PR if you want to start another one from a clean state.