openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
18 stars 16 forks source link

Enforce mandatory metadata #95

Closed rgaudin closed 1 year ago

rgaudin commented 1 year ago

Our Creator wrapper should make sure that we do supply valid (albeit possibly useless) values for the mandatory metadata required by the spec.

This would mean:

rgaudin commented 1 year ago

We've decided that in all of scraperlib's current use cases, we know all metadata before starting the Creator. So we'll raise an exception if any mandatory metadata is not set at start() time. We can revisit this behavior in the future.

We could maybe have an extra method that sets empty or random data for debug/test

FledgeXu commented 1 year ago

I would like to work on this, but I'm kind of confused about API designs. Does it mean that we need to add a determination in the code below that determines that all the mandatory metadata is not null?https://github.com/openzim/python-scraperlib/blob/e67f0145c5879708508a34a784f086d741fdbdd0/src/zimscraperlib/zim/creator.py#L117-L126

rgaudin commented 1 year ago

Yes. We also need to keep a reference to all metadata that has been set (overloading add_metadata for instance)

FledgeXu commented 1 year ago

As I was trying to solve this problem, I realized that If we want to solve this problem and several other issues(#87, #94) we need to adjust the design of the API of Creator. Maybe we can introduce a class named Metadata to do jobs like:

FledgeXu commented 1 year ago

And I'm not sure which level should we add it, python-libzim or python-scraperlib or libzim.

rgaudin commented 1 year ago

And I'm not sure which level should we add it, python-libzim or python-scraperlib or libzim.

scraperlib

On implementation, keep in mind that what matters most is comfort for people writing scrapers and/or creating ZIMs via Python.

You can pick another ticket if this one is not defined enough. It's actually mostly about API design and should not be considered a good first issue.

rgaudin commented 1 year ago

See https://github.com/openzim/python-scraperlib/pull/96 comments for updated description

kelson42 commented 1 year ago

https://github.com/openzim/libzim/issues/785 Might be of interest for this ticket.

rgaudin commented 1 year ago

Rah we forgot to attach this to the second PR. This has been implemented in 3.0.0.

Thanks for the pointer ; we'll see if and how it gets implemented but chances are we keep a separate logic because we have a lot more flexibility (we test image format and size for instance) and we go beyond the spec by applying the kiwix recommendations