openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Enhancement: Inversion of control scraper base class #195

Open · josephlewis42 opened this issue 2 months ago

josephlewis42 commented 2 months ago

When recently building the DevDocs scraper, I realized there were a ton of things I was relying on @benoit74's expertise for, things that make a scraper sustainable for ZimFarm but aren't core to building a functional scraper. These include:

The scraper still isn't fully there, but I've already spent a lot of time implementing and testing some of these things. I've written Logstash plugins and Elastic Beats (which are more similar to ZIM scrapers), and those were dramatically easier to build because that common logic was abstracted away.

I'd love for something like the following (example only!) to be all I had to build, knowing most of the above would be taken care of:

import argparse
from collections.abc import Iterable
from typing import TypeVar

from zimscraperlib.zim import Creator

# New type ZimMetadata contains the properties to populate config_metadata() on a Zim.
# There are specific types depending on whether the scraper produces a single Zim or
# multiple (in which case it supports placeholders).
# Methods can be overridden for fine-grained control, e.g. to add additional
# formatting parameters.
M = TypeVar('M', bound=ZimMetadata)


class MyScraper(MultiZimScraper):

    # Parent class includes a logger, HTTP client, potentially other items.

    def add_flags(self, parser: argparse.ArgumentParser) -> None:
        '''Add custom flags to the program.'''

    def setup(self, namespace: argparse.Namespace) -> None:
        '''Parse flags and set up resources for execution.

        After this call, MultiZimScraper may have additional internal variables set up,
        e.g. an HTTP client with retries/delay that automatically caches to S3 when
        running in ZimFarm (see the requests sketch below).
        '''

    def list_zims(self) -> Iterable[M]:
        '''Called after setup to list all Zims to be created.'''

    def add_contents(self, creator: Creator, metadata: M) -> None:
        '''Called for each item in list_zims().

        The JSON progress file is updated between calls, logs for progress/next
        ZIM/timing are written, and a scraper check utility could be asserted after.
        '''
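
To make the inversion of control concrete, here is a minimal sketch of what the base class driver loop could look like. This is purely illustrative: MultiZimScraper, ZimMetadata, run(), and _make_creator() are all assumptions, none of them existing python-scraperlib API.

# Illustrative sketch only: a possible MultiZimScraper driver loop.
# None of these names exist in python-scraperlib today.
import argparse
import logging
from abc import ABC, abstractmethod
from collections.abc import Iterable
from typing import Generic, TypeVar

M = TypeVar("M")  # would be bound to the proposed ZimMetadata type


class MultiZimScraper(ABC, Generic[M]):
    """Owns the scraper lifecycle; subclasses only fill in the hooks."""

    def __init__(self) -> None:
        self.logger = logging.getLogger(type(self).__name__)

    @abstractmethod
    def add_flags(self, parser: argparse.ArgumentParser) -> None: ...

    @abstractmethod
    def setup(self, namespace: argparse.Namespace) -> None: ...

    @abstractmethod
    def list_zims(self) -> Iterable[M]: ...

    @abstractmethod
    def add_contents(self, creator, metadata: M) -> None: ...

    def _make_creator(self, metadata: M):
        """Hypothetical hook: build and configure a zimscraperlib Creator
        from the metadata (output path, config_metadata(), ...)."""
        raise NotImplementedError

    def run(self) -> int:
        """Inversion of control: the base class calls back into the
        subclass at well-defined points and handles the cross-cutting
        concerns (flags, logging, progress, checks) itself."""
        parser = argparse.ArgumentParser()
        self.add_flags(parser)
        self.setup(parser.parse_args())
        for metadata in self.list_zims():
            self.logger.info("building ZIM for %s", metadata)
            self.add_contents(self._make_creator(metadata), metadata)
            # this is where the base class would update the JSON progress
            # file, log timing, and run a post-build scraper check
        return 0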

I don't think all scrapers would need to use this format, but something like it would have dramatically cut down on the amount of testing and knowledge needed for me to produce a quality Zim scraper.
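
The setup() docstring above mentions an HTTP client with retries/delay; the retry half of such a client is easy to sketch with requests and urllib3. The S3 caching layer is omitted, and none of this is existing scraperlib behavior, just an illustration of what the base class could hand out.

# Illustrative only: a requests session with retry/backoff of the kind
# the proposed base class could configure for scrapers automatically.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_http_client(user_agent: str) -> requests.Session:
    """Build a session with retries and exponential backoff; an S3-backed
    cache for ZimFarm runs would wrap or replace this (not shown)."""
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    retry = Retry(
        total=5,
        backoff_factor=0.5,  # 0.5s, 1s, 2s, ... between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET", "HEAD"),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session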

rgaudin commented 1 month ago

Indeed, providing a base scraper that captures what is currently a strong convention would speed up new scraper development and reduce maintenance overall.

Needs to be considered. @benoit74 I can't find your ticket suggesting a single-repo mega scraper. Where is it?

benoit74 commented 1 month ago

Thank you @josephlewis42 for the proposal; this definitely needs to be considered, and your points are really valid.

@benoit74 I can't find your ticket suggesting a single-repo mega scraper. Where is it?

The single-repo mega scraper was more a question in the back of my mind and a topic for the hackathon; I never opened an issue about it.

I started much smaller with only merging zimit and warc2zim repos: https://github.com/openzim/zimit/issues/303

That being said, I'm not sure we need to go mono-repo for the suggestion to be implemented. We can provide MultiZimScraper in python-scraperlib and still live with separate repos per scraper.
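
For instance (purely hypothetical module path and names), a scraper repo would then only need something like:

# Hypothetical entry point in a per-scraper repository; the import path
# and MultiZimScraper/run() API are assumptions for illustration.
import sys

from zimscraperlib.scraper import MultiZimScraper  # does not exist today


class DevDocsScraper(MultiZimScraper):
    # add_flags / setup / list_zims / add_contents as in the example above
    ...


def main() -> int:
    return DevDocsScraper().run()


if __name__ == "__main__":
    sys.exit(main())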

On this idea of having shared code, I would even like to consider sharing things like:

So definitely something useful to consider.

Thank you @josephlewis42 for taking the time to write this issue and bringing this subject to the table.