openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

Use zimscraperlib to build zim #107

Closed: dattaz closed this issue 4 years ago

dattaz commented 4 years ago

Currently, we have our own code to interface with zimwriterfs and build the ZIM file after generating all the content. Code: https://github.com/openzim/sotoki/blob/master/sotoki/sotoki.py#L1176:L1312

We should use our new https://github.com/openzim/python_scraperlib instead.

satyamtg commented 4 years ago

Hey @dattaz, I'm working on it. Also, there are some other things that we could include in scraperlib, as this scraper (in my opinion) contains a lot of shareable code.

rgaudin commented 4 years ago

@satyamtg, sure there are! Yet for easier review and a quicker merge, let's do it piece by piece.

satyamtg commented 4 years ago

Yup, sure. I actually meant: let's solve this issue now and, in the near future, shift to scraperlib piece by piece.

satyamtg commented 4 years ago

@dattaz @rgaudin, it seems that the ZimInfo class in zimscraperlib.zim doesn't have enough attributes to be used for ZIM creation in sotoki. sotoki passes several extra zimwriterfs arguments, such as --inflateHtml, --redirects, --flavour, and --source. Currently, there's no way to generate these extra arguments, even if we add the attributes to ZimInfo using the update() method: to_zimwriterfs_args() only converts the attributes that are defined in the class. I would suggest that we implement, in zimscraperlib.zim, the conversion of freshly added extra attributes into zimwriterfs arguments, as some scrapers will need extra arguments. We can have a basic set of required arguments, but at some point we will need to become flexible.
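
To make the limitation described above concrete, here is a minimal sketch. It assumes a simplified stand-in class: `LimitedZimInfo` and its `KNOWN_FIELDS` tuple are hypothetical names, not the actual zimscraperlib code, and only mimic the behaviour reported in this comment.

```python
class LimitedZimInfo:
    """Hypothetical stand-in, not the real zimscraperlib.zim.ZimInfo."""

    # Assumed fixed list of attributes the class knows how to convert.
    KNOWN_FIELDS = ("language", "title", "description", "creator", "publisher")

    def __init__(self, **kwargs):
        for field in self.KNOWN_FIELDS:
            setattr(self, field, "")
        self.update(**kwargs)

    def update(self, **kwargs):
        # Any attribute is accepted, including ones outside KNOWN_FIELDS.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def to_zimwriterfs_args(self):
        # Only the predefined fields become "--flag value" pairs.
        args = []
        for field in self.KNOWN_FIELDS:
            args.extend([f"--{field}", getattr(self, field)])
        return args


info = LimitedZimInfo(language="eng", title="Example")
info.update(source="https://example.stackexchange.com", flavour="nopic")
args = info.to_zimwriterfs_args()
print(args)
```

In this sketch, `source` and `flavour` are stored on the object but never reach the argument list, which is the gap being reported.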

rgaudin commented 4 years ago

Hey, thanks for looking into this. Before you invest too much time in this, it's important to state that zimwriterfs is the historical way of writing ZIM files but definitely not the future.

Yesterday, a first PR was sent to the python-libzim repo, so this is moving fast, and the first candidate for zimwriterfs->binding is sotoki (and that would be done by the ones writing the binding).

So, I suggest we close that particular issue as clearly this will be replaced pretty soon.

That said, we probably want to support zimwriterfs in the zimscraperlib anyway so we might want to extend it. I'm not in favor of passing arbitrary parameters to a command line tool so let's just extend it to support the missing params.

Data that goes into the ZIM should be added to ZimInfo (source, flavour) and options that are passed to zimwriterfs should be a param to to_zimwriterfs_args().

    -w, --welcome           path of default/main HTML page. The path must be relative to HTML_DIRECTORY.
    -f, --favicon           path of ZIM file favicon. The path must be relative to HTML_DIRECTORY and the image a 48x48 PNG.
    -l, --language          language code of the content in ISO639-3
    -t, --title             title of the ZIM file
    -d, --description       short description of the content
    -c, --creator           creator(s) of the content
    -p, --publisher         creator of the ZIM file itself

    HTML_DIRECTORY          path of the directory containing the HTML pages you want to put in the ZIM file.
    ZIM_FILE                path of the ZIM file you want to obtain.

    Optional arguments:
    -v, --verbose           print processing details on STDOUT
    -h, --help              print this help
    -V, --version           print the version number
    -m, --minChunkSize      number of bytes per ZIM cluster (default: 2048)

    -x, --inflateHtml       try to inflate HTML files before packing (*.html, *.htm, ...)

    -u, --uniqueNamespace   put everything in the same namespace 'A'. Might be necessary to avoid problems with dynamic/javascript data loading.
    -r, --redirects         path to a TSV file containing a list of redirects (namespace url title target_url).
    -j, --withoutFTIndex    don't create and add a fulltext index of the content to the ZIM.
    -a, --tags              tags - semicolon separated
    -e, --source            content source URL
    -n, --name              custom (version independent) identifier for the content
    -o, --flavour           custom (version independent) content flavour
    -s, --scraper           name & version of tool used to produce HTML content
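
The split suggested above (data stored in the ZIM lives on ZimInfo, zimwriterfs options are passed to to_zimwriterfs_args()) could look roughly like the sketch below. `ExtendedZimInfo` and the keyword names `inflate_html`, `redirects` and `without_ft_index` are hypothetical and only illustrate the idea; they are not the zimscraperlib API. The long option names come from the help text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtendedZimInfo:
    """Hypothetical stand-in, not the real zimscraperlib.zim.ZimInfo."""

    language: str
    title: str
    description: str
    creator: str
    publisher: str
    # Data that ends up inside the ZIM belongs on the info object.
    source: Optional[str] = None
    flavour: Optional[str] = None

    def to_zimwriterfs_args(
        self,
        inflate_html: bool = False,
        redirects: Optional[str] = None,
        without_ft_index: bool = False,
    ) -> List[str]:
        # Metadata fields that are always emitted.
        args: List[str] = []
        for name in ("language", "title", "description", "creator", "publisher"):
            args += [f"--{name}", getattr(self, name)]
        # Optional metadata is emitted only when set.
        for name in ("source", "flavour"):
            value = getattr(self, name)
            if value:
                args += [f"--{name}", value]
        # Tool options are explicit parameters, not arbitrary pass-through.
        if inflate_html:
            args.append("--inflateHtml")
        if without_ft_index:
            args.append("--withoutFTIndex")
        if redirects:
            args += ["--redirects", redirects]
        return args


info = ExtendedZimInfo(
    language="eng",
    title="Example",
    description="An example ZIM",
    creator="Example community",
    publisher="Example publisher",
    source="https://example.stackexchange.com",
    flavour="nopic",
)
zim_args = info.to_zimwriterfs_args(inflate_html=True, redirects="redirects.tsv")
print(zim_args)
```

Because every supported option is a named parameter, a scraper cannot pass arbitrary strings to the command line tool; supporting a new flag means adding one parameter.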