dattaz closed this issue 4 years ago
Hey @dattaz I'm working on it. Also, there are some other things that we can include in scraperlib as this scraper (in my opinion) contains a lot of code that is shareable.
@satyamtg, sure there are! For easier review and a quicker merge, though, let's do it piece by piece.
Yup, sure. I actually meant: let's solve this issue now and, in the near future, shift to scraperlib piece by piece.
@dattaz @rgaudin, it seems that the ZimInfo class in zimscraperlib.zim doesn't have enough attributes to be used for ZIM creation in sotoki. sotoki passes many extra zimwriterfs arguments, such as --inflateHtml, --redirects, --flavour, and --source. Currently, there's no way to generate these extra zimwriterfs arguments, even if we add the attributes to the ZimInfo class using the update() method: to_zimwriterfs_args() only converts the attributes that are defined in the class. I would suggest that we implement, in zimscraperlib.zim, the ability to convert freshly added extra attributes into zimwriterfs arguments, as some scrapers will need extra arguments. We can have a basic set of required arguments, but at some point we will need to become flexible.
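To make the limitation concrete, here is a minimal sketch. ToyZimInfo is a hypothetical stand-in written for this comment, not the real zimscraperlib.zim.ZimInfo; the attribute list and the --name=value argument format are assumptions. It only mirrors the behaviour described above: update() accepts any attribute, but the converter only looks at the attributes the class declares.

```python
class ToyZimInfo:
    """Hypothetical stand-in, NOT the real zimscraperlib.zim.ZimInfo."""

    # Assumption: only these attributes are known to the argument converter.
    KNOWN = ("language", "title", "description", "creator", "publisher")

    def __init__(self, **kwargs):
        for name in self.KNOWN:
            setattr(self, name, None)
        self.update(**kwargs)

    def update(self, **kwargs):
        # Accepts any attribute, including ones the converter will ignore.
        for name, value in kwargs.items():
            setattr(self, name, value)

    def to_zimwriterfs_args(self):
        # Only attributes listed in KNOWN are turned into arguments.
        args = []
        for name in self.KNOWN:
            value = getattr(self, name)
            if value is not None:
                args.append(f"--{name}={value}")
        return args


info = ToyZimInfo(title="Example", language="eng")
info.update(source="https://example.org", flavour="nopic")
args = info.to_zimwriterfs_args()
print(args)
```

In this toy version, source and flavour are stored on the object but never reach the generated argument list, which is the gap sotoki would hit.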
Hey, thanks for looking into this. Before you invest too much time on this, it's important to state that zimwriterfs is the historical way of writing ZIM files but definitely not the future.
Yesterday, a first PR was sent to the python-libzim repo, so this is moving fast, and the first candidate for the zimwriterfs->binding switch is sotoki (and that would be done by the ones writing the binding).
So, I suggest we close that particular issue as clearly this will be replaced pretty soon.
That said, we probably want to support zimwriterfs in the zimscraperlib anyway so we might want to extend it. I'm not in favor of passing arbitrary parameters to a command line tool so let's just extend it to support the missing params.
Data that goes into the ZIM should be added to ZimInfo (source, flavour) and options that are passed to zimwriterfs should be a param to to_zimwriterfs_args().
-w, --welcome path of default/main HTML page. The path must be relative to HTML_DIRECTORY.
-f, --favicon path of ZIM file favicon. The path must be relative to HTML_DIRECTORY and the image a 48x48 PNG.
-l, --language language code of the content in ISO639-3
-t, --title title of the ZIM file
-d, --description short description of the content
-c, --creator creator(s) of the content
-p, --publisher creator of the ZIM file itself
HTML_DIRECTORY path of the directory containing the HTML pages you want to put in the ZIM file.
ZIM_FILE path of the ZIM file you want to obtain.
Optional arguments:
-v, --verbose print processing details on STDOUT
-h, --help print this help
-V, --version print the version number
-m, --minChunkSize number of bytes per ZIM cluster (default: 2048)
-x, --inflateHtml try to inflate HTML files before packing (*.html, *.htm, ...)
-u, --uniqueNamespace put everything in the same namespace 'A'. Might be necessary to avoid problems with dynamic/javascript data loading.
-r, --redirects path to a TSV file containing a list of redirects (namespace url title target_url).
-j, --withoutFTIndex don't create and add a fulltext index of the content to the ZIM.
-a, --tags tags - semicolon separated
-e, --source content source URL
-n, --name custom (version independent) identifier for the content
-o, --flavour custom (version independent) content flavour
-s, --scraper name & version of tool used to produce HTML content
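The split suggested above (ZIM data on ZimInfo, zimwriterfs options as parameters to to_zimwriterfs_args()) could look roughly like the sketch below. ZimInfoSketch, its field list, and the keyword names (inflate_html, redirects, and so on) are all hypothetical and chosen for illustration; only the long flag names come from the zimwriterfs help text above. The welcome page, favicon, and the positional HTML_DIRECTORY and ZIM_FILE arguments are left out for brevity.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ZimInfoSketch:
    """Hypothetical stand-in for ZimInfo: holds data that goes into the ZIM."""

    # Field names deliberately match the zimwriterfs long flags.
    language: Optional[str] = None
    title: Optional[str] = None
    description: Optional[str] = None
    creator: Optional[str] = None
    publisher: Optional[str] = None
    name: Optional[str] = None
    tags: Optional[str] = None
    source: Optional[str] = None
    flavour: Optional[str] = None
    scraper: Optional[str] = None

    def to_zimwriterfs_args(
        self,
        verbose: bool = False,
        inflate_html: bool = False,
        unique_namespace: bool = False,
        without_ft_index: bool = False,
        redirects: Optional[str] = None,
        min_chunk_size: Optional[int] = None,
    ) -> List[str]:
        """Build arguments from explicit options plus the stored ZIM data."""
        args: List[str] = []
        # Options are an explicit, closed set: no arbitrary flags pass through.
        if verbose:
            args.append("--verbose")
        if inflate_html:
            args.append("--inflateHtml")
        if unique_namespace:
            args.append("--uniqueNamespace")
        if without_ft_index:
            args.append("--withoutFTIndex")
        if redirects is not None:
            args.append(f"--redirects={redirects}")
        if min_chunk_size is not None:
            args.append(f"--minChunkSize={min_chunk_size}")
        # ZIM data: every field that has been set becomes --<field>=<value>.
        for field in fields(self):
            value = getattr(self, field.name)
            if value is not None:
                args.append(f"--{field.name}={value}")
        return args


info = ZimInfoSketch(
    title="Example",
    language="eng",
    source="https://example.org",
    flavour="nopic",
)
args = info.to_zimwriterfs_args(inflate_html=True, redirects="redirects.tsv")
print(args)
```

Keeping the options as named keyword parameters means a typo or an unsupported flag fails at the call site instead of being forwarded to the command line.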
Currently, we have our own code to interface with zimwriterfs and build the ZIM file after generating all content. Code: https://github.com/openzim/sotoki/blob/master/sotoki/sotoki.py#L1176:L1312
We should use our new https://github.com/openzim/python_scraperlib instead.