openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Refactor HTML rewriter class to make it more open to change and expressive #343

Closed benoit74 closed 2 months ago

benoit74 commented 3 months ago

Fix #305

Idea: we should be able to define individual rewriting rules with functions and decorators, just like we do when defining endpoints in FastAPI. We have one decorator per kind of modification (drop attribute, rewrite attribute, rewrite data, and maybe others), and we implement one decorated function for each expected modification.
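A minimal sketch of what such a decorator-based registry could look like. Every name here (`RewriteRules`, `drop_attribute`, `rewrite_attribute`, `apply`, and the two example rules) is hypothetical and chosen for illustration; it is not the code from this PR.

```python
from typing import Callable, List, Optional, Tuple

Attr = Tuple[str, Optional[str]]


class RewriteRules:
    """Hypothetical registry collecting rewriting rules via decorators."""

    def __init__(self) -> None:
        self.drop_rules: List[Callable[[str, str], bool]] = []
        self.rewrite_rules: List[Callable[[str, str, Optional[str]], Optional[str]]] = []

    def drop_attribute(self, func):
        # Registered function returns True when the attribute must be dropped.
        self.drop_rules.append(func)
        return func

    def rewrite_attribute(self, func):
        # Registered function returns the new value, or None to leave it as-is.
        self.rewrite_rules.append(func)
        return func

    def apply(self, tag: str, attrs: List[Attr]) -> List[Attr]:
        result: List[Attr] = []
        for name, value in attrs:
            if any(rule(tag, name) for rule in self.drop_rules):
                continue  # at least one rule asked to drop this attribute
            for rule in self.rewrite_rules:
                new_value = rule(tag, name, value)
                if new_value is not None:
                    value = new_value
            result.append((name, value))
        return result


rules = RewriteRules()


@rules.drop_attribute
def drop_integrity(tag: str, attr_name: str) -> bool:
    # Illustrative rule: drop "integrity" on <script> and <link>.
    return tag in ("script", "link") and attr_name == "integrity"


@rules.rewrite_attribute
def force_utf8_charset(tag: str, attr_name: str, attr_value: Optional[str]) -> Optional[str]:
    # Illustrative rule: force <meta charset> to UTF-8.
    if tag == "meta" and attr_name == "charset":
        return "UTF-8"
    return None
```

With this shape, adding a new rule means adding one decorated function; the `apply` loop does not change.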

Changes:

See https://github.com/openzim/warc2zim/pull/353 for a showcase of how this new code structure makes it easier to:

benoit74 commented 3 months ago

@rgaudin this is far from being done, but I would like to get some initial feedback on this approach.

For now only the simple "drop attribute" case is implemented, but it gives a rough idea of what it will look like if I continue on this approach.

I'm a bit surprised I had to implement the call_func helper function myself. I'm pretty sure this is possible without custom code, using only the Python stdlib, but I failed to find how to write it properly. The idea is that I want to pass whichever arguments are defined in the decorated function, but no others.
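For reference, a helper of this kind can be kept short with `inspect.signature` from the stdlib. The sketch below is an assumption about what `call_func` needs to do, not the implementation in this PR; the example rules and the `context` dictionary are hypothetical.

```python
import inspect


def call_func(func, **available):
    """Call func with only the keyword arguments it declares."""
    params = inspect.signature(func).parameters
    # A function accepting **kwargs can take everything we have.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(**available)
    wanted = {name: value for name, value in available.items() if name in params}
    return func(**wanted)


# Hypothetical rules declaring different subsets of the available arguments.
def drop_rule(tag, attr_name):
    return tag == "script" and attr_name == "integrity"


def simpler_rule(attr_name):
    return attr_name.startswith("on")


context = {"tag": "script", "attr_name": "integrity", "attr_value": "sha256-x"}
dropped = call_func(drop_rule, **context)      # receives tag and attr_name only
is_handler = call_func(simpler_rule, **context)  # receives attr_name only
```

This version passes everything by keyword, so it would not handle positional-only parameters, and a rule that declares an argument missing from `available` still fails with the usual `TypeError`.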

benoit74 commented 2 months ago

Refactoring work is now completed, all HTML rewrite operations are isolated in standalone methods.