openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Wabac fuzzy rules - update + process #335

Closed benoit74 closed 4 days ago

benoit74 commented 5 days ago

Fix #216

Changes:

benoit74 commented 4 days ago

Good points!

Aren't we planning on using a single, shared, file with browsertrix though? JS world is obviously more JSON-oriented than YAML.

Yes, we are, but probably not in the coming weeks. I am not sure which standard we might use by then.

Regarding escaping in JSON strings, I agree we are not forced to do it, but it is required according to section 2.5 of https://www.ietf.org/rfc/rfc4627.txt (it is just a memo, so we could arguably ignore it, but still...):

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\".
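To make the quoted rules concrete, here is a minimal sketch using Python's standard `json` module. It is only an illustration of the RFC text, not the code from this PR, and the sample strings are made up. It shows that both escape forms of the reverse solidus decode to the same character, that optional escapes are accepted, and that an unescaped control character is rejected.

```python
import json

# Section 2.5: the six-character \uXXXX form and the two-character form
# both decode to a single reverse solidus.
long_form = json.loads(r'"\u005C"')
short_form = json.loads(r'"\\"')

# "Any character may be escaped": "\/" and "/" decode identically,
# so escaping the solidus is allowed but not required.
escaped_solidus = json.loads(r'"\/"')
plain_solidus = json.loads('"/"')

# json.dumps escapes the characters that must be escaped:
# quotation mark, reverse solidus, and control characters (here a newline).
encoded = json.dumps('a"b\\c\n')

# A raw (unescaped) control character inside a JSON string is rejected
# by the default strict parser.
try:
    json.loads('"a\nb"')
    raw_control_accepted = True
except json.JSONDecodeError:
    raw_control_accepted = False
```

In other words, a parser that follows the RFC should accept either form, so the choice between them in a rules file is mostly about readability.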

I will merge as-is; we can always roll back, it's not like it is a big change ^^