openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Fuzzy Matching Improvements / POST requests #80

Closed ikreymer closed 2 years ago

ikreymer commented 3 years ago

warc2zim now has a set of fuzzy matching rules (https://github.com/openzim/warc2zim/blob/master/src/warc2zim/main.py#L75) which are a subset of the larger ruleset in wabac.js (https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js#L8)

(pywb also has rules in python that are mostly aligned with the wabac.js rules https://github.com/webrecorder/pywb/blob/master/pywb/rules.yaml)

Having this many different rule sets is definitely a maintenance concern, so perhaps we should at least try to have warc2zim use the wabac.js rules, since wabac.js is used for replay. These rules could easily be exposed as a JSON file that is loaded similarly to sw.js.

Ideally, warc2zim would not need any rules and wabac.js could just read from zim using existing rules, but this is not possible for two issues:

Since prefix querying is not possible when loading from ZIM, the alternative is a custom canonicalization option, which wabac.js also supports: we create a 'fake' redirect, e.g.:

https://example.fuzzy.replayweb.page/?A=B, which redirects to https://example.com/?A=B&_=1234 in the ZIM.

Then, when wabac.js encounters https://example.com/?A=B&_=1235, it also maps to https://example.fuzzy.replayweb.page/?A=B, and so is able to do the lookup.
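A minimal Python sketch of this redirect scheme, using the example URLs above. The single rule in `FUZZY_RULES` and the helper names `fuzzy_url`, `build_redirects` and `lookup` are hypothetical and only illustrate the idea; they are not the actual warc2zim or wabac.js rules.

```python
import re

# Hypothetical rule list: (pattern, replacement) pairs.
# This one drops the cache-busting "_" parameter on example.com and
# rewrites the host to the special ".fuzzy.replayweb.page" form.
FUZZY_RULES = [
    (
        re.compile(r"^https?://example\.com/\?(.*?)&_=\d+$"),
        r"https://example.fuzzy.replayweb.page/?\1",
    ),
]


def fuzzy_url(url):
    """Return the canonical 'fuzzy' URL for url, or None if no rule applies."""
    for pattern, replacement in FUZZY_RULES:
        new_url, count = pattern.subn(replacement, url)
        if count:
            return new_url
    return None


def build_redirects(archived_urls):
    """Conversion time: map each fuzzy URL to the first archived URL that yields it."""
    redirects = {}
    for url in archived_urls:
        key = fuzzy_url(url)
        if key is not None:
            redirects.setdefault(key, url)
    return redirects


def lookup(redirects, requested_url):
    """Replay time: canonicalize the requested URL and follow the redirect, if any."""
    key = fuzzy_url(requested_url)
    return redirects.get(key) if key is not None else None
```

Because each fuzzy URL maps to exactly one stored URL, a second capture that canonicalizes to the same key is simply ignored here.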

This does work, but it is less flexible than the prefix search, as there is only one possible match.

For example, let's say a URL is the same and requests can only be distinguished by the POST data, which contains {"videoid": "A"}. A combined URL after reading the request and response can then be: https://example.com/?_=1234&__post_json_data={"videoid": "A"}, and the previous prefix search for https://example.com/? can find the best match.
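To illustrate why prefix search is more flexible, here is a rough Python sketch that finds all stored URLs sharing a prefix and picks the one with the most matching query parameters. The scoring (count of identical key/value pairs) and the function names are assumptions for illustration, not how wabac.js actually ranks candidates.

```python
from bisect import bisect_left
from urllib.parse import parse_qsl, urlsplit


def prefix_matches(sorted_urls, prefix):
    """Return all URLs in a sorted list that start with prefix."""
    start = bisect_left(sorted_urls, prefix)
    matches = []
    for url in sorted_urls[start:]:
        if not url.startswith(prefix):
            break
        matches.append(url)
    return matches


def best_match(sorted_urls, prefix, requested_url):
    """Pick the candidate sharing the most query key/value pairs with the request."""
    wanted = set(parse_qsl(urlsplit(requested_url).query, keep_blank_values=True))

    def score(url):
        have = set(parse_qsl(urlsplit(url).query, keep_blank_values=True))
        return len(wanted & have)

    candidates = prefix_matches(sorted_urls, prefix)
    return max(candidates, key=score) if candidates else None
```

With several captures under https://example.com/?, the request whose POST data contains {"videoid": "A"} would select the capture carrying the same `__post_json_data` value, even though the `_` parameter differs.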

For ZIMs, we'll need to do more work, though. The POST request must now also be parsed, and a 'fake' redirect URL generated, probably something like https://example.fuzzy.replayweb.page/?__post_json_data={"videoid": "A"}.

This is doable and can be added, but I just wanted to raise awareness, as this means creating (and continuing to maintain) a slightly different fuzzy matching scheme for ZIMs than exists for WARCs in wabac.js. The only possible alternatives, it seems, would be to allow for:

This issue is now coming up with YouTube, as YouTube is making a POST request to the same URL and the only difference is in the POST data (mentioned in webrecorder/browsertrix-crawler#4). The existing POST handling + prefix system means that replayweb.page is able to replay this new YouTube player in WARCs + WACZ, but not ZIMs.

Let me know if this makes sense, or if I should elaborate further.

ikreymer commented 3 years ago

A quick update: with the latest commit, this version now works with latest Vimeo videos.

Youtube will still require POST request handling, as mentioned above. The simplest solution is to add that, as that's what was done in pywb and wabac.js, and zim replay requires a modified system as discussed above.

What this involves is looking at the WARC request record, and if it is a POST request, and the content-type is either json or form encoding, the POST request is added to the URL as a query. Then, the special rule is applied to add a fuzzy matching redirect.
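The steps above could be sketched in Python roughly as follows. This is an illustration only: the function name `append_post_query` and the way the body is flattened into query parameters are assumptions, and the real wabac.js/pywb conversion handles more cases (nested JSON, size limits, and so on).

```python
import json
from urllib.parse import parse_qsl, urlencode


def append_post_query(url, method, content_type, body):
    """Return url with the POST body appended as query parameters.

    method and content_type come from the WARC request record; body is the
    raw request payload as bytes. The URL is returned unchanged for non-POST
    requests or unsupported content types.
    """
    if method.upper() != "POST":
        return url

    # Ignore parameters such as "; charset=UTF-8".
    mime = (content_type or "").split(";", 1)[0].strip().lower()

    if mime == "application/json":
        try:
            data = json.loads(body)
        except ValueError:  # covers invalid JSON and undecodable bytes
            return url
        if not isinstance(data, dict):
            return url
        # Assumption: top-level keys become parameters; non-string values
        # are serialized back to JSON text.
        params = [
            (key, value if isinstance(value, str) else json.dumps(value))
            for key, value in sorted(data.items())
        ]
    elif mime == "application/x-www-form-urlencoded":
        params = parse_qsl(body.decode("utf-8", errors="replace"), keep_blank_values=True)
    else:
        return url

    if not params:
        return url
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(params)
```

Two POST requests to the same endpoint with different bodies then yield different URLs, so both responses can be stored, and the fuzzy matching rule can be applied to the combined URL afterwards.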

rgaudin commented 3 years ago

Thanks for all the details. Trying to understand exactly what each option would imply in terms of changes and maintenance. Surely having prefix search in ZIM (in readers actually; libzim does provide this feature) saves duplication and possible bugs, but it might mean changing every ZIM reader in an out-of-spec way…

Will try to look at the code to understand the other option better.

ikreymer commented 3 years ago

The immediate solution for youtube is to add the POST request mapping. Thinking about it more, there is no way around that, even with prefix support.

Here's an example of the latest conversion function, which now handles both form and JSON data: https://github.com/webrecorder/wabac.js/blob/main/src/utils.js#L105

Without adding the POST data, we would end up with duplicate URLs like https://www.youtube.com/youtubei/v1/player?key=<some key> for each video, and can only store one in the ZIM.

So we should probably implement something like the above function in warc2zim.

rgaudin commented 3 years ago

OK, thanks, it's a bit clearer now. Discussed this with @kelson42 and we confirm it's not possible to provide a prefix search API at the moment as this is too big of a concept change for the format/reader.

So we'll go with your other option. Let's discuss on Slack if/how we can split the workload and maybe refactor those pieces so that it's easier to maintain. We should anyway have a better understanding of the replayer parts at play. We've kinda neglected it since it was maintained in wabac.js.

rgaudin commented 3 years ago

@ikreymer what's the status of the video-replay-fixes branch? Should we merge that in?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Popolechien commented 3 years ago

Hi @ikreymer any update on this?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

ikreymer commented 2 years ago

This is being addressed by #83