ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add POST data records, for PyWB playback #244

Open anjackson opened 3 years ago

anjackson commented 3 years ago

To get playback working, we need to make HEAD/OPTIONS/POST records like PyWB does. See https://github.com/webrecorder/pywb/issues/585 and related tickets.

It's fairly involved! https://github.com/webrecorder/pywb/blob/54d8bccf4a4eebf305012d49cb7330eaddea9eba/pywb/warcserver/inputrequest.py#L183

Will replace/supersede https://github.com/ukwa/webarchive-discovery/blob/a166803280fc62e51c4dcf4ee8acd7ac6ee38f4c/warc-hadoop-recordreaders/src/main/java/uk/bl/wa/hadoop/mapreduce/cdx/TinyCDXServerReducer.java#L86-L95

Note that to be useful, we need to upgrade to nlagovau/outbackcdx:0.8.0.

thomasegense commented 3 years ago

More information from Ilya:

Hi, I am working on trying to standardize POST request indexing across all the different Webrecorder tools, and support additional improvements. This probably calls for a write-up, but just wanted to share what the idea is so far:

the POST request data is, if possible, converted to a form-encoded query and treated as part of the URL, in a sense converting the POST request to a GET
this can also apply to PUT or any other request method

The CDXJ entry would look like this:

org,httpbin)/post?__wb_method=post&another=more^data&test=some+data 20200809195334 {"url": "https://httpbin.org/post", "mime": "application/json", "status": "200", "digest": "7AWVEIPQMCA4KTCNDXWSZ465FITB7LSK", "length": "688", "offset": "0", "filename": "post-test-more.warc", "requestBody": "?__wb_method=POST&test=some+data&another=more%5Edata", "method": "POST"}

the canonicalized key has this extra query appended to it, along with __wb_method
the url field is not modified
the url-encoded query form is stored in the requestBody field, and an extra method field is added
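
As a rough illustration of the three points above, here is a minimal Python sketch. The function names (`method_query_suffix`, `cdxj_json_block`) are hypothetical and not taken from pywb or any other tool. The sketch assumes the body has already been converted to a url-encoded query, and it leaves canonicalization of the key (SURT form, lowercasing, parameter sorting) to whatever canonicalizer the indexer already uses.

```python
import json


def method_query_suffix(url: str, method: str, body_query: str) -> str:
    """Build the extra query that is appended for non-GET requests."""
    # Use '&' if the URL already carries a query string, '?' otherwise.
    sep = "&" if "?" in url else "?"
    suffix = f"{sep}__wb_method={method.upper()}"
    if body_query:
        suffix += "&" + body_query
    return suffix


def cdxj_json_block(url: str, method: str, body_query: str, **fields):
    """Return (URL to canonicalize into the key, JSON block for the entry).

    The 'url' field keeps the original URL; only the key input changes.
    """
    suffix = method_query_suffix(url, method, body_query)
    block = {"url": url, **fields,
             "requestBody": suffix, "method": method.upper()}
    return url + suffix, json.dumps(block)


key_input, block = cdxj_json_block(
    "https://httpbin.org/post", "post",
    "test=some+data&another=more%5Edata",
    mime="application/json", status="200",
)
print(key_input)
print(block)
```

The first returned value is what would be fed to the canonicalizer to produce the CDXJ key; the second is the JSON part of the line, with url left untouched.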

the requestBody is derived according to the request's content type:
application/x-www-form-urlencoded - already in this form, use as is
multipart/form-data - convert to url-encoded query
application/json - parse the JSON and add each primitive to the query, e.g. {"a": "b", "foo": {"c": "d"}} becomes a=b&c=d (is a better approach possible?)
text/plain - assume it may be JSON and try to parse it as application/json; otherwise treat as binary/other
binary/all other - base64 encode and add as __wb_post_data=<base64 data>
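
The per-content-type rules can be sketched in Python roughly as follows. This is an illustrative reading of the description above, not the pywb implementation: all function names are hypothetical, the parameter name `__wb_post_data` is assumed to match pywb, and details such as how duplicate JSON keys are handled are simplified (here they are just repeated as separate parameters).

```python
import base64
import json
from email import message_from_bytes
from urllib.parse import urlencode

BINARY_PARAM = "__wb_post_data"  # assumed to match the name pywb uses


def _json_pairs(value, key):
    """Yield (key, text) for every primitive inside a parsed JSON value."""
    if isinstance(value, dict):
        for k, v in value.items():
            yield from _json_pairs(v, k)
    elif isinstance(value, list):
        for v in value:
            yield from _json_pairs(v, key)
    else:
        # Strings are used as-is; numbers, booleans and null keep JSON spelling.
        yield key, value if isinstance(value, str) else json.dumps(value)


def json_to_query(body: bytes) -> str:
    parsed = json.loads(body.decode("utf-8"))
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return urlencode(list(_json_pairs(parsed, "")))


def multipart_to_query(body: bytes, content_type: str) -> str:
    # Reuse the stdlib MIME parser by prepending the Content-Type header.
    header = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    msg = message_from_bytes(header + body)
    if not msg.is_multipart():
        raise ValueError("not a parseable multipart body")
    pairs = []
    for part in msg.get_payload():
        name = part.get_param("name", header="content-disposition") or ""
        data = part.get_payload(decode=True) or b""
        pairs.append((name, data.decode("utf-8", "replace")))
    return urlencode(pairs)


def binary_to_query(body: bytes) -> str:
    return BINARY_PARAM + "=" + base64.b64encode(body).decode("ascii")


def body_to_query(body: bytes, content_type: str) -> str:
    """Convert a request body to a url-encoded query, by content type."""
    mime = content_type.split(";", 1)[0].strip().lower()
    try:
        if mime == "application/x-www-form-urlencoded":
            return body.decode("utf-8")  # already in query form
        if mime == "multipart/form-data":
            return multipart_to_query(body, content_type)
        if mime in ("application/json", "text/plain"):
            return json_to_query(body)
    except ValueError:
        pass  # unparseable: fall through to the binary rule
    return binary_to_query(body)


print(body_to_query(b'{"a": "b", "foo": {"c": "d"}}', "application/json"))
```

In this sketch any body that fails to parse under its declared type falls back to the base64 rule, which is one possible design choice rather than something the description above specifies.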

anjackson commented 3 years ago

Thanks @thomasegense - I'm afraid I'm probably going to switch to using the PyWB indexer for now, as modifying this codebase to pull together the request and response records is going to mean significant changes to the way it works. I don't currently have time to make those changes.

thomasegense commented 3 years ago

I completely agree with you. This is a hard, time-consuming task with only minor benefits to playback in solrwayback. (The URL max length of 2048 would also have to be changed.)

thomasegense commented 1 year ago

Here is the latest work going on between Ilya and Alex: https://github.com/webrecorder/specs/blob/issue-141-post-canonicalization/post-canonicalization/latest/index.md

ato commented 1 year ago

I have a Java implementation of pywb-compatible POST/PUT request body encoding here: https://github.com/iipc/jwarc/blob/master/src/org/netpreserve/jwarc/cdx/CdxRequestEncoder.java