Open anjackson opened 3 years ago
More information from Ilya:
Hi, I am working on standardizing POST request indexing across all the different Webrecorder tools, and on supporting additional improvements. This probably calls for a write-up, but I just wanted to share what the idea is so far:
the POST request data is, if possible, converted to query (form-encoded) form and treated as part of the URL, in a sense converting the POST request to a GET
this can also apply to PUT or any other requests
The CDXJ entry would look like this:
org,httpbin)/post?__wb_method=post&another=more^data&test=some+data 20200809195334 {"url": "https://httpbin.org/post", "mime": "application/json", "status": "200", "digest": "7AWVEIPQMCA4KTCNDXWSZ465FITB7LSK", "length": "688", "offset": "0", "filename": "post-test-more.warc", "requestBody": "?__wb_method=POST&test=some+data&another=more%5Edata", "method": "POST"}
the canonicalized key has this extra query appended to it, along with __wb_method
the url field is not modified
the url-encoded query form is stored in the requestBody field, and an extra method field is also added
the requestBody is built according to the request content type:
- application/x-www-form-urlencoded - already in this form, use as is
- multipart/form-data - convert to url-encoded query
- application/json - parse the JSON and add each primitive to the query, e.g. {"a": "b", "foo": {"c": "d"}} becomes a=b&c=d (is a better approach possible?)
- text/plain - assume it may be JSON and try to parse it as application/json; otherwise treat as binary/other
- binary/all other - base64 encode and add as __wb_post_data=<base 64 data>
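The content-type rules above can be sketched in Python roughly as follows. This is a minimal illustration of the idea as described in this comment, not pywb's actual implementation: the function names (`request_body_to_query`, `_json_to_query`, `_multipart_to_pairs`) are hypothetical, and the handling of non-string JSON primitives, of JSON that is not an object or array, and of unparseable `application/json` bodies are assumptions made here.

```python
import base64
import email
import json
from urllib.parse import urlencode


def _json_to_query(body):
    """Flatten JSON primitives into a query string, or return None."""
    try:
        data = json.loads(body.decode("utf-8"))
    except ValueError:  # covers invalid UTF-8 and invalid JSON
        return None
    # Assumption: only objects and arrays are treated as JSON bodies.
    if not isinstance(data, (dict, list)):
        return None
    pairs = []

    def walk(key, value):
        if isinstance(value, dict):
            for child_key, child in value.items():
                walk(child_key, child)
        elif isinstance(value, list):
            for item in value:
                walk(key, item)
        else:
            # Assumption: non-string primitives use their JSON spelling
            # (true, false, null, 1.5).
            text = value if isinstance(value, str) else json.dumps(value)
            pairs.append((key, text))

    walk("", data)
    return urlencode(pairs)


def _multipart_to_pairs(body, content_type):
    """Extract (name, value) pairs from a multipart/form-data body."""
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    msg = email.message_from_bytes(raw)
    pairs = []
    if msg.is_multipart():
        for part in msg.get_payload():
            name = part.get_param("name", header="content-disposition")
            value = part.get_payload(decode=True) or b""
            pairs.append((name or "", value.decode("utf-8", "replace")))
    return pairs


def request_body_to_query(method, content_type, body):
    """Return the extra query string for a request body (sketch only)."""
    mime = content_type.split(";", 1)[0].strip().lower()
    query = None
    if mime == "application/x-www-form-urlencoded":
        query = body.decode("utf-8", "replace")  # already in query form
    elif mime == "multipart/form-data":
        query = urlencode(_multipart_to_pairs(body, content_type))
    elif mime in ("application/json", "text/plain"):
        query = _json_to_query(body)  # None if it does not parse
    if query is None:
        # binary/all other: base64 encode the raw body
        query = "__wb_post_data=" + base64.b64encode(body).decode("ascii")
    return "__wb_method=" + method.upper() + "&" + query


print(request_body_to_query(
    "POST", "application/json", b'{"a": "b", "foo": {"c": "d"}}'))
# prints: __wb_method=POST&a=b&c=d
```

Note that this sketch keeps only the leaf key of nested JSON values, as in the a=b&c=d example, so repeated key names at different depths are not distinguished.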
Thanks @thomasegense - I'm afraid I'm probably going to switch to using the PyWB indexer for now, as modifying this codebase to pull together the request and response records is going to mean significant changes to the way it works. I don't currently have time to make those changes.
I completely agree with you. This is a hard, time-consuming task with only minor benefits to playback in solrwayback. (The URL max length of 2048 would also have to be changed.)
Here is the latest work going on between Ilya and Alex: https://github.com/webrecorder/specs/blob/issue-141-post-canonicalization/post-canonicalization/latest/index.md
I have a Java implementation of pywb-compatible POST/PUT request body encoding here: https://github.com/iipc/jwarc/blob/master/src/org/netpreserve/jwarc/cdx/CdxRequestEncoder.java
To get playback working, we need to make HEAD/OPTIONS/POST records like PyWB does. See https://github.com/webrecorder/pywb/issues/585 and related tickets.
It's fairly involved! https://github.com/webrecorder/pywb/blob/54d8bccf4a4eebf305012d49cb7330eaddea9eba/pywb/warcserver/inputrequest.py#L183
Will replace/supersede https://github.com/ukwa/webarchive-discovery/blob/a166803280fc62e51c4dcf4ee8acd7ac6ee38f4c/warc-hadoop-recordreaders/src/main/java/uk/bl/wa/hadoop/mapreduce/cdx/TinyCDXServerReducer.java#L86-L95
Note that to be useful, we need to upgrade to nlagovau/outbackcdx:0.8.0.