iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

disable serviceworker in replay proxy mode #69

Closed by sberequek 1 year ago

sberequek commented 1 year ago

Hi,

when running jwarc as a replay proxy, is there a way to disable the service worker script injection? Looking at the source code of the WarcServer class, I would like to know whether it would be possible to add a parameter to the GET request for "replay" that changes the value of the "proxy" argument. Currently the replay method is always called with "proxy" set to false (line 112).

Thanks

ato commented 1 year ago

Proxy requests are handled by the proxy(HttpExchange) method which calls replay() with the proxy argument set to true.

https://github.com/iipc/jwarc/blob/5cbdd3d0e6683678d2f658a9fe2b4951b403cc7c/src/org/netpreserve/jwarc/net/WarcServer.java#L73

Proxy requests can be distinguished from normal requests by exchange.request().target() being an absolute URL (currently this is done implicitly, with proxy being the default fall-through route).
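As a rough illustration of that absolute-URL check, here is a minimal standalone sketch. It is not jwarc code: the class ProxyTargetCheck and the method isAbsoluteTarget are hypothetical names, and the sketch assumes the request target is available as a plain string and that only http and https targets need to be recognised.

```java
// Hypothetical helper for illustration only; not part of jwarc's WarcServer.
final class ProxyTargetCheck {

    // A proxy client sends the full URL as the request target
    // (e.g. "http://www.example.org/"), whereas a direct request
    // sends only a path (e.g. "/replay/20230101000000/http://...").
    static boolean isAbsoluteTarget(String target) {
        return startsWithIgnoreCase(target, "http://")
                || startsWithIgnoreCase(target, "https://");
    }

    // Case-insensitive prefix test; returns false if s is shorter than prefix.
    private static boolean startsWithIgnoreCase(String s, String prefix) {
        return s.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    public static void main(String[] args) {
        // Target as sent by: curl --proxy http://localhost:8080 http://www.example.org/
        System.out.println(isAbsoluteTarget("http://www.example.org/")); // true
        // Target as sent by a normal replay request
        System.out.println(isAbsoluteTarget("/replay/20230101000000/http://www.example.org/")); // false
    }
}
```

A check like this would make the routing explicit rather than relying on proxy being the fall-through route; the replay path contains an embedded URL but still starts with "/", so it is not mistaken for a proxy request.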

Note that jwarc's WarcServer hasn't been well tested and lacks important features like a proper index and date-selection UI in proxy mode. It's more of a proof of concept / demo. I would currently recommend pywb's proxy mode instead for most users.

ato commented 1 year ago

Demonstration that script injection doesn't happen when used in proxy mode:

$ jwarc fetch http://www.example.org/ > /tmp/example.warc
$ jwarc serve /tmp/example.warc &
Listening on port 8080
$ curl --proxy http://localhost:8080 http://www.example.org/
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

But does when used in normal replay mode:

$ curl http://localhost:8080/replay/20230101000000/http://www.example.org/
<!doctype html><script src='/__jwarc__/inject.js'></script>

sberequek commented 1 year ago

Thanks @ato,

I fixed it. Perfect, thanks for the tips.