Closed machawk1 closed 5 years ago
Thanks for trying it out. Yeah, it's not meant to be a complete crawler, just a building block, so it only saves the single specified resource without subresources. I should document that to make it clear though.
I am planning to include an API for a WARC recording proxy server in a later release though. This will include a demo command in the CLI tool which can use headless Firefox or Chrome to collect subresources.
Odd that you're getting service unavailable. Can I ask which OS and browser you're using?
macOS 10.14.2 and Chrome 72.
I have yet to re-test this by building from source; I was using the release. If you are unable to replicate the Service Unavailable response, I can look into it further.
Ah. jwarc doesn't parse chunked encoding yet (#1), so when it injects the JavaScript for replay it clobbers the chunked header, resulting in a misleading 'Service Unavailable' error in the browser.
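For anyone unfamiliar with the format, here is a minimal sketch of what decoding a chunked HTTP body involves. This is not jwarc's code; the class and method names (`ChunkedDecoder`, `decode`, `readLine`) are hypothetical, and it assumes the input stream is positioned at the start of the body, after the headers.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jwarc: decodes a "Transfer-Encoding: chunked" body.
final class ChunkedDecoder {

    // Reads one line up to LF as ASCII and drops the CR/LF terminator.
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') break;
            if (b != '\r') line.write(b);
        }
        return new String(line.toByteArray(), StandardCharsets.US_ASCII);
    }

    // Returns the body with the chunk framing removed.
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';');          // ignore chunk extensions
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            sizeLine = sizeLine.trim();
            if (sizeLine.isEmpty()) throw new IOException("missing chunk size");
            int size = Integer.parseInt(sizeLine, 16); // chunk sizes are hexadecimal
            if (size == 0) break;                      // a zero-size chunk ends the body
            byte[] chunk = new byte[size];
            int off = 0;
            while (off < size) {                       // read() may return fewer bytes than asked
                int n = in.read(chunk, off, size - off);
                if (n == -1) throw new IOException("truncated chunk");
                off += n;
            }
            body.write(chunk, 0, size);
            readLine(in);                              // consume the CRLF after the chunk data
        }
        while (!readLine(in).isEmpty()) {
            // skip optional trailer headers up to the final blank line
        }
        return body.toByteArray();
    }
}
```

Rewriting the payload without first removing this framing leaves the chunk sizes out of step with the data that follows them, which is the kind of mismatch that breaks replay.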
As a temporary workaround, jwarc-0.4.0's fetch command now uses an HTTP/1.0 request so the server doesn't use chunked encoding. Obviously that's not a good solution though, as WARCs created with other tools will still have it.
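To illustrate why the workaround helps: chunked transfer encoding was introduced in HTTP/1.1, so a server answering an HTTP/1.0 request should not use it. Below is a rough sketch of building such a request line and headers by hand. It is not jwarc's actual implementation; `Http10Request` and `build` are hypothetical names.

```java
import java.net.URI;

// Hypothetical helper, not part of jwarc: builds a raw HTTP/1.0 GET request.
final class Http10Request {

    static String build(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) path = "/";   // an empty path means the root
        if (uri.getRawQuery() != null) path += "?" + uri.getRawQuery();

        String host = uri.getHost();
        if (uri.getPort() != -1) host += ":" + uri.getPort();

        // Declaring HTTP/1.0 tells the server the client predates chunked encoding.
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"                // optional in 1.0, but needed by virtual hosts
                + "Connection: close\r\n"
                + "\r\n";
    }
}
```

The resulting string would be written to a socket connected to the target host (wrapped in TLS for https URLs), and the response body then runs until the server closes the connection.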
jwarc 0.5.0 is released and includes an experimental command which can capture a full page including subresources using headless Chrome:
#export BROWSER=/opt/google/chrome/chrome
java -jar jwarc-0.5.0.jar record https://www.cs.odu.edu/~mkelly/ > example.warc
This is just a demo really. A proper browser harness is well beyond the intended scope of jwarc. I'm probably going to make the recording proxy available as a Java API in a future release though, as it's something quite useful for building more sophisticated tools on top of.
@ato I am attempting to test out 0.5.0 but am having an issue getting jwarc to recognize my local Chrome on macOS 10.14.2. The link you provided in the README seems to indicate that /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome is the binary that corresponds to /opt/google/chrome/chrome in your example.
$ #export BROWSER=/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
$ java -jar jwarc-0.5.0.jar record https://www.cs.odu.edu/~mkelly/ > example.warc
WarcRecorder listening on localhost/127.0.0.1:62097
google-chrome --headless --disable-gpu --disable-breakpad --ignore-certificate-errors --proxy-server=localhost:62097 --hide-scrollbars https://www.cs.odu.edu/~mkelly/
Exception in thread "main" java.io.IOException: Cannot run program "google-chrome": error=2, No such file or directory
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
at org.netpreserve.jwarc.WarcTool.runBrowser(WarcTool.java:223)
at org.netpreserve.jwarc.WarcTool.access$200(WarcTool.java:21)
at org.netpreserve.jwarc.WarcTool$Command$4.exec(WarcTool.java:109)
at org.netpreserve.jwarc.WarcTool.main(WarcTool.java:29)
Caused by: java.io.IOException: error=2, No such file or directory
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
... 5 more
Uncomment the setting of the BROWSER environment variable by removing the '#' character. Sorry, I was trying to indicate it was optional to set it, but on second thought having it commented out is probably just confusing.
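The fallback seen in the log above (running "google-chrome" when BROWSER is not exported) can be pictured with the sketch below. This is an assumption about the general pattern, not jwarc's actual code: `BrowserCommand` and `resolve` are hypothetical names, and the environment is passed in as a map so the logic is easy to exercise.

```java
import java.util.Map;

// Hypothetical helper, not part of jwarc: picks the browser executable to launch.
final class BrowserCommand {

    // Default taken from the command shown in the log output above.
    static final String DEFAULT_BROWSER = "google-chrome";

    static String resolve(Map<String, String> env) {
        String browser = env.get("BROWSER");
        // A commented-out export leaves the variable unset, so the default is used.
        if (browser == null || browser.trim().isEmpty()) {
            return DEFAULT_BROWSER;
        }
        return browser;
    }
}
```

In a real program the map would come from `System.getenv()`. Note that the backslashes in the macOS path are shell escapes for the spaces; the value the process sees is the plain path with spaces.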
Great to see work on this @ato!
I am using the jwarc 0.3.0 .jar release and noticed only the root page is included. Perhaps this is by design. If not:
For example,
java -jar jwa-0.3.0.jar fetch https://www.cs.odu.edu/~mkelly/ > example.warc
does not capture any embedded images, CSS, etc. The WARC is replayable in a few replay systems (e.g., OpenWayback, Webrecorder Player) but does not appear to be replayable in the embedded one.
I tried to replay this WARC using the included
java -jar jwa-0.3.0.jar serve example.warc
but received a Service Unavailable in the browser when accessing http://localhost:8080