iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Write embedded resources to WARC #2

Closed machawk1 closed 5 years ago

machawk1 commented 5 years ago

Great to see work on this @ato!

I am using the jwarc 0.3.0 .jar release and noticed that only the root page is included. Perhaps this is by design. If not:

For example, java -jar jwa-0.3.0.jar fetch https://www.cs.odu.edu/~mkelly/ > example.warc does not capture any embedded images, CSS, etc.

The WARC is replayable in a few replay systems (e.g., OpenWayback, Webrecorder Player) but does not appear to be replayable in the embedded one.

I tried to replay this WARC using the included java -jar jwa-0.3.0.jar serve example.warc but received a Service Unavailable response in the browser when accessing http://localhost:8080.

ato commented 5 years ago

Thanks for trying it out. Yeah, it's not meant to be a complete crawler, just a building block, so it only saves the single specified resource without subresources. I should document that to make it clear though.

I am planning to include an API for a WARC recording proxy server in a later release though. This will include a demo command in the CLI tool which can use headless Firefox or Chrome to collect subresources.

Odd that you're getting service unavailable. Can I ask which OS and browser you're using?

machawk1 commented 5 years ago

macOS 10.14.2 and Chrome 72.

I have yet to re-test this by building from source; I was using the release. If you are unable to replicate the Service Unavailable response, I can look into it further.

ato commented 5 years ago

Ah. jwarc doesn't parse chunked encoding yet (#1), so when it injects the JavaScript for replay it clobbers the chunked header, resulting in a misleading 'Service Unavailable' error in the browser.
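
For readers unfamiliar with the wire format, here is a minimal sketch of decoding a `Transfer-Encoding: chunked` body, which a replay tool would need to do before rewriting the payload. The class name `ChunkedBodyDecoder` is hypothetical and this is not jwarc's implementation; it assumes Java 11+ for `InputStream.readNBytes(int)`.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of jwarc.
final class ChunkedBodyDecoder {

    /** Decodes a chunked HTTP body into its raw payload bytes. */
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            if (sizeLine == null) throw new EOFException("missing chunk size");
            // Chunk extensions (anything after ';') are ignored.
            int semi = sizeLine.indexOf(';');
            String hex = (semi >= 0 ? sizeLine.substring(0, semi) : sizeLine).trim();
            int size = Integer.parseInt(hex, 16);
            if (size == 0) break; // the zero-length chunk ends the body
            byte[] chunk = in.readNBytes(size);
            if (chunk.length < size) throw new EOFException("truncated chunk");
            out.write(chunk);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        // Skip optional trailer fields up to the terminating empty line.
        String trailer;
        while ((trailer = readLine(in)) != null && !trailer.isEmpty()) {
            // trailers are discarded in this sketch
        }
        return out.toByteArray();
    }

    /** Reads one line without its CRLF; returns null at end of stream. */
    private static String readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) return null;
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            if (b != '\r') line.write(b);
            b = in.read();
        }
        return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

After decoding and modifying the body, a server would typically drop the `Transfer-Encoding` header and send a `Content-Length` matching the new payload, so that the headers and the body agree again.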

As a temporary workaround, jwarc-0.4.0's fetch command now uses an HTTP/1.0 request so the server doesn't use chunked encoding. Obviously that's not a good solution, though, as WARCs created with other tools will still have it.
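
The workaround relies on chunked transfer coding being an HTTP/1.1 feature, so a server should not use it when answering an HTTP/1.0 request. As a rough sketch of what such a request head looks like on the wire (the class `Http10Request` is hypothetical and is not how jwarc's fetch command is necessarily written):

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of jwarc.
final class Http10Request {

    /** Builds the bytes of a minimal HTTP/1.0 GET request for the given URI. */
    static byte[] build(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) path = "/";
        if (uri.getRawQuery() != null) path += "?" + uri.getRawQuery();

        String host = uri.getHost();
        if (uri.getPort() != -1) host += ":" + uri.getPort();

        // Declaring HTTP/1.0 tells the server the client cannot handle chunked responses.
        String head = "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "Connection: close\r\n"
                + "\r\n";
        return head.getBytes(StandardCharsets.US_ASCII);
    }
}
```

The resulting bytes could be written to a plain or TLS socket connected to the target host; the response body then ends when the server closes the connection or after `Content-Length` bytes.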

ato commented 5 years ago

jwarc 0.5.0 is released and includes an experimental command which can capture a full page including subresources using headless Chrome:

#export BROWSER=/opt/google/chrome/chrome
java -jar jwarc-0.5.0.jar record https://www.cs.odu.edu/~mkelly/ > example.warc

This is just a demo really. A proper browser harness is well beyond the intended scope of jwarc. I'm probably going to make the recording proxy available as a Java API in a future release though, as it's something quite useful for building more sophisticated tools on top of.

machawk1 commented 5 years ago

@ato I am attempting to test out 0.5.0 but am having an issue getting jwarc to recognize my local Chrome on macOS 10.14.2. The link you provided in the README seems to indicate that /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome is the binary that corresponds to /opt/google/chrome/chrome in your example.


$ #export BROWSER=/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
$ java -jar jwarc-0.5.0.jar record https://www.cs.odu.edu/~mkelly/ > example.warc
WarcRecorder listening on localhost/127.0.0.1:62097
google-chrome --headless --disable-gpu --disable-breakpad --ignore-certificate-errors --proxy-server=localhost:62097 --hide-scrollbars https://www.cs.odu.edu/~mkelly/
Exception in thread "main" java.io.IOException: Cannot run program "google-chrome": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at org.netpreserve.jwarc.WarcTool.runBrowser(WarcTool.java:223)
    at org.netpreserve.jwarc.WarcTool.access$200(WarcTool.java:21)
    at org.netpreserve.jwarc.WarcTool$Command$4.exec(WarcTool.java:109)
    at org.netpreserve.jwarc.WarcTool.main(WarcTool.java:29)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
    ... 5 more

ato commented 5 years ago

Uncomment the setting of the BROWSER environment variable by removing the '#' character. Sorry, I was trying to indicate it was optional to set it, but on second thought having it commented out is probably just confusing.
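
To illustrate why the trace above shows `google-chrome` being launched, here is a sketch of the kind of lookup involved: use `BROWSER` if it is set, otherwise fall back to a default executable name. The class `BrowserCommand` and its method are hypothetical and not taken from jwarc's source; the flags are copied from the command line printed in the log above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not jwarc's actual code.
final class BrowserCommand {
    // Default seen in the log output when BROWSER is not set.
    static final String DEFAULT_BROWSER = "google-chrome";

    /** Builds the headless browser command line, honouring the BROWSER variable. */
    static List<String> build(Map<String, String> env, String proxy, String url) {
        String browser = env.get("BROWSER");
        if (browser == null || browser.isEmpty()) {
            browser = DEFAULT_BROWSER; // resolved via PATH, which fails if absent
        }
        List<String> cmd = new ArrayList<>();
        cmd.add(browser);
        cmd.add("--headless");
        cmd.add("--disable-gpu");
        cmd.add("--disable-breakpad");
        cmd.add("--ignore-certificate-errors");
        cmd.add("--proxy-server=" + proxy);
        cmd.add("--hide-scrollbars");
        cmd.add(url);
        return cmd;
    }
}
```

A caller might run it with `new ProcessBuilder(BrowserCommand.build(System.getenv(), "localhost:62097", url)).inheritIO().start()`. Because each argument is a separate list element, a path containing spaces, such as the macOS Chrome binary, needs no escaping at this level; the backslashes are only needed in the shell `export` line.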