iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0
48 stars 8 forks source link

jwarc failing to read Http body at the second pass of a file #80

Closed vlofgren closed 11 months ago

vlofgren commented 11 months ago

I'm running into an issue where the WarcReader behaves differently when reading the same file twice. It seems there is some global state that is not appropriately cleaned up perhaps. In the second pass, http().body().isOpen() returns false, when it returned true in the first pass; and attempting to read the data fails.

I'm on jwarc 0.28.4.

This is the smallest test case I've been able to produce:

    @Test
    public void spookyActionAtADistance() throws IOException {
        Path tempFile = Files.createTempFile("test", ".warc");
        try {
            StringBuilder sb = new StringBuilder();

            sb.append("WARC/1.0\r\n");
            sb.append("WARC-Type: revisit\r\n");
            sb.append("WARC-Target-URI: https://www.example.com/\r\n");
            sb.append("WARC-Date: 2023-12-12T20:31:13Z\r\n");
            sb.append("WARC-Record-ID: <urn:uuid:ee40237d-5fb4-41a7-aa2e-bfb67c57f5f0>\r\n");
            sb.append("Content-Length: 134\r\n");
            sb.append("Content-Type: application/http;msgtype=response\r\n");
            sb.append("WARC-Payload-Digest: sha1:KC62VKWCVTBH5CT636VIFBP3KTLP4UZM\r\n");
            sb.append("WARC-Block-Digest: sha1:T2C4UJZP7PHK75RWHCID3GDDKWX5J5UX\r\n");
            sb.append("\r\n");
            sb.append("HTTP/1.1 200 OK\r\n");
            sb.append("Content-Type: text/html\r\n");
            sb.append("Content-Length: 45\r\n");
            sb.append("Content-Encoding: UTF-8\r\n");
            sb.append("\r\n");
            sb.append("<?doctype html><html><body>test</body></html>");

            Files.writeString(tempFile, sb.toString());

            try (var reader = new WarcReader(tempFile)) {
                for (var record : reader) {
                    WarcRevisit revisit = (WarcRevisit) record;
                    var http = revisit.http();
                    try (var body = http.body()) {
                        System.out.println(body.isOpen()); // prints true
                        body.stream().readAllBytes(); // works fine
                    }
                }
            }

            try (var reader = new WarcReader(tempFile)) {
                for (var record : reader) {
                    WarcRevisit revisit = (WarcRevisit) record;
                    var http = revisit.http();
                    try (var body = http.body()) {
                        System.out.println(body.isOpen());  // prints false
                        body.stream().readAllBytes();  // exception !!
                    }
                }
            }

        }
        finally {
            Files.deleteIfExists(tempFile);
        }
    }
ato commented 11 months ago

Oops. Thanks for reporting this! The same empty MessageBody instance was being reused which of course breaks if it gets closed.

There might be a secondary issue here in that jwarc currently assumes all revisit records have no payload but perhaps that's a false assumption.

ato commented 11 months ago

Fix released as v0.28.5. It should sync to maven central in a couple of hours.

vlofgren commented 11 months ago

Thanks for the quick fix :D

I'll work around the revisit limitations now that I know they exist.