openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org
Other
161 stars 78 forks source link

WARC Invalid Error: Unexpected encapsulating '<' and '>' characters" #294

Open ross-spencer opened 6 years ago

ross-spencer commented 6 years ago

Dev Effort

2D

Description

Described in more detail here for JHOVE 1.17 and tested in 1.18 (WARC module didn't change). See Andy Jackson's comment further down the page.

https://gist.github.com/ross-spencer/ad3c31388683a2e411af6e21fd7a3547

Wget WARC example attached: opftest.warc.gz

nvanderperren commented 4 years ago

I think jhove is correct. The WARC-Target-URI shouldn't have angle brackets. See also: https://github.com/webrecorder/webrecorder-player/issues/57 and https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-target-uri

nvanderperren commented 4 years ago

oh no, sorry, I am wrong. uri needs to have the angle brackets.

carlwilson commented 4 years ago

Hi @nvanderperren. I assume this means you've selected your issue. Thanks very much and have assigned you.

marhop commented 4 years ago

If I may add some additional hints (sorry for being late, I was not aware of this issue):

WARC 1.0

The WARC 1.0 specification, ISO 28500:2009 is inconsistent regarding the use of encapsulating angle brackets. In sections 4, 5.2 and 5.12 these definitions can be found which do use angle brackets:

WARC-Record-ID = "WARC-Record-ID" ":" uri
WARC-Target-URI = "WARC-Target-URI" ":" uri
uri = "<" <'URI' per RFC3986> ">"

The examples in Annex C1. and C.2 of the same specification on the other hand use them only inconsistently:

WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
WARC-Target-URI: http://www.archive.org/images/logoc.jpg

IMHO this inconsistency makes strict validation against the WARC 1.0 spec impossible.

Examples from tools

The mess created by the WARC 1.0 spec becomes clearer when looking at different tools. Heritrix (3.3.0-SNAPSHOT-2018-01-25T00:33:21Z) development seems to have been guided by the examples rather than the formal specification:

WARC/1.0
WARC-Record-ID: <urn:uuid:1ffbefe8-dfe1-4f4a-9212-af7aca32dd99>
WARC-Target-URI: https://www.example.org/

Wget behaved completely weird in earlier versions (like 1.18 and 1.19.5):

WARC/1.0
WARC-Record-ID: <urn:uuid:134cafed-9cbc-41dc-b709-e843003db49d>
WARC-Target-URI: <https://example.org/>
WARC-Target-URI: https://example.org/

The first WARC-Target-URI in this example is from a request record, the second from a response record ... Luckily this was fixed in version 1.20.1 (or maybe a little earlier) where wget follows the formal spec, not the examples (other than Heritrix!):

WARC/1.0
WARC-Record-ID: <urn:uuid:134cafed-9cbc-41dc-b709-e843003db49d>
WARC-Target-URI: <https://example.org/>

JHOVE (tested with 1.20.1) accepts the WARC syntax written by Heritrix but not wget, so seems to follow the examples instead of formal spec as well.

WARC 1.1

Luckily, the WARC 1.1 specification, ISO 28500:2017 fixes this inconsistency in section 4 by adapting the formal spec to the examples (and to common practice, see this pull request):

WARC-Record-ID = "WARC-Record-ID" ":" "<" uri ">"
WARC-Target-URI = "WARC-Target-URI" ":" uri
uri = <'URI' per RFC3986>

With consistent examples in Annex B.1 and B.2:

WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
WARC-Target-URI: http://www.archive.org/images/logoc.jpg

I'm not saying this is prettier, but at least it is consistent. ;-) The specification even adds an explanation in section 4:

NOTE: in WARC 1.0 standard (ISO 28500:2009), uri was defined as "<" <'URI' per RFC 3986> ">". This rule has been changed to meet requests from implementers.

Conclusion

Everybody should switch to WARC 1.1! :-) And validation of WARC 1.0 has to be quite lax regarding angle brackets.

nvanderperren commented 4 years ago

I thought I couldn't reproduce this, but apparently Jhove doesn't automatically recognize WARC-files :)

I also have a lot of incorrect block digest errors with WARC's I created with Webrecorder, but I guess that's another issue.

@marhop, it seems that Jhove validates only on ISO 28500:2009, but incorrect then :)

carlwilson commented 4 years ago

Thanks very much for the intervention @marhop, even if it doesn't make life much easier in a practical sense. Your point regarding 1.0 validation is a good one and I'm happy to take the validation suggestion from someone who's better informed than I am. Unless I've been hard of understanding, it's clear that JHOVE shouldn't be failing this case, but you'd suggest remaining lenient where these aren't present?

marhop commented 4 years ago

Yes, exactly.

Personally and in theory, I would consider only the formal specification valid, not the examples from the annex. But this is hardly useful in practice because it would render all WARCs created by Heritrix invalid.

So I think the most sensible approach is to accept WARC-Target-URI fields (and maybe other URI fields where necessary?) both with and without surrounding angle brackets. This accounts for both formal specification and examples and for both Heritrix and wget. (I haven't looked into Webrecorder output yet.)

carlwilson commented 4 years ago

Excellent, and thanks very much @marhop. Nice work @nvanderperren, does this sound a feasible fix? Thanks also.

nvanderperren commented 4 years ago

that seems feasible, but it's validated by the org.jwat.warc-package. The plugin is outdated. I'll upgrade it to 1.1.1. and will check if it validates the WARC-Target-URI fields both with and without surrounding angle brackets. I can't find documentation in the org.jwat.warc package that gives me an idea if it accepts both.

carlwilson commented 4 years ago

Just to be sure you've seen this @nvanderperren do look at #388 please then.

nvanderperren commented 4 years ago

oh ok, what should I do then? Don't really how to solve this :)

nvanderperren commented 4 years ago

okay, updating it to 1.1.1. doesn't solve the error. I've added a test and it's not passed.

What I did

Added tests starting from this line: https://github.com/nvanderperren/jhove/blob/985dec29c4eed40022953b37c8d7887b590b8c1e/jhove-ext-modules/src/test/java/edu/harvard/hul/ois/jhove/module/WarcModuleTest.java#L883

I made a test and added test files for both wget in earlier versions (WARC-Target-URI only between angle brackets in requests) and for wget > 1.20 (WARC-Target-URI always between angle brackets). Both tests fail.

carlwilson commented 4 years ago

@nvanderperren oops, I haven't got the time to investigate now but will take a look before the day's out tomorrow.

MartinSpeller commented 4 years ago

WARC Invalid Error: Unexpected encapsulating '<' and '>' characters" #294 - Assigned to nvanderperren

nvanderperren commented 4 years ago

In het meanwhile discovered the lines in JWAT that causes this error:

WarcHeader.java L588

// URI_NAKED means that angle brackets are not allowed
// URI_LTGT means that angle brackets are required
warcTargetUriUri = fieldParsers.parseUri(fieldValue, URI_NAKED, warcTargetUriProfile, WarcConstants.FN_WARC_TARGET_URI);

WarcFieldParser L225

if (uriStrClean != null && uriStrClean.length() != 0) {
            int fIdx = 0;
            int tIdx = uriStrClean.length();
            if (uriStrClean.startsWith("<")) {
                ltGtBf |= 2;
                ++fIdx;
            }
            if (uriStrClean.endsWith(">")) {
                ltGtBf |= 1;
                --tIdx;
            }
            if (ltGtBf != 0) {
                uriStrClean = uriStrClean.substring(fIdx, tIdx);
            }
             // not here
            if (bLtGt) {
                switch (ltGtBf) {
                case 2:
                    addInvalidExpectedError("'" + field + "' value", uriStr, "Missing trailing '>' character");
                    break;
                case 1:
                    addInvalidExpectedError("'" + field + "' value", uriStr, "Missing leading '<' character");
                    break;
                case 0:
                    addInvalidExpectedError("'" + field + "' value", uriStr, "Missing encapsulating '<' and '>' characters");
                    break;
                case 3:
                default:
                    break;
                }
            } else {
                 // HERE!
                switch (ltGtBf) {
                case 2:
                    addInvalidExpectedError("'" + field + "' value", uriStr, "Unexpected leading '<' character");
                    break;
                case 1:
                    addInvalidExpectedError("'" + field + "' value", uriStr, "Unexpected trailing '>' character");
                    break;
                case 3:
                    addInvalidExpectedError("'" + field + "' value", uriStr, "Unexpected encapsulating '<' and '>' characters");
                    break;
                case 0:
                default:
                    break;
                }
            }