Open ross-spencer opened 6 years ago
I think jhove is correct. The WARC-Target-URI
shouldn't have angle brackets. See also: https://github.com/webrecorder/webrecorder-player/issues/57 and https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-target-uri
oh no, sorry, I am wrong. uri
needs to have the angle brackets.
Hi @nvanderperren. I assume this means you've selected your issue. Thanks very much and have assigned you.
If I may add some additional hints (sorry for being late, I was not aware of this issue):
The WARC 1.0 specification, ISO 28500:2009 is inconsistent regarding the use of encapsulating angle brackets. In sections 4, 5.2 and 5.12 these definitions can be found which do use angle brackets:
WARC-Record-ID = "WARC-Record-ID" ":" uri
WARC-Target-URI = "WARC-Target-URI" ":" uri
uri = "<" <'URI' per RFC3986> ">"
The examples in Annex C1. and C.2 of the same specification on the other hand use them only inconsistently:
WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
IMHO this inconsistency makes strict validation against the WARC 1.0 spec impossible.
The mess created by the WARC 1.0 spec becomes clearer when looking at different tools. Heritrix (3.3.0-SNAPSHOT-2018-01-25T00:33:21Z) development seems to have been guided by the examples rather than the formal specification:
WARC/1.0
WARC-Record-ID: <urn:uuid:1ffbefe8-dfe1-4f4a-9212-af7aca32dd99>
WARC-Target-URI: https://www.example.org/
Wget behaved completely weird in earlier versions (like 1.18 and 1.19.5):
WARC/1.0
WARC-Record-ID: <urn:uuid:134cafed-9cbc-41dc-b709-e843003db49d>
WARC-Target-URI: <https://example.org/>
WARC-Target-URI: https://example.org/
The first WARC-Target-URI in this example is from a request record, the second from a response record ... Luckily this was fixed in version 1.20.1 (or maybe a little earlier) where wget follows the formal spec, not the examples (other than Heritrix!):
WARC/1.0
WARC-Record-ID: <urn:uuid:134cafed-9cbc-41dc-b709-e843003db49d>
WARC-Target-URI: <https://example.org/>
JHOVE (tested with 1.20.1) accepts the WARC syntax written by Heritrix but not wget, so seems to follow the examples instead of formal spec as well.
Luckily, the WARC 1.1 specification, ISO 28500:2017 fixes this inconsistency in section 4 by adapting the formal spec to the examples (and to common practice, see this pull request):
WARC-Record-ID = "WARC-Record-ID" ":" "<" uri ">"
WARC-Target-URI = "WARC-Target-URI" ":" uri
uri = <'URI' per RFC3986>
With consistent examples in Annex B.1 and B.2:
WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
I'm not saying this is prettier, but at least it is consistent. ;-) The specification even adds an explanation in section 4:
NOTE: in WARC 1.0 standard (ISO 28500:2009), uri was defined as "<" <'URI' per RFC 3986> ">". This rule has been changed to meet requests from implementers.
Everybody should switch to WARC 1.1! :-) And validation of WARC 1.0 has to be quite lax regarding angle brackets.
I thought I couldn't reproduce this, but apparently Jhove doesn't automatically recognize WARC-files :)
I also have a lot of incorrect block digest
errors with WARC's I created with Webrecorder, but I guess that's another issue.
@marhop, it seems that Jhove validates only on ISO 28500:2009, but incorrect then :)
Thanks very much for the intervention @marhop, even if it doesn't make life much easier in a practical sense. Your point regarding 1.0 validation is a good one and I'm happy to take the validation suggestion from someone who's better informed than I am. Unless I've been hard of understanding, it's clear that JHOVE shouldn't be failing this case, but you'd suggest remaining lenient where these aren't present?
Yes, exactly.
Personally and in theory, I would consider only the formal specification valid, not the examples from the annex. But this is hardly useful in practice because it would render all WARCs created by Heritrix invalid.
So I think the most sensible approach is to accept WARC-Target-URI fields (and maybe other URI fields where necessary?) both with and without surrounding angle brackets. This accounts for both formal specification and examples and for both Heritrix and wget. (I haven't looked into Webrecorder output yet.)
Excellent, and thanks very much @marhop. Nice work @nvanderperren, does this sound a feasible fix? Thanks also.
that seems feasible, but it's validated by the org.jwat.warc
-package. The plugin is outdated. I'll upgrade it to 1.1.1. and will check if it validates the WARC-Target-URI fields both with and without surrounding angle brackets. I can't find documentation in the org.jwat.warc
package that gives me an idea if it accepts both.
Just to be sure you've seen this @nvanderperren do look at #388 please then.
oh ok, what should I do then? Don't really how to solve this :)
okay, updating it to 1.1.1.
doesn't solve the error. I've added a test and it's not passed.
What I did
added a test
@Test
public void parseValidWarcFileRequestWithAngleBrackets() throws Exception {
File warcFile = new File("src/test/resources/warc/valid-warcfile-requestwithanglebrackets.warc");
generalWellFormedChecks(warcFile);
}
java.lang.AssertionError: expected:<1> but was:<0>
at edu.harvard.hul.ois.jhove.module.WarcModuleTest.wellFormedCheck(WarcModuleTest.java:1039)
at edu.harvard.hul.ois.jhove.module.WarcModuleTest.generalWellFormedChecks(WarcModuleTest.java:1030)
at edu.harvard.hul.ois.jhove.module.WarcModuleTest.parseValidWarcFileRequestWithAngleBrackets(WarcModuleTest.java:888)
Added tests starting from this line: https://github.com/nvanderperren/jhove/blob/985dec29c4eed40022953b37c8d7887b590b8c1e/jhove-ext-modules/src/test/java/edu/harvard/hul/ois/jhove/module/WarcModuleTest.java#L883
I made a test and added test files for both wget in earlier versions (WARC-Target-URI only between angle brackets in requests) and for wget > 1.20 (WARC-Target-URI always between angle brackets). Both tests fail.
@nvanderperren oops, I haven't got the time to investigate now but will take a look before the day's out tomorrow.
In het meanwhile discovered the lines in JWAT that causes this error:
// URI_NAKED means that angle brackets are not allowed
// URI_LTGT means that angle brackets are required
warcTargetUriUri = fieldParsers.parseUri(fieldValue, URI_NAKED, warcTargetUriProfile, WarcConstants.FN_WARC_TARGET_URI);
if (uriStrClean != null && uriStrClean.length() != 0) {
int fIdx = 0;
int tIdx = uriStrClean.length();
if (uriStrClean.startsWith("<")) {
ltGtBf |= 2;
++fIdx;
}
if (uriStrClean.endsWith(">")) {
ltGtBf |= 1;
--tIdx;
}
if (ltGtBf != 0) {
uriStrClean = uriStrClean.substring(fIdx, tIdx);
}
// not here
if (bLtGt) {
switch (ltGtBf) {
case 2:
addInvalidExpectedError("'" + field + "' value", uriStr, "Missing trailing '>' character");
break;
case 1:
addInvalidExpectedError("'" + field + "' value", uriStr, "Missing leading '<' character");
break;
case 0:
addInvalidExpectedError("'" + field + "' value", uriStr, "Missing encapsulating '<' and '>' characters");
break;
case 3:
default:
break;
}
} else {
// HERE!
switch (ltGtBf) {
case 2:
addInvalidExpectedError("'" + field + "' value", uriStr, "Unexpected leading '<' character");
break;
case 1:
addInvalidExpectedError("'" + field + "' value", uriStr, "Unexpected trailing '>' character");
break;
case 3:
addInvalidExpectedError("'" + field + "' value", uriStr, "Unexpected encapsulating '<' and '>' characters");
break;
case 0:
default:
break;
}
}
Dev Effort
2D
Description
Described in more detail here for JHOVE 1.17 and tested in 1.18 (WARC module didn't change). See Andy Jackson's comment further down the page.
https://gist.github.com/ross-spencer/ad3c31388683a2e411af6e21fd7a3547
Wget WARC example attached: opftest.warc.gz