iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0
49 stars 72 forks source link

Do not add value of preceding HTTP header field if there is no value #74

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

If a HTTP header field is empty (or contains only white space), the HttpHeaderParser does not use the empty value but uses the value from the preceding HTTP header field. See commoncrawl/ia-web-commons#11 for an example and test data.

This PR fixes the problem and adds a unit test.

ldko commented 7 years ago

Looking at this, I think the code change is fine. However, I ran it with and without the change on the sample WARC file attached at commoncrawl/ia-web-commons#11 and found that in both cases I couldn't check the value of the Server header because the relevant records in the WAT files in both cases didn't have Actual-Content-Type or HTTP-Response-Metadata fields in the Payload-Metadata. I think this is because of this difference between your code and the iipc code regarding WARCs created with wget:

diff --git a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
index ad10be4..0afe16f 100644
--- a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
+++ b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
@@ -153,7 +153,10 @@ public class ExtractingResourceFactoryMapper implements ResourceFactoryMapper {
        private boolean isHTTPResponseWARCResource(MetaData envelope) {
                return childFieldEquals(envelope,WARC_HEADER_METADATA,
                                WARCConstants.CONTENT_TYPE,
-                               WARCConstants.HTTP_RESPONSE_MIMETYPE);
+                               WARCConstants.HTTP_RESPONSE_MIMETYPE)
+                       || childFieldEquals(envelope,WARC_HEADER_METADATA,
+                               WARCConstants.CONTENT_TYPE,
+                               WARCConstants.HTTP_RESPONSE_MIMETYPE_NS);
        }
        private boolean isWARCJSONResource(MetaData envelope) {
                return childFieldEquals(envelope,WARC_HEADER_METADATA,
diff --git a/src/main/java/org/archive/format/warc/WARCConstants.java b/src/main/java/org/archive/format/warc/WARCConstants.java
index 93a81f9..4f2fa57 100644
--- a/src/main/java/org/archive/format/warc/WARCConstants.java
+++ b/src/main/java/org/archive/format/warc/WARCConstants.java
@@ -209,7 +209,9 @@ public interface WARCConstants extends ArchiveFileConstants {
        "application/http; msgtype=request";
     public static final String HTTP_RESPONSE_MIMETYPE =
        "application/http; msgtype=response";
-    
+  public static final String HTTP_RESPONSE_MIMETYPE_NS =
+      "application/http;msgtype=response";                   // wget does this
+
     public static final String FTP_CONTROL_CONVERSATION_MIMETYPE =
         "text/x-ftp-control-conversation";

Does this seem correct to you, @sebastian-nagel ?

sebastian-nagel commented 7 years ago

Ok, I see. Of course, application/http;msgtype=request should also be covered. The fork commoncrawl/ia-web-commons contains still a couple of changes not merged into iipc/webarchive-commons, mainly the WETExtractor. I'm on course to push the remaining changes. But feel free to pull anything!

ldko commented 7 years ago

Ok, I will merge this along with the differences from WARCConstants.java and ExtractingResourceFactoryMapper.java. Thanks!