Closed sebastian-nagel closed 7 years ago
Looking at this, I think the code change is fine. However, I ran it with and without the change on the sample WARC file attached at commoncrawl/ia-web-commons#11 and found that in both cases I couldn't check the value of the Server
header because the relevant records in the WAT files in both cases didn't have Actual-Content-Type
or HTTP-Response-Metadata
fields in the Payload-Metadata
. I think this is because of this difference between your code and the iipc code regarding WARCs created with wget:
diff --git a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
index ad10be4..0afe16f 100644
--- a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
+++ b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
@@ -153,7 +153,10 @@ public class ExtractingResourceFactoryMapper implements ResourceFactoryMapper {
private boolean isHTTPResponseWARCResource(MetaData envelope) {
return childFieldEquals(envelope,WARC_HEADER_METADATA,
WARCConstants.CONTENT_TYPE,
- WARCConstants.HTTP_RESPONSE_MIMETYPE);
+ WARCConstants.HTTP_RESPONSE_MIMETYPE)
+ || childFieldEquals(envelope,WARC_HEADER_METADATA,
+ WARCConstants.CONTENT_TYPE,
+ WARCConstants.HTTP_RESPONSE_MIMETYPE_NS);
}
private boolean isWARCJSONResource(MetaData envelope) {
return childFieldEquals(envelope,WARC_HEADER_METADATA,
diff --git a/src/main/java/org/archive/format/warc/WARCConstants.java b/src/main/java/org/archive/format/warc/WARCConstants.java
index 93a81f9..4f2fa57 100644
--- a/src/main/java/org/archive/format/warc/WARCConstants.java
+++ b/src/main/java/org/archive/format/warc/WARCConstants.java
@@ -209,7 +209,9 @@ public interface WARCConstants extends ArchiveFileConstants {
"application/http; msgtype=request";
public static final String HTTP_RESPONSE_MIMETYPE =
"application/http; msgtype=response";
-
+ public static final String HTTP_RESPONSE_MIMETYPE_NS =
+ "application/http;msgtype=response"; // wget does this
+
public static final String FTP_CONTROL_CONVERSATION_MIMETYPE =
"text/x-ftp-control-conversation";
Does this seem correct to you, @sebastian-nagel ?
Ok, I see. Of course, application/http;msgtype=request
should also be covered. The fork commoncrawl/ia-web-commons contains still a couple of changes not merged into iipc/webarchive-commons, mainly the WETExtractor. I'm on course to push the remaining changes. But feel free to pull anything!
Ok, I will merge this along with the differences from WARCConstants.java and ExtractingResourceFactoryMapper.java. Thanks!
If a HTTP header field is empty (or contains only white space), the HttpHeaderParser does not use the empty value but uses the value from the preceding HTTP header field. See commoncrawl/ia-web-commons#11 for an example and test data.
This PR fixes the problem and adds a unit test.