Closed nurhafiz closed 1 year ago
I don't think the WARC standard can provide guidance in this area as this seems more a question about the design of a parser library rather than the WARC file format itself.
Personally, if I were designing a general library for reading WARC files, I would choose to include both an API to return the WARC-Identified-Payload-Type value from the file and a separate API to invoke the auto detector (if I were including one). This would allow the application author to choose which value to use based on what they're trying to do and their knowledge of the data provenance. Maybe they would want to trust the WARC-Identified-Payload-Type value for records from sources that use an identification method known to be accurate, but ignore it and redo the auto detection for WARC records from other sources known to use a worse identification method.
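To make that two-API idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `WarcRecord`, `identified_payload_type`, `detect_payload_type` and `payload_type` are hypothetical names, not taken from any existing WARC library, and the magic-byte table is a toy stand-in for a real detector.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

HEADER = "WARC-Identified-Payload-Type"

# Toy signature table (assumption): a real detector would cover far more types.
_MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


@dataclass
class WarcRecord:
    """Hypothetical, already-parsed record: header fields plus payload bytes."""
    headers: Dict[str, str] = field(default_factory=dict)
    payload: bytes = b""

    def identified_payload_type(self) -> Optional[str]:
        """API 1: return the header value as stored in the file, or None."""
        for name, value in self.headers.items():
            # Field names are compared case-insensitively in this sketch.
            if name.lower() == HEADER.lower():
                return value.strip()
        return None

    def detect_payload_type(self) -> Optional[str]:
        """API 2: ignore the header and sniff the payload bytes directly."""
        for magic, mime in _MAGIC:
            if self.payload.startswith(magic):
                return mime
        return None


def payload_type(record: WarcRecord, trust_declared: bool) -> Optional[str]:
    """Application-side policy: use the stored value only for trusted sources."""
    if trust_declared:
        declared = record.identified_payload_type()
        if declared is not None:
            return declared
    return record.detect_payload_type()


# Usage: the header says text/html, but the payload starts with a PDF signature.
rec = WarcRecord({HEADER: "text/html"}, b"%PDF-1.7 example")
print(payload_type(rec, trust_declared=True))
print(payload_type(rec, trust_declared=False))
```

The point of the split is that the library never decides which value wins; the application does, based on what it knows about where the records came from.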
I found more context to your question in a mentioning issue.
I took into account this from the specs:
The content-type of the record’s payload as determined by an independent check. This string shall not be arrived at by blindly promoting an HTTP Content-Type value up from a record block into the WARC header without direct analysis of the payload, as such values may often be unreliable.
(Emphasis mine)
iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-identified-payload-type
Hence, I assumed that auto detection must be done. What do you think?
This wording is intended to guide the behavior of the software that's creating the WARC-Identified-Payload-Type header and outputting it in a WARC file. A reader of a WARC file is of course free to make use of the value. After all, it'd be pretty pointless to have a header whose value a parser is forced to ignore. :-)
Thanks for the clarification and suggestions.
Hi,
The specs state the following about WARC-Identified-Payload-Type: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#warc-identified-payload-type
I'm wondering what a parser should do if it encounters a record that contains the header:
1) Ignore it, because its value should be derived automatically from the payload?
2) Parse it and return it as-is, even if its value differs from the one derived by an auto detector (if any)?
Thanks in advance.