iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Should WARC-Identified-Payload-Type be ignored if specified? #84

Closed nurhafiz closed 1 year ago

nurhafiz commented 1 year ago

Hi,

The specs state the following about WARC-Identified-Payload-Type:

The content-type of the record’s payload as determined by an independent check.

https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#warc-identified-payload-type

I'm wondering what a parser should do if it encounters a record that contains the header:

1) Ignore it because its value should be automatically derived based on the payload.

2) Parse and return as-is even if its value differs from the one derived from an auto detector, if any?

Thanks in advance.

ato commented 1 year ago

I don't think the WARC standard can provide guidance in this area as this seems more a question about the design of a parser library rather than the WARC file format itself.

Personally, if I were designing a general library for reading WARC files, I would include both an API that returns the WARC-Identified-Payload-Type value from the file and a separate API that invokes the auto detector (if I were including one). This would let the application author choose which value to use based on what they're trying to do and their knowledge of the data's provenance. For example, they might trust the WARC-Identified-Payload-Type value for records from sources that use an identification method known to be accurate, but ignore it and redo the auto detection for records from sources known to use a worse method.
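As a rough illustration of that two-API design, here is a minimal Python sketch. Everything in it is hypothetical and not taken from any existing WARC library: the `WarcRecord` class, `parse_warc_headers`, `sniff_payload_type` (a toy magic-number check standing in for a real detector), and the `trust_recorded` flag are all names invented for this example. The header parser is deliberately simplified and does not handle line folding.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def sniff_payload_type(payload: bytes) -> Optional[str]:
    """Toy detector (assumption): match a few well-known magic numbers."""
    signatures = [
        (b"%PDF-", "application/pdf"),
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"\xff\xd8\xff", "image/jpeg"),
        (b"GIF87a", "image/gif"),
        (b"GIF89a", "image/gif"),
    ]
    for magic, mime in signatures:
        if payload.startswith(magic):
            return mime
    return None  # inconclusive


def parse_warc_headers(block: bytes) -> Dict[str, str]:
    """Parse the named fields of a WARC header block (simplified, no folding)."""
    headers: Dict[str, str] = {}
    lines = block.decode("utf-8").split("\r\n")
    for line in lines[1:]:  # lines[0] is the version line, e.g. "WARC/1.1"
        if not line:
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return headers


@dataclass
class WarcRecord:
    headers: Dict[str, str]
    payload: bytes

    def identified_payload_type(self) -> Optional[str]:
        """API 1: the value recorded in the file, returned as-is."""
        return self.headers.get("warc-identified-payload-type")

    def detect_payload_type(self) -> Optional[str]:
        """API 2: run the detector on the payload, ignoring the header."""
        return sniff_payload_type(self.payload)


def choose_payload_type(record: WarcRecord, trust_recorded: bool) -> Optional[str]:
    """Application-level policy: the caller decides which value wins."""
    recorded = record.identified_payload_type()
    if trust_recorded and recorded is not None:
        return recorded
    # Untrusted source: redo detection, fall back to the recorded value.
    return record.detect_payload_type() or recorded


raw_headers = (
    b"WARC/1.1\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Identified-Payload-Type: text/plain\r\n"
    b"\r\n"
)
example = WarcRecord(parse_warc_headers(raw_headers), b"%PDF-1.7\n")
```

With this split the library never decides on the caller's behalf: `example.identified_payload_type()` reports what the file says, `example.detect_payload_type()` reports what the detector says, and a policy such as `choose_payload_type` lives in application code where the data provenance is known.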

ato commented 1 year ago

I found more context for your question in an issue that mentions this one:

I took into account this from the specs:

The content-type of the record’s payload as determined by an independent check. This string shall not be arrived at by blindly promoting an HTTP Content-Type value up from a record block into the WARC header without direct analysis of the payload, as such values may often be unreliable.

(Emphasis mine)

https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-identified-payload-type

Hence, I assumed that auto detection must be done. What do you think?

This wording is intended to guide the behavior of the software that creates the WARC-Identified-Payload-Type header and writes it to a WARC file. A reader of a WARC file is of course free to make use of the value. After all, it'd be pretty pointless to have a header whose value a parser is forced to ignore. :-)
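To make the writer-side reading concrete, here is a minimal Python sketch of how WARC-writing software might follow that wording. It is only an illustration under stated assumptions: `identified_payload_type_header` and `sniff` are hypothetical names, the magic-number check is a toy stand-in for a real detector, and omitting the header when detection is inconclusive assumes the field is optional.

```python
from typing import Dict, Optional


def sniff(payload: bytes) -> Optional[str]:
    """Toy detector (assumption): inspect the payload bytes directly."""
    if payload.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if payload.startswith(b"%PDF-"):
        return "application/pdf"
    return None  # inconclusive


def identified_payload_type_header(
    payload: bytes, http_content_type: Optional[str]
) -> Dict[str, str]:
    """Return the header to write, derived only from payload analysis.

    http_content_type is accepted to make the point explicit: it is
    deliberately never promoted into the WARC header.
    """
    detected = sniff(payload)
    if detected is None:
        # Omit the header rather than copy the HTTP Content-Type value.
        return {}
    return {"WARC-Identified-Payload-Type": detected}


# The server claimed text/html, but the bytes are a PNG image.
png_headers = identified_payload_type_header(b"\x89PNG\r\n\x1a\n\x00\x00", "text/html")
# Detection is inconclusive, so no header is written.
unknown_headers = identified_payload_type_header(b"some bytes", "text/html")
```

The constraint applies here, at write time. A reader that later encounters the header can use the value as recorded.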

nurhafiz commented 1 year ago

Thanks for the clarification and suggestions.