iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

WARC-Protocol field proposal #42

Open ato opened 6 years ago

ato commented 6 years ago

Motivation:

WARC-Protocol field definition

The WARC-Protocol field denotes the protocol of the original network message this record holds information about.

WARC-Protocol = "WARC-Protocol" ":" protocol-id
protocol-id = "dns"      ; DNS [RFC 1035]
            | "ftp"      ; FTP [RFC 959]
            | "gemini"   ; Gemini
            | "gopher"   ; Gopher [RFC 1436]
            | "http/0.9" ; HTTP/0.9
            | "http/1.0" ; HTTP/1.0 [RFC 1945]
            | "http/1.1" ; HTTP/1.1 [RFC 7230]
            | "h2"       ; HTTP/2 over TLS [RFC 7540]
            | "h2c"      ; HTTP/2 over cleartext TCP [RFC 7540]
            | "h3"       ; HTTP/3 [RFC 9114]
            | "spdy/1"   ; SPDY/1
            | "spdy/2"   ; SPDY/2
            | "spdy/3"   ; SPDY/3
            | "ssl/2"    ; SSLv2 aka SSL 2.0
            | "ssl/3"    ; SSLv3 aka SSL 3.0 [RFC 6101]
            | "tls/1.0"  ; TLS 1.0 [RFC 2246]
            | "tls/1.1"  ; TLS 1.1 [RFC 4346]
            | "tls/1.2"  ; TLS 1.2 [RFC 5246]
            | "tls/1.3"  ; TLS 1.3

If the protocol you wish to record is not on the list above please file an issue to propose a protocol identifier before using it.
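
A minimal sketch of checking a value against the identifier list above. The names `KNOWN_PROTOCOL_IDS` and `is_known_protocol_id` are hypothetical, and the handling of whitespace and letter case is an assumption, since the proposal does not specify either.

```python
# Identifiers copied from the protocol-id list in the proposal.
KNOWN_PROTOCOL_IDS = frozenset({
    "dns", "ftp", "gemini", "gopher",
    "http/0.9", "http/1.0", "http/1.1", "h2", "h2c", "h3",
    "spdy/1", "spdy/2", "spdy/3",
    "ssl/2", "ssl/3",
    "tls/1.0", "tls/1.1", "tls/1.2", "tls/1.3",
})


def is_known_protocol_id(value: str) -> bool:
    """Return True if value is one of the listed identifiers.

    Assumption: surrounding whitespace is ignored and matching is
    case-sensitive; the proposal text does not say either way.
    """
    return value.strip() in KNOWN_PROTOCOL_IDS


print(is_known_protocol_id("h2"))      # True
print(is_known_protocol_id("http/2"))  # False: the listed identifier is "h2"
```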

The WARC-Protocol field may be omitted when the protocol is unknown or can be unambiguously determined from some combination of the scheme portion of the WARC-Target-URI field, the Content-Type field and the message in the record block itself.

Multiple WARC-Protocol fields may be present to indicate protocol layering. For example, HTTP/1.1 over TLS 1.0 would be indicated by:

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.0
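
As an illustration, a reader could collect the repeated fields into a list along these lines. This is only a sketch: `protocol_layers` is a hypothetical helper, it works on already-split header lines rather than a real WARC parser, and it keeps the values in file order without assuming the order is significant.

```python
def protocol_layers(header_lines):
    """Collect every WARC-Protocol value from 'Name: value' lines, in file order."""
    layers = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        # Field names are compared case-insensitively here.
        if sep and name.strip().lower() == "warc-protocol":
            layers.append(value.strip())
    return layers


headers = [
    "WARC-Type: response",
    "WARC-Protocol: http/1.1",
    "WARC-Protocol: tls/1.0",
]
print(protocol_layers(headers))  # ['http/1.1', 'tls/1.0']
```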

The WARC-Protocol field does not indicate the format of the record block and is not a replacement for the Content-Type field. Different protocols may reuse the same media type. There are also situations where it may be desirable to represent the same message of a particular protocol using different types such as semantically equivalent text and binary forms.

The WARC-Protocol field may be used in 'request', 'response', 'resource', 'metadata' and 'revisit' records and shall not be used in 'warcinfo', 'conversion' and 'continuation' records.
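
A small sketch of how a validator might apply this rule. The names are hypothetical, and record types not mentioned in the proposal (for example extension types) are deliberately left unflagged, which is an assumption.

```python
# Record types taken from the paragraph above.
ALLOWED_TYPES = frozenset({"request", "response", "resource", "metadata", "revisit"})
FORBIDDEN_TYPES = frozenset({"warcinfo", "conversion", "continuation"})


def violates_proposal(warc_type: str, has_warc_protocol: bool) -> bool:
    """Return True if WARC-Protocol appears on a record type that shall not carry it."""
    return has_warc_protocol and warc_type in FORBIDDEN_TYPES


print(violates_proposal("warcinfo", True))  # True
print(violates_proposal("revisit", True))   # False
```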

Determining the protocol in the absence of WARC-Protocol

| URI Scheme | Content-Type         | Header version | Protocol                      |
|------------|----------------------|----------------|-------------------------------|
| dns        | text/dns             |                | dns ; transport unknown       |
| ftp        |                      |                | ftp ; over cleartext TCP      |
| gemini     | application/gemini † |                | gemini ; over TLS #85         |
| gopher     | application/gopher † |                | gopher ; over cleartext TCP   |
| http       | application/http     | absent         | http/0.9 ; over cleartext TCP |
| http       | application/http     | "HTTP/1.0"     | http/1.0 ; over cleartext TCP |
| http       | application/http     | "HTTP/1.1"     | http/1.1 ; over cleartext TCP |
| https      | application/http     | "HTTP/1.0"     | http/1.0 ; over TLS           |
| https      | application/http     | "HTTP/1.1"     | http/1.1 ; over TLS           |

† Not a registered media type but has been used in the wild.

When the WARC-Protocol field is present it takes precedence over the rules in the table above.
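
The fallback rules and the precedence rule can be sketched as a lookup. This is an illustration, not part of the proposal: `infer_protocol` and its parameters are hypothetical, the empty Content-Type cell for ftp is read as "any", media type parameters are ignored, and only the protocol identifier is returned, not the transport noted in the table.

```python
# (URI scheme, media type, HTTP version string or None) -> protocol identifier
_FALLBACK = {
    ("dns", "text/dns", None): "dns",
    ("gemini", "application/gemini", None): "gemini",
    ("gopher", "application/gopher", None): "gopher",
    ("http", "application/http", None): "http/0.9",
    ("http", "application/http", "HTTP/1.0"): "http/1.0",
    ("http", "application/http", "HTTP/1.1"): "http/1.1",
    ("https", "application/http", "HTTP/1.0"): "http/1.0",
    ("https", "application/http", "HTTP/1.1"): "http/1.1",
}


def infer_protocol(scheme, content_type=None, http_version=None, warc_protocol=None):
    """Return a protocol identifier, or None if the table has no matching row."""
    if warc_protocol is not None:
        return warc_protocol  # an explicit WARC-Protocol field takes precedence
    scheme = scheme.lower()
    if scheme == "ftp":
        return "ftp"  # assumption: the table leaves Content-Type blank for ftp
    # Drop parameters such as "; msgtype=response" before the lookup.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return _FALLBACK.get((scheme, media_type, http_version))


print(infer_protocol("https", "application/http; msgtype=response", "HTTP/1.1"))  # http/1.1
print(infer_protocol("http", "application/http"))                                 # http/0.9
```

Combinations the table does not cover, such as an https record with no version in the message, return None rather than a guess.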

Edit 2023-05-31: Added 'revisit' to list of allowed records.
Edit 2023-06-01: Added Gemini protocol as proposed by @acidus99 in #85.
Edit 2023-06-02: Added Gopher protocol as proposed by @TheTechRobo in #87.
Edit 2024-07-15: Added h3 (HTTP/3).

nlevitt commented 6 years ago

What should we say about other protocols not in your list? Seems to me it is desirable to allow other values, but we also want to avoid a complete free-for-all. Maybe we could say, please file a github issue here to propose a new protocol id, before you use it. Then at least there is one place to check for prior art.

ato commented 6 years ago

> Maybe we could say, please file a github issue here to propose a new protocol id, before you use it.

I think that's a great idea. I've updated the proposal text to include a link to an issue template.

ato commented 6 years ago

h2c and h2 are the obvious odd ones out in the list, as they don't follow the general name/version form, and h2 vs h2c is somewhat redundant with specifying the TLS version. I did it that way for consistency with the identifiers the RFC itself says to use in the HTTP Upgrade header and the ALPN protocol identifier field.

Also, I just made up the TLS protocol identifiers as I couldn't find anything semi-official. "SSLv3", "TLSv1.1" etc. seem somewhat common in software though (Java, OpenSSL), so I can see an argument that they might be a better choice. I don't think there's a right answer here. The slash form is better in the sense that you could consistently chop the version off. The "TLSvX" form is better in that you might not have to convert from whatever your TLS library reports. I couldn't see one argument as particularly more compelling than the other, so I just picked one.
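
To illustrate the "chop the version off" point, here is a sketch with a hypothetical helper, `split_protocol_id`. It also shows why h2 and h2c are the odd ones out: they carry no separable version.

```python
def split_protocol_id(protocol_id: str):
    """Split 'name/version' into (name, version); version is None when there is no slash."""
    name, _, version = protocol_id.partition("/")
    return name, version or None


print(split_protocol_id("tls/1.2"))  # ('tls', '1.2')
print(split_protocol_id("h2"))       # ('h2', None)
```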

ato commented 5 years ago

After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, as it is also a list of version numbers, should be consistent with it. I keep going back and forth on it, as I think there are arguments either way.

In favour of a single field in the style of User-Agent:

In favour of repeated fields:

acidus99 commented 1 year ago

I have a question about which record types the WARC-Protocol header, as well as the WARC-TLS-Cipher-Suite header mentioned/proposed by @ato here, should appear on.

The most similar, already defined header I could think of to this is WARC-IP-Address. Section 5.10 of the 1.1 spec says "the numeric Internet address contacted to retrieve any included content" and can be associated with request and response records. But all the examples in the spec only show the WARC-IP-Address header on response records, and I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

(Which is kind of weird if you think about it from an order-of-operations perspective. The IP address of the system must be known before the request is made, so it's odd that the convention is to include the WARC-IP-Address header on response instead of the request.)

It feels like the WARC-Protocol and WARC-TLS-Cipher-Suite headers should go where the WARC-IP-Address header goes, but I'm really curious about the community's feedback.

JustAnotherArchivist commented 1 year ago

> I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

Here are some tools that do: wget, wpull, qwarc, Zeno, warcio (at least when using warcio.capture_http). I'm sure there are more. Heritrix and warcprox don't. If you want some real-world example WARCs, the ArchiveTeam collection on the Internet Archive is full of them.

I think that they should be allowed on both request and response records. As for why you might want to record it on the request record: consider the case where you send a request but never receive a response. It is still worth recording this attempted request (and note the lack of a response in the log accompanying the crawl), including the relevant details like IP and protocol.

ato commented 1 year ago

I just realized I missed the 'revisit' record type in the WARC-Protocol proposal, so have edited it to be included. After this edit WARC-Protocol is allowed on the same record types as WARC-IP-Address (‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’).

Some reasons for allowing it on multiple record types:

> it's odd that the convention is to include the WARC-IP-Address header on response instead of the request

It's likely because:

  1. The older ARC file format did not store the request but did store the IP address.
  2. Before the advent of browser-based crawling, request records were usually completely ignored and not indexed for replay. So if you're going to put it in just one record then choosing the response record would make it more easily accessible to replay tools.

acidus99 commented 1 year ago

Excellent, thanks for the context. I ended up including them on both request and response records.

ikreymer commented 1 month ago

> After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, as it is also a list of version numbers, should be consistent with it. I keep going back and forth on it, as I think there are arguments either way.

It seems like this hasn't been decided one way or another, but I would very much be in favor of a single field, as that makes representing WARC headers as a dictionary object much easier and more concise. Are there other WARC headers that allow repetition currently?

The repeatable Set-Cookie and Link HTTP headers require special parsing, but also have custom semantics that make it sensible to keep them separate. As this is a much simpler header, I think a comma-separated value list makes a lot of sense, in line with other headers like Accept*, Vary, etc.
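
A minimal sketch of what the single-field variant could look like to a consumer that stores headers in a dictionary. The comma-separated syntax shown is hypothetical, since nothing in the proposal defines it yet, and `parse_protocol_value` is an illustrative name.

```python
def parse_protocol_value(value: str):
    """Split a comma-separated WARC-Protocol value into a list of identifiers."""
    return [part.strip() for part in value.split(",") if part.strip()]


# With a single field, one dictionary entry per header name is enough.
record_headers = {
    "WARC-Type": "response",
    "WARC-Protocol": "http/1.1, tls/1.0",  # hypothetical single-field form
}
print(parse_protocol_value(record_headers["WARC-Protocol"]))  # ['http/1.1', 'tls/1.0']
```

With repeated fields the same dictionary would need a list per header name, or would silently keep only the last value, which is the parsing cost being weighed here.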

ato commented 1 month ago

> Are there other WARC headers that allow repetition currently?

WARC-Concurrent-To is the only one in the standard:

> As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the same WARC record.

The only other standard headers that would seem to make sense to repeat are the payload/block digest headers for different algorithms. But that's not allowed currently.

Repetition of extension headers was also discussed in #95. I haven't seen any other extension headers in the wild that use repetition or comma separated lists so far.

These aren't WARC record headers, but Heritrix uses repeated fields in application/warc-fields metadata records to record extracted links.

wumpus commented 1 month ago

I'm in favor of a single field, comma-separated.

Note that the clock has pretty much ticked out on this discussion... the minute that a large web player starts discriminating against crawling with http/1.1 and less so against crawling with http/2, we have to switch immediately.