seanmonstar / httparse

A push parser for the HTTP 1.x protocol in Rust.
https://docs.rs/httparse
Apache License 2.0
567 stars 111 forks source link

[Question] Why Is the HTTP Version Limited to 1.1? #150

Open commonsensesoftware opened 5 months ago

commonsensesoftware commented 5 months ago

First, thanks for a great library. This is exactly what I was looking for. In my use case, I'm parsing nested HTTP messages in a multipart document.

I've looked at the code, comments, and previous issues. For the life of me, I cannot figure out or understand why this library is limited explicitly to HTTP/1.1. I also noticed that it will parse HTTP/1.0, but it won't parse HTTP/1, which is off-spec with the latest HTTP revision in RFC 9110§2.5. It has only been ~18 months since that specification was ratified so perhaps you just weren't privy to it. For quick reference, the specification now says:

When a major version of HTTP does not define any minor versions, the minor version "0" is implied. The "0" is used when referring to that protocol within elements that require a minor version identifier.

That means that HTTP/1 should be allowed.

In my specific case, I expect all top level messages to be HTTP/2. That level doesn't go through parsing so that's not a problem. However, I would expect nested HTTP messages to use the same version as the root level. I could just accept the nested message have to be HTTP/1.0 or HTTP/1.1, but that seems a bit odd and specific. I'd also hate to fork the repo just for this change. It feel common and adhering to the spec appears to be a design goal.

It's also worth mentioning that this library won't parse HTTP/0.9 either, which is still something supported by web servers (ex: hyper).

I don't mind contributing with a PR. I wanted to make sure I understood if there was a specific reason for this behavior. Furthermore, I wanted to get agreement before submitting anything. Ultimately, the HTTP version should probably be f32 or two u8 components. I have a few ideas and how to deal with back-compat, but I'll wait for the reply.

Thanks!

seanmonstar commented 5 months ago

Hey there! Thanks for the write-up.

When a major version of HTTP does not define any minor versions, the minor version "0" is implied.

This is saying when a specification defines a version, such as HTTP/2, if it doesn't define it with a minor version, then it can be treated as HTTP/2.0 when required. However:

So, if you look at RFC 9112§2.3, it explicitly defines how the version must be parsed:

  HTTP-version  = HTTP-name "/" DIGIT "." DIGIT
  HTTP-name     = %s"HTTP"

Therefore, a message that starts like GET / HTTP/1\r\n is not valid.

I would expect nested HTTP messages to use the same version as the root level. I could just accept the nested message have to be HTTP/1.0 or HTTP/1.1, but that seems a bit odd and specific.

I don't quite understand what you're referring to with this, sorry.

It's also worth mentioning that this library won't parse HTTP/0.9 either.

That's correct.

commonsensesoftware commented 5 months ago

I suppose that's fair. RFC 9110 is the new spec for all of HTTP, but specs like RFC 9112 update the existing specs for older versions. Since older specs require <major>.<minor>, it makes sense that the minor version would be required since it would break older clients. RFC 9110 should indicate that the new semantics only apply to HTTP/2 and beyond.

Let me outline my scenario a bit more for you. I'm parsing multipart/mixed messages. Sadly, there are no existing crates for this. All of the existing crates only support multipart/form-data (which is a common, but limited use case). A message would look like this:

POST /batch HTTP/2
Content-Length: 420
Content-Type: multipart/mixed; boundary=9036ca8fc2f1473091f5ed273ef1b472

--9036ca8fc2f1473091f5ed273ef1b472
Content-Type: application/http; msgtype=request
Content-Length: 120

POST /item HTTP/2
Content-Type: application/json
Content-Length: 42

{"id":42,"name":"Example 1"}
--9036ca8fc2f1473091f5ed273ef1b472
Content-Type: application/http; msgtype=request
Content-Length: 120

POST /item HTTP/2
Content-Type: application/json
Content-Length: 42

{"id":43,"name":"Example 2"}
--9036ca8fc2f1473091f5ed273ef1b472
Content-Type: application/http; msgtype=request
Content-Length: 120

POST /item HTTP/2
Content-Type: application/json
Content-Length: 42

{"id":44,"name":"Example 3"}
--9036ca8fc2f1473091f5ed273ef1b472--

I have all the functionality built to parse the parts, but it doesn't make sense to create my own parser for the HTTP request message.

I would expect nested HTTP messages to use the same version as the root level. I could just accept the nested message have to be HTTP/1.0 or HTTP/1.1, but that seems a bit odd and specific.

I don't quite understand what you're referring to with this, sorry.

It makes logical sense that a client would use/generate whatever HTTP version they are currently connected to. If they connect with HTTP/2 or otherwise know that they will, it's more than reasonable to expect that they would use HTTP/2. For my purposes, and in general, it doesn't really matter which HTTP version is specified at this point. I can workaround the issue by forcing the start line to always use HTTP/1.0 for a part that is application/http; msgtype=request. This requirement is, unfortunately, not that obvious to clients and there doesn't seem to be an escape hatch for parsing other HTTP versions.

This library covers the 99% of use cases that people are interested in which is a high perf parse of the start line, headers, and start of the body. It doesn't, nor is intended to, deal with connections, negotiations, prologue, epilogue, or even trailer headers (I believe). All that being said, I believe it is possible to support the same parsing semantics for a message that says it is HTTP/2 , HTTP/3, or some future version without sacrificing any of the perform it already has. Supporting it would enable potential callback behavior for edge cases or simply put full control in the user's hands as to what special handling is required for other HTTP versions, if any.

Ultimately, there doesn't appear to be a compelling reason to not support parsing any version. I can't think of any other version-specific parsing behavior required, but limiting that to the behavior of HTTP/1.1 seems more than reasonable now and - perhaps - forever. Since version: u8 already exists as a member, I would suggest adding a full_version or http_version as a simple f32 or something like:

use std::fmt::{Debug, Display, Formatter, Result as FormatResult};

#[derive(Clone, Copy, Eq, Ord)]
pub struct HttpVersion(u8,Option<u8>);

impl HttpVersion {
    pub fn new(major: u8, minor: Option<u8>) -> Self {
        Self(major, minor)
    }

    pub fn major(&self) -> u8 {
        self.0
    }

    pub fn minor(&self) -> u8 {
        self.1.unwrap_or_default()
    }
}

impl PartialEq for HttpVersion {
    fn eq(&self, other: &Self) -> bool {
        self.0 == other.0
        && self.1.unwrap_or_default() == other.1.unwrap_or_default()
    }
}

impl PartialOrd for HttpVersion {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match self.0.partial_cmp(&other.0) {
            Some(Ordering::Equal) =>
                self.1.unwrap_or_default()
                      .partial_cmp(&other.1.unwrap_or_default()),
            ord => return ord,
        }
    }
}

impl Display for HttpVersion {
    fn fmt(&self, f: &mut Formatter<'_>) -> FormatResult {
        self.0.fmt(f)?;

        if let Some(minor) = self.1 {
            f.write('.')?;
            minor.fmt(f)?;
        }
    }
}

impl Debug for HttpVersion {
    fn fmt(&self, f: &mut Formatter<'_>) -> FormatResult {
        f.debug_struct("HttpVersion")
         .field("major", &self.0)
         .field("minor", &self.1.unwrap_or_default())
         .finish()
    }
}

This approach would still be inline with avoiding allocations since everything would still be on the stack.

The logical way to think about this approach (at least in my small, 🐿️ 🧠) is the parse message can be any HTTP version, but the parsing semantics are [currently or forever] restricted to HTTP 1.1.

Thoughts?

seanmonstar commented 5 months ago

Oh, I see what you mean now by nested messages, multipart/mixed. That made it all click, I knew I must have been missing something.

So, on one level, I don't believe that is valid HTTP/2, since HTTP/2 defines a binary protocol with HEADERS and DATA frames.

Ultimately, there doesn't appear to be a compelling reason to not support parsing any version.

The reason so far is that HTTP/1.0 and HTTP/1.1 have specifically defined parsing rules, and the parser doesn't "know" about the rules for any other version.

commonsensesoftware commented 3 months ago

Apologies for the long delay in continuing the conversation.

With respect to the underlying transport mechanisms of HTTP, I completely agree with you; however, that is not where this library sits (as far as I can see). This library can absolutely handle parsing and processing content in total or as it is streamed in, but the specific HTTP transports are handled by some underlying, lower level crate; for example, hyper.

To further my stance, this library does not support parsing chunks or trailing headers which are part of the HTTP/1.1 spec. I'm not saying that it should. I'm saying it doesn't need to. A lower-level crate, such as hyper, would handle that and stream the decoded byte stream in which can then be parsed by this library.

I don't believe this library is meant to parse content off of a raw socket; therefore, how the underlying transport mechanics, be it binary headers and/or data frames, is irrelevant. With a very small modification, this library could parse any HTTP message from an incoming/outgoing stream provided by a lower-level library or from another source such as a file.

I might be missing something else so feel free to correct me. There are a few minor variances in HTTP message format by version. The most obvious is the response status reason, which you already handle because it's been optional for a while, but now it's unused. Other, obscure differences such as line folding and trailers are perhaps interesting, but probably not common or strictly necessary. If that was supported some day, it would be a nice-to-have.

This is an impressive crate and I'd ❤️ to see its reach extend even further, if not just for my own selfish reasons. 😄