quicwg / qlog

The IETF I-D documents for the qlog format

parsing the `serialization_format` field requires prior knowledge of the serialization format #435

Open marten-seemann opened 1 month ago

marten-seemann commented 1 month ago

From section 5:

<RS>{
  "file_schema": "urn:ietf:params:qlog:file:sequential",
  "serialization_format": "application/qlog+json-seq",
  "title": "Name of JSON Text Sequence qlog file (short)",
  "description": "Description for this trace file (long)",
  "trace": {
    "common_fields": {
      "protocol_type": ["QUIC", "HTTP3"],
      "group_id": "127ecc830d98f9d54a42c4f0842aa87e181a",
      "time_format": "relative",
      "reference_time": 1553986553572
    },
    "vantage_point": {
      "name": "backend-67",
      "type": "server"
    }
  }
}

If I don't know that the file is serialized as JSON-SEQ, I won't be able to parse the header that tells me that it's serialized as JSON-SEQ.

What would we lose by not including the serialization_format?

LPardue commented 1 month ago

The text states

In order to make it easier to parse and identify qlog files and their serialization format, the "file_schema" and "serialization_format" fields and their values SHOULD be in the first 256 characters/bytes of the resulting log file.

So in theory, a tool could try to regex match or magic number match for a binary format, based on these things.

It is a bit of a chicken-and-egg thing though. Currently, my parser just looks at the file extension and then does a trial decode with a fallback.
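For concreteness, here is a minimal sketch (in Python) of what such a peek could look like. Everything in it is an assumption rather than something the draft mandates: the sniff_qlog helper is made up, and the "application/qlog+json" media type for plain JSON is taken to exist alongside the "+json-seq" one shown in the example above.

# Sketch only: peek at the first 256 bytes, as the draft recommends, and pick
# a parser from the advertised serialization_format. Check the longer
# "+json-seq" media type first, since the shorter "+json" type is a prefix of it.

def sniff_qlog(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(256)
    if b'"serialization_format"' in head:
        if b"application/qlog+json-seq" in head:
            return "json-seq"
        if b"application/qlog+json" in head:
            return "json"
    # No usable hint in the first 256 bytes: fall back to the file extension,
    # trial decoding, or magic-number sniffing.
    return "unknown"

This is essentially the cheap string lookup described in the next comment: the tool decides which (possibly expensive) parser to initialize before involving a complete parser at all.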

rmarx commented 1 month ago

So, to an extent you're right @marten-seemann: you need some idea of what you're dealing with (i.e., binary vs. plaintext vs. something like JSON). However, basing this only on, for example, the file extension is too naive (especially since people use .json EVERYWHERE).

The current code in qvis imo shows this well, since I want to support both netlog files (also annoyingly .json) and various JSON-based qlog variants (normal JSON, JSON-SEQ, NDJSON).

I don't want to have to "trial parse" each of those options, since the parsers might be somewhat expensive to initialize, don't necessarily work in a streaming fashion, or might even have to be offloaded to the backend. Having the serialization_format in the first x chars helps me make a very educated decision, using a simple string lookup, on how to handle the file without having to involve a complete parser yet.

While I agree it's not ideal, it's a good practical solution that I'd like to keep. The recent switch to making it equivalent to the media types is, I feel, elegant, clear, and consistent in ways that the previous qlog_format wasn't.

marten-seemann commented 1 month ago

The current code in qvis imo shows this well, since I want to support both netlog files (also annoyingly .json) and various JSON-based qlog variants (normal JSON, JSON-SEQ, NDJSON).

I don't want to have to "trial parse" each of those options, since the parsers might be somewhat expensive to initialize, don't necessarily work in a streaming fashion, or might even have to be offloaded to the backend. Having the serialization_format in the first x chars helps me make a very educated decision, using a simple string lookup, on how to handle the file without having to involve a complete parser yet.

I totally get that, though I'd argue that this is not the situation we should optimize for.

LPardue commented 1 month ago

My tool also supports netlog json.

New documents could define new log formats or serialization formats.

The peeking code that Robin suggests is a simple way to accommodate a range of future possibilities.

marten-seemann commented 1 month ago

Is there precedent for this kind of peeking logic in other IETF serialization formats, or other data formats outside of the IETF? This seems extremely hacky to me.

LPardue commented 1 month ago

Yes, this is typically referred to as "content sniffing" or "MIME sniffing" - it has drawbacks (that we should document further if we are keeping the text) but to my knowledge is commonly used to work around imperfect or incorrect metadata.

Wikipedia highlights that the unix file command is such a sniffer. See https://www.darwinsys.com/file/ and https://github.com/file/file. This relies on libmagic, which is commonly distributed with a database of entries (e.g., at /usr/share/misc/magic.mgc). For example, I can do this:

$ file pcap.pcap
pcap.pcap: pcap capture file, microsecond ts (little-endian) - version 2.4 (Ethernet, capture length 1500)

and

$ file pcap_no_extension 
pcap_no_extension: pcap capture file, microsecond ts (little-endian) - version 2.4 (Ethernet, capture length 1500)

As far as I understand, they accept new contributions, so it's entirely feasible we could add entries for qlog serialization formats. In future, a binary encoding format could define some additional magic numbers to aid such sniffing. Registration requests for media types use the standard registration template, which includes an optional Magic number(s) part.
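As a purely illustrative sketch of what a magic-number style check could look like for the two existing text serializations (nothing here is defined by the draft, and the sniff_by_magic helper is hypothetical): a JSON-SEQ record begins with a 0x1E record separator, shown as <RS> in the example at the top of this issue, while a plain JSON file starts with '{' after optional whitespace.

# Illustrative magic-number check for the current text serializations.
# 0x1E is the record separator that JSON text sequences place before each
# record; a leading '{' suggests a plain JSON document. A future binary
# format could add its own magic bytes to this kind of check.

def sniff_by_magic(head: bytes) -> str:
    stripped = head.lstrip(b" \t\r\n")
    if stripped.startswith(b"\x1e"):
        return "json-seq"
    if stripped.startswith(b"{"):
        return "json"
    return "unknown"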

Beyond file, other mentions of sniffing include, for example, https://www.rfc-editor.org/rfc/rfc9110.html#section-8.3, which has some text about how to deal with content and Content-Type headers, stating:

In practice, resource owners do not always properly configure their origin server to provide the correct Content-Type for a given representation. Some user agents examine the content and, in certain cases, override the received type (for example, see [Sniffing]). This "MIME sniffing" risks drawing incorrect conclusions about the data, which might expose the user to additional security risks (e.g., "privilege escalation"). Furthermore, distinct media types often share a common data format, differing only in how the data is intended to be processed, which is impossible to distinguish by inspecting the data alone. When sniffing is implemented, implementers are encouraged to provide a means for the user to disable it.

The MIME sniffing doc itself is very detailed. To pick some relevant parts, there's https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type that describes a matching algorithm, which is run over the resource header, defined in https://mimesniff.spec.whatwg.org/#reading-the-resource-header. As a snippet:

To read the resource header, perform the following steps:

Let buffer be a byte sequence.

Read bytes of the resource into buffer until one of the following conditions is met:
- the end of the resource is reached.
- the number of bytes in buffer is greater than or equal to 1445.
- a reasonable amount of time has elapsed, as determined by the user agent.
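Translated into code, that resource-header read might look roughly like this (a sketch only: the 1445-byte cap comes from the WHATWG text above, while the 100 ms budget is an arbitrary stand-in for "a reasonable amount of time"):

import time

# Rough sketch of the WHATWG "read the resource header" step for a local,
# binary-mode file object: collect bytes until the end of the resource,
# until the buffer reaches 1445 bytes, or until the time budget runs out.

def read_resource_header(f, max_bytes: int = 1445, budget_s: float = 0.1) -> bytes:
    buffer = b""
    deadline = time.monotonic() + budget_s
    while len(buffer) < max_bytes and time.monotonic() < deadline:
        chunk = f.read(max_bytes - len(buffer))
        if not chunk:  # end of the resource reached
            break
        buffer += chunk
    return buffer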

There looks to have been an attempt to write something up in an I-D, but it seems it wasn't adopted: https://datatracker.ietf.org/doc/html/draft-abarth-mime-sniff-06. Not sure why; maybe someone more familiar with the history knows. But AFAIK web content sniffing is commonplace.

marten-seemann commented 1 month ago

Very interesting, thank you for researching this @LPardue! It's kind of sad that we don't have a better way of determining the file type, but it seems like we're not doing something that's outrageously out of line here.