Now, I guess we could try convincing mitmproxy that the HAR export should encode the body as base64 in cases with null bytes. But then we'd also have to keep that in mind if we ever switch to another HAR producer.
Or I guess we could manually coerce the HARs in CA. But all of that would be really annoying.
And, also, I really don't think it is… correct. At least in JS and Python land, having null bytes in JSON strings is perfectly fine. I feel like we should be able to handle those JSONs.
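To illustrate with a quick Node sketch (nothing project-specific):

```ts
// JSON containing a \u0000 escape parses and re-serializes without issues in JS.
const parsed = JSON.parse('{"text": "foo\\u0000bar"}');
console.log(parsed.text.length);     // 7, the string contains an actual NUL character
console.log(JSON.stringify(parsed)); // {"text":"foo\u0000bar"}, round-trips fine
```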
Perhaps the better way forward is to reconsider storing the HAR in a JSON column in the first place. We don't actually have any use for that. We are only storing the HAR in order to attach it to the complaints and other messages, but we never actually look inside it in the context of `platform`. That is only done by TrackHAR, which can handle these JSONs just fine.[^view-requests]

[^view-requests]: In the future, I do think it would be nice to publish the requests observed in analyses. However, I don't think we should reinvent that functionality in `platform`. Instead, we should dump the HARs into data.tweasel.org and link there.
Given that we are treating the HARs as a blob here, we should probably just also store them as such.
Actually, it's probably for the best anyway, since I would expect `jsonb` not to preserve formatting (but we specify a hash of the HAR in the reports).
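That hash only stays meaningful if we hand out byte-for-byte what we hashed. As a rough sketch (using Node's `crypto`; SHA-256 and the function name are just placeholders, not necessarily what the reports use):

```ts
import { createHash } from 'node:crypto';

// Hash the HAR exactly as we store and attach it. If the database re-serialized
// the JSON (as jsonb would), the bytes and thus the hash would change.
const harHash = (har: string) => createHash('sha256').update(har).digest('hex');
```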
The only thing I'm not quite certain about is whether we should store them as `bytes` or just a plain `string`. I'm leaning towards `string`, since the HARs will definitely always be valid UTF-8 and `bytes` seems a little more annoying to deal with.
I also want to define a size limit—we obviously can't store arbitrarily large HARs.
The HARs from `monkey-april-2023` range from 407 bytes (empty) to 403.8 MB. That's quite the range!
400 MB seems like way too much but I'm unsure as to what an appropriate cut-off would be.
I was quite intimidated by the huge files, but GPT-4o helped me put this into perspective:
50 MB strikes me as a decent limit: That allows for 99% of everything we've seen so far. And since the median is ~1 KB, I'm not too worried about how much storage we're going to need for the extremes. We can always adjust later if we find the limit doesn't work well for us.
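Enforcing the limit before insertion would be straightforward; a minimal sketch (the constant and error handling are placeholders, and 50 MB is just the proposal above):

```ts
// Reject HARs above the proposed limit before they ever hit the database.
const MAX_HAR_BYTES = 50 * 1024 * 1024;

const assertHarSize = (har: string) => {
    const size = Buffer.byteLength(har, 'utf-8');
    if (size > MAX_HAR_BYTES) throw new Error(`HAR too large: ${size} bytes (limit: ${MAX_HAR_BYTES}).`);
};
```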
Since HARs are just JSON with a spec, I was planning on storing our HARs in a JSON column. However, with the very first one I tried inserting, I got the following error:
Here's an excerpt of the HAR section that is the culprit:
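(Illustrative reconstruction of the relevant shape, with made-up values: a binary response body exported into `content.text` instead of being base64-encoded.)

```json
{
    "response": {
        "content": {
            "mimeType": "application/octet-stream",
            "size": 11,
            "text": "binary\u0000data"
        }
    }
}
```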
Note the null bytes, encoded as Unicode escape sequences (`\u0000`).
EdgeDB's `json` type is backed by Postgres' `jsonb` type. And `jsonb` is a bit picky with what it accepts: `\u0000` gets rejected.
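Anything that wanted to keep using a `json` column would first have to detect these; a minimal sketch (illustrative only, not actual project code):

```ts
// Walk a parsed HAR and check whether any string value contains a NUL character,
// since Postgres' jsonb cannot store \u0000 and would reject the insert.
const containsNullByte = (value: unknown): boolean => {
    if (typeof value === 'string') return value.includes('\u0000');
    if (Array.isArray(value)) return value.some(containsNullByte);
    if (value && typeof value === 'object') return Object.values(value).some(containsNullByte);
    return false;
};
```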