tweaselORG / platform

Server for the tweasel.org platform, allowing users to analyse Android and iOS apps for data protection violations and send complaints about them to the data protection authorities.
MIT License

Not all of our HARs can be inserted into an EdgeDB JSON column #6

Closed baltpeter closed 5 months ago

baltpeter commented 5 months ago

Since HARs are just JSON with a spec, I was planning on storing our HARs in a JSON column. However, with the very first one I tried inserting, I got the following error:

InternalServerError: unsupported Unicode escape sequence

Here's an excerpt of the HAR that is the culprit:

[Screenshot: excerpt of the HAR showing a response body littered with `\u0000` escape sequences]

Note the raw null bytes encoded as Unicode escape sequences.

EdgeDB's json type is backed by Postgres' jsonb type.

And jsonb is a bit picky about what it accepts: \u0000 gets rejected.
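For context, `\u0000` is a perfectly legal escape in a JSON string per RFC 8259, and JavaScript's parser handles it without complaint; it's only Postgres' `jsonb` that refuses it, because Postgres text values cannot contain NUL bytes. A quick sketch:

```typescript
// A JSON string containing an escaped null byte is valid per RFC 8259…
const raw = '{"body": "PK\\u0003\\u0004\\u0000\\u0000"}';

// …and JSON.parse handles it fine: the escape becomes a real NUL character.
const parsed = JSON.parse(raw) as { body: string };
console.log(parsed.body.charCodeAt(2)); // 3
console.log(parsed.body.includes("\u0000")); // true

// Postgres' jsonb, however, stores strings in their decoded form and thus
// rejects \u0000, so inserting this document into an EdgeDB json column fails.
```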

baltpeter commented 5 months ago

Now, I guess we could try convincing mitmproxy that the HAR export should encode the body as base64 in cases with null bytes. But then we'd also have to keep that in mind if we ever switch to another HAR producer.
Or I guess we could manually coerce the HARs in CA. But all of that would be really annoying.

And, also, I really don't think it is… correct. At least in JS and Python land, having null bytes in JSON strings is perfectly fine. I feel like we should be able to handle those JSONs.
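For illustration, the "manually coerce" option would amount to something like the following sketch (hypothetical, and we decided against it, since it silently alters the recorded data):

```typescript
// Hypothetical sketch: recursively strip NUL characters from every string
// in a parsed HAR so the result becomes acceptable to Postgres' jsonb.
// Not what we ended up doing — it changes the data we claim to have recorded.
const stripNulls = (value: unknown): unknown => {
    if (typeof value === "string") return value.replace(/\u0000/g, "");
    if (Array.isArray(value)) return value.map(stripNulls);
    if (value !== null && typeof value === "object")
        return Object.fromEntries(
            Object.entries(value as Record<string, unknown>).map(([k, v]) => [k, stripNulls(v)])
        );
    return value;
};
```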

baltpeter commented 5 months ago

Perhaps the better way forward is to reconsider storing the HAR in a JSON column in the first place. We don't actually have any use for that. We are only storing the HAR in order to attach it to the complaints and other messages, but we never actually look inside it in the context of platform. That is only done by TrackHAR, which can handle these JSONs just fine.[^view-requests]

[^view-requests]: In the future, I do think it would be nice to publish the requests observed in analyses. However, I don't think we should reinvent that functionality in platform. Instead, we should dump the HARs into data.tweasel.org and link there.

Given that we are treating the HARs as a blob here, we should probably just also store them as such.

baltpeter commented 5 months ago

Actually, it's probably for the best anyway since I would expect jsonb not to preserve formatting (but we specify a hash of the HAR in the reports).
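That matters because jsonb normalizes the document on storage (key order, whitespace, duplicate keys), so serializing the stored value back out would generally not reproduce the original bytes, and the hash in the report would no longer match. A sketch of the failure mode (SHA-256 is assumed here for illustration; the actual hashing in platform may differ):

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// The HAR as produced, with its original formatting…
const original = '{ "log": { "version": "1.2" } }';
// …and the same document after a parse/re-serialize round trip, which is
// effectively what reading back from a jsonb-backed column would give us.
const roundTripped = JSON.stringify(JSON.parse(original));

console.log(sha256(original) === sha256(roundTripped)); // false
```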

baltpeter commented 5 months ago

The only thing I'm not quite certain about is whether we should store them as bytes or as a plain string.

I'm leaning towards string, since the HARs will definitely always be valid UTF-8 and bytes seem a little more annoying to deal with.
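Under that approach, the schema change would be something like the following EdgeDB SDL sketch (the type and property names are made up, not the actual platform schema):

```esdl
# Hypothetical sketch: store the HAR verbatim as a string instead of json
# or bytes. A size cap could later be added via EdgeDB's built-in
# max_len_value constraint (note it counts characters, not bytes).
type Analysis {
    required har: str;
}
```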

baltpeter commented 5 months ago

I also want to define a size limit—we obviously can't store arbitrarily large HARs.

The HARs from monkey-april-2023 range from 407 bytes (empty) to 403.8 MB. That's quite the range!

400 MB seems like way too much but I'm unsure as to what an appropriate cut-off would be.
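One way to ground the cut-off is to look at percentiles of the observed sizes rather than the maximum (hypothetical sketch; the sample values below are made up, not the actual monkey-april-2023 data):

```typescript
// Hypothetical sketch: pick a size cut-off as a high percentile of observed
// HAR sizes, using the nearest-rank method. Sample values are made up.
const percentile = (values: number[], p: number): number => {
    const sorted = [...values].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, rank)];
};

const sizesInBytes = [407, 1_024, 900, 2_048, 15_000, 1_200, 403_800_000];
console.log(percentile(sizesInBytes, 50)); // 1200
```

With a heavily skewed distribution like this one, a single outlier barely moves the high percentiles, which is why they make a better basis for a limit than the maximum.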

baltpeter commented 5 months ago

I was quite intimidated by the huge files but GPT4o helped me put this into perspective:

[Screenshots: ChatGPT conversation putting the file sizes into perspective]

baltpeter commented 5 months ago

50 MB strikes me as a decent limit: That allows for 99% of everything we've seen so far. And since the median is ~1 KB, I'm not too worried about how much storage we're going to need for the extremes. We can always adjust later if we find the limit doesn't work well for us.
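Enforcing that at the application layer before insertion is straightforward (sketch; the 50 MB figure is the one settled on above, and the byte length is measured on the UTF-8 encoding, since JS string length counts UTF-16 code units rather than bytes):

```typescript
// Sketch: reject HARs above the agreed 50 MB cap before inserting them.
// Measure UTF-8 byte length, not string length — for non-ASCII content
// the two differ.
const MAX_HAR_BYTES = 50 * 1024 * 1024;

const harFitsLimit = (har: string): boolean =>
    Buffer.byteLength(har, "utf8") <= MAX_HAR_BYTES;

console.log(harFitsLimit('{"log":{"entries":[]}}')); // true
```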