matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Improve API/event validation in synapse #8445

Open richvdh opened 3 years ago

richvdh commented 3 years ago

Background

We've recently encountered a number of bugs in which malformed (or at least, unexpected) data has caused Synapse or clients to misbehave in some way. These bugs stem from the fact that, faced with a given data structure, you cannot rely on it having the expected format. For example:

Such bugs are disruptive, and in extreme cases could progress beyond "denial of service" to "security threat", and it might be possible to avoid them by validating data at the point of entry to Synapse. We've recently discussed this in some depth within the core Synapse team; this issue serves to record some of our thoughts on the topic, including promising areas for further development.

Introduction

There are actually multiple reasons why it would be useful to improve validation of data within Synapse. These include:

There are multiple related, but different, things we mean when we talk about validation. At a high level, these can be broken down into:

Let's consider these in turn.

API validation

Given that we already have JSON schema specifications for our APIs, this is theoretically relatively straightforward (see, for example https://github.com/matrix-org/synapse/commit/c39941cd6e3727d5571d01bc6f72d7d439f41d13, which is a proof-of-concept applying this to a single endpoint), though it's certain to uncover a large number of places where clients are relying on non-spec-compliant behaviour. We also have to be wary, especially on the SS API, to ensure backwards compatibility.
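To make the shape of this concrete, here is a minimal sketch (not the proof-of-concept's actual code, and with an illustrative schema rather than a real CS API one) of rejecting a malformed request body at the point of entry, so it surfaces as a 400 instead of a 500 deeper in the stack:

```python
# Sketch: validate an incoming request body against a JSON schema before
# any handler code touches it. The schema below is illustrative only.
import jsonschema

DISPLAYNAME_SCHEMA = {
    "type": "object",
    "properties": {"displayname": {"type": "string", "maxLength": 256}},
    "required": ["displayname"],
}

def validate_body(body: dict) -> None:
    """Raise an error suitable for a 400 response if the body is malformed."""
    try:
        jsonschema.validate(body, DISPLAYNAME_SCHEMA)
    except jsonschema.ValidationError as e:
        # In Synapse this would become a SynapseError(400, ...)
        raise ValueError(f"Bad request: {e.message}")
```

The appeal is that the schema is declarative and can (in principle) be derived from the spec's own schemas, rather than hand-rolling `isinstance` checks per endpoint.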

Doing this would certainly reduce the number of false 500 errors, which as above brings a number of advantages. It might also reduce the occurrences of bugs due to bad events (e.g. non-string display names) since updated Synapses will correctly reject them on the CS API; however, it will absolutely not fix those bugs, since such malformed data could still be received from buggy or malicious servers over federation.

Event validation

Validating events, particularly those received over federation, is quite hard as:

  1. you need to decide what to do with events that fail validation,
  2. retaining backwards compatibility means it’s hard to reject/drop/decline to handle events that fail strict validation.

For example, an event with malformed m.relates_to data can't simply be dropped: according to the authentication rules of all room versions to date it is a valid event, because annotations were added after the current room versions were specced, even though its payload (which Synapse still interprets) is invalid.

The main problem we have currently is that events contain a mix of properties which are validated on receipt (auth_events, prev_events, etc) alongside a bunch of untrusted data that we cannot assume nor assert the types of. Then, when we come to access event data, we need to remember to add checks for any untrusted data. While room versions allow us to add stricter schemas, relying on that approach will always be a case of playing catchup as we’ll want to use new features in older room versions where possible. (See also MSC2801 on this subject.)

Ideally, therefore, we'd try to add some tooling to make it easier to statically assert (or at least tell easily in the code) whether the fields you're accessing on events (and other data types) have been validated or not. Hopefully it is possible to do something with mypy here, though it will require some experimentation.
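One direction such an experiment could take (purely speculative, not an agreed design) is to give untrusted wire data and validated values distinct static types, so that a code path which forgot to validate fails type-checking rather than failing at runtime:

```python
# Sketch: a NewType that can only be produced by a runtime check, so mypy
# can tell validated strings apart from raw JSON values. All names here
# are hypothetical.
from typing import Dict, NewType

JsonDict = Dict[str, object]  # untrusted wire data: values could be anything
ValidatedStr = NewType("ValidatedStr", str)  # a str that passed a runtime check

def check_string(content: JsonDict, key: str) -> ValidatedStr:
    """Runtime check that also produces the static 'validated' type."""
    value = content.get(key)
    if not isinstance(value, str):
        raise TypeError(f"{key} must be a string, got {type(value).__name__}")
    return ValidatedStr(value)

def set_display_name(name: ValidatedStr) -> str:
    # Callers must go through check_string(): passing content["displayname"]
    # directly is a mypy error, since a plain object is not a ValidatedStr.
    return name.strip()
```

The point is that the only way to obtain a `ValidatedStr` is via `check_string`, so "did we remember to validate?" becomes a question mypy can answer.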

Conclusions

In summary there are a few paths going forward:

  1. validate schemas of APIs;
  2. validate a basic schema for events (and other data types), potentially using room versions to allow us to enforce stricter schemas while retaining backwards compatibility; and
  3. experiment with mypy and other tooling (as well as code changes) to try and highlight the difference between validated and unvalidated data in events, and to enforce correct validating when using untrusted data.

Both 1 and 2 would be good things to do, and may reduce how often these bugs occur, but won't actually fix the class of bugs we're seeing due to unexpected formats of various keys in events. Paths 1 and 2 are things we know how to do; path 3 has far more unknowns attached to it, but is ultimately the only option that will fully prevent this class of bugs.

ShadowJonathan commented 3 years ago

Small note about 3: mypy could be a good fit for enforcing internal consistency of data access after validation checks have passed, by annotating return types with TypedDict or @attr.s classes; other areas of the program can then work with data structures that are guaranteed validated and enforced. Also, after validation, the passed data structure (though not necessarily its content) should be immutable, only transformed to database-ready types or dicts (for example) for serialisation. (#8376 could be relevant to this point)
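A minimal sketch of that shape (using the stdlib's frozen dataclass, which behaves the same way as `@attr.s(frozen=True)` for this purpose; the event type and names are illustrative):

```python
# Sketch: validate once at the boundary, then hand the rest of the program
# an immutable, well-typed structure instead of a mutable dict.
from dataclasses import dataclass

@dataclass(frozen=True)
class RoomNameContent:
    name: str

def parse_room_name(content: dict) -> RoomNameContent:
    """Validate m.room.name content and return an immutable structure."""
    name = content.get("name")
    if not isinstance(name, str):
        raise ValueError("name must be a string")
    return RoomNameContent(name=name)
```

Any attempt to mutate the result (e.g. `parsed.name = "x"`) raises `FrozenInstanceError`, so downstream code cannot accidentally un-validate the data.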

Additionally: should we use 400 when schema validation fails, and 422 (Unprocessable Entity) when event validation fails? Should that be reflected back into the spec?

(422 according to the mozilla documentation: "The HyperText Transfer Protocol (HTTP) 422 Unprocessable Entity response status code indicates that the server understands the content type of the request entity, and the syntax of the request entity is correct, but it was unable to process the contained instructions.")

bmarty commented 3 years ago

I will not create a dedicated issue, but for instance the Synapse CS API will accept a state event from a client with an un-specified value: for instance, "foo" for the "join_rule" field of https://matrix.org/docs/spec/client_server/r0.6.1#m-room-join-rules

This breaks compatibility on some clients (e.g. on Element Android this event is ignored and not displayed in the timeline), and I'm not sure what happens server side regarding the room properties, nor what fallback we should display to the user in this case.

At least on the CS API I would expect to get a 400 error when trying to send such a malformed event.
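A sketch of the check being asked for here, with the `join_rule` values taken from the r0.6.1 spec linked above (the error shape is illustrative, not Synapse's actual code):

```python
# Sketch: reject m.room.join_rules state events whose join_rule is not one
# of the values the spec defines.
KNOWN_JOIN_RULES = {"public", "knock", "invite", "private"}

def check_join_rules_event(content: dict) -> None:
    """Raise an error suitable for a 400 if join_rule is unspecced."""
    join_rule = content.get("join_rule")
    if join_rule not in KNOWN_JOIN_RULES:
        # In Synapse this would be: raise SynapseError(400, ...)
        raise ValueError(f"Unknown join_rule: {join_rule!r}")
```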

ShadowJonathan commented 3 years ago

I'd also like to mention pydantic here as an interesting option, as it provides speedy (C-extension-backed) validation of objects.
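For illustration, the pydantic approach might look like this (the model fields here are just an example, not a proposed schema):

```python
# Sketch: a pydantic model as the validation boundary; construction either
# yields a well-typed object or raises ValidationError.
from typing import Optional
from pydantic import BaseModel, ValidationError

class RelatesTo(BaseModel):
    rel_type: str
    event_id: str

def parse_relates_to(content: dict) -> Optional[RelatesTo]:
    """Return a validated model, or None if the payload is malformed."""
    try:
        return RelatesTo(**content)
    except ValidationError:
        return None
```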

clokep commented 2 years ago

> (see, for example c39941c, which is a proof-of-concept applying this to a single endpoint)

I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower). I had put together a benchmark script at https://gist.github.com/clokep/17064b16a9b471d09feb3e61851886af

Results from running this on my 2016 MBP with Python 3.9.6 (times in nanoseconds):

| Test    | Manual | JSON Schema |
|---------|--------|-------------|
| Empty   | 305    | 14200       |
| Error   | 309    | 39800       |
| Correct | 294    | 22600       |
Raw results:

```
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (3.38 us) is 24% of the mean (14.2 us)
* the maximum (31.5 us) is 121% greater than the mean (14.2 us)
Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.
json_schema_empty: Mean +- std dev: 14.2 us +- 3.4 us
.....................
manual_empty: Mean +- std dev: 305 ns +- 21 ns
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (5.16 us) is 13% of the mean (39.8 us)
Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.
json_schema_wrong_type: Mean +- std dev: 39.8 us +- 5.2 us
.....................
WARNING: the benchmark result may be unstable
* the maximum (475 ns) is 54% greater than the mean (309 ns)
Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.
manual_wrong_type: Mean +- std dev: 309 ns +- 27 ns
.....................
json_schema_good: Mean +- std dev: 22.6 us +- 1.4 us
.....................
manual_good: Mean +- std dev: 294 ns +- 11 ns
```
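The linked gist uses pyperf; a rough stdlib-flavoured version of the same "manual vs JSON schema" comparison (with an illustrative schema, not one of Synapse's real ones) looks like:

```python
# Sketch: compare a hand-rolled isinstance check against jsonschema
# validation of the same (trivial) body. Absolute numbers will vary.
import timeit
import jsonschema

SCHEMA = {
    "type": "object",
    "properties": {"displayname": {"type": "string"}},
    "required": ["displayname"],
}
BODY = {"displayname": "Alice"}

def manual_check(body: dict) -> bool:
    return isinstance(body.get("displayname"), str)

def schema_check(body: dict) -> None:
    jsonschema.validate(body, SCHEMA)

t_manual = timeit.timeit(lambda: manual_check(BODY), number=10_000)
t_schema = timeit.timeit(lambda: schema_check(BODY), number=10_000)
print(f"manual: {t_manual:.4f}s  jsonschema: {t_schema:.4f}s")
```

Note that `jsonschema.validate` rebuilds the validator on every call; pre-compiling once with `jsonschema.Draft7Validator(SCHEMA)` and reusing it recovers some of that overhead, which may matter for any real deployment of this idea.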
DMRobertson commented 2 years ago

> I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower).

@clokep I think your benchmark just does the validation and nothing else? I'd be interested to see how those numbers compare when they're part of request handling as a whole.

clokep commented 2 years ago

> > I had done some testing with this and found it to be quite a bit slower (from memory about 100x slower).
>
> @clokep I think your benchmark just does the validation and nothing else? I'd be interested to see how those numbers compare when they're part of request handling as a whole.

Yes, that's what I was attempting to benchmark. 😄 (It's also completely fake -- the schemas of most of our endpoints are much more complicated.) As I said in chat, I didn't spend much time on this; I was trying to get a gut feeling for whether it would make sense to sink a lot more time into it or not.

Kay-kay2019 commented 2 years ago

Working it

jaywink commented 2 years ago

One case I ran into - Synapse accepts an object in the `allow` field of join rules, when it should be an array of objects.
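A sketch of the missing check (the function name and error shape are illustrative, not Synapse's actual code):

```python
# Sketch: reject join rules content whose `allow` key is not an array of
# objects, per the restricted-rooms shape described above.
def check_allow(content: dict) -> None:
    allow = content.get("allow")
    if allow is None:
        return
    if not isinstance(allow, list) or not all(isinstance(x, dict) for x in allow):
        # In Synapse this would be a SynapseError(400, ...)
        raise ValueError("allow must be an array of objects")
```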