python-websockets / websockets

Library for building WebSocket servers and clients in Python
https://websockets.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
5.23k stars 519 forks source link

UTF-8 decoding errors aren't handled properly #1523

Closed aaugustin closed 1 month ago

aaugustin commented 1 month ago

All implementations store incoming messages to a queue; then the application reads from this queue.

In the legacy asyncio, UTF-8 decoding happens in WebSocketCommonProtocol.read_message, before storing messages in the queue. As a consequence, if websockets receives a message containing invalid UTF-8 — which is vanishingly uncommon these days — a UnicodeDecodeError is raised and websockets immediately closes the connection with code 1002.

The new asyncio implementation stores messages as-is; UTF-8 decoding happens in Assembler.get. In case of invalid UTF-8, an UnicodeDecodeError is raised when the application calls recv; websockets keeps the connection open and continues happily, when it should close it.

The new implementation has two benefits and one downside:

  1. lower memory use: UTF-8 decoding can increase memory usage, depending on the contents (no change for ASCII);
  2. allows returning the bytestring: this saves a decoding-encoding round-trip when the application wants bytes;
  3. increased latency: UTF-8 decoding happens only when the application fetches the message, while it could be done earlier; this makes a difference only when a message is already available before the application reads, which is the "run with buffers always full" mode — not a good idea. Overall, it's a win, so let's keep it that way, detect decoding errors, and fail the connection in that case.

Also, while Frame.parse is documented to raise UnicodeDecodeError, it doesn't look like this can happen. Error handling in Protocol.parse is for the case when recv_frame raises UnicodeDecodeError.

aaugustin commented 1 month ago

Once this is fixed, many cases in section 6 in the Autobahn test suite should result in code 1002 rather than 1000.

aaugustin commented 1 month ago

Aligning the behavior of the threading implementation to the asyncio implementation (which I wanted to do anyway) will make it easier to get this right for both implementations.