sovity / edc-ce

sovity Community Edition EDC
https://sovity.de/en/connect-to-data-space-en/
Apache License 2.0

Investigate input of overlong UTF-8 sequences #932

Open jridderbusch opened 6 months ago

jridderbusch commented 6 months ago

Enhancement

Description

Investigate the behavior when input contains overlong UTF-8 sequences, and check whether string validation can be bypassed. This should be fine, since Java converts all UTF-8 to UTF-16 before exposing it as strings, but I am not sure whether the JSON parser reads the UTF-8 stream directly.
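As a starting point for the investigation, the following minimal sketch shows what the standard String constructor does with an overlong sequence. The class and method names are hypothetical and not taken from the edc-ce code base. It relies only on the documented behavior of String(byte[], Charset), which replaces malformed input with the charset's replacement character, and it says nothing about how a JSON parser that reads the byte stream directly would behave.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical probe: how does plain String decoding treat the overlong space 0xC0 0xA0? */
final class StringConversionSketch {

    /** Decodes leniently: malformed bytes are replaced instead of raising an error. */
    static String decodeLeniently(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] overlongSpace = {(byte) 0xC0, (byte) 0xA0};
        String decoded = decodeLeniently(overlongSpace);

        // Print the resulting UTF-16 code units so the outcome can be inspected.
        for (char c : decoded.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        System.out.println("contains space: " + (decoded.indexOf(' ') >= 0));
    }
}
```

If the decoded string contains U+FFFD rather than a space, the overlong form was not silently normalized on this path, though the replacement character itself may still need handling in validation.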

Stakeholders

@sybereal

Solution Proposal and Work Breakdown

illfixit commented 6 months ago

We have standard Angular validators for the form fields. They seem to be well tested and handle such symbols correctly.

sybereal commented 6 months ago

I believe there may have been a misunderstanding here.

UTF-8's design theoretically allows code points to be represented in different ways. Overlong UTF-8 sequences use more bytes than strictly required, while still decoding to the same code point. For example, the ASCII space character (U+0020) is normally encoded as the single byte 0x20. However, if you follow the basic UTF-8 decoding rules without checking for the shortest form, decoding 0xc0 0xa0 also gives you U+0020 back.[1][2]
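The arithmetic behind this example can be sketched as follows. This is an illustration only, with hypothetical names, and it applies just the two-byte bit layout (110xxxxx 10yyyyyy); it is not a complete UTF-8 decoder.

```java
/** Sketch: why 0xC0 0xA0 maps to U+0020 when the shortest-form check is skipped. */
final class OverlongTwoByteSketch {

    /** Applies only the two-byte bit layout, with no validity checks at all. */
    static int naiveDecodeTwoByte(int lead, int continuation) {
        // Five payload bits from the lead byte, six from the continuation byte.
        return ((lead & 0x1F) << 6) | (continuation & 0x3F);
    }

    /** A two-byte sequence is overlong if its code point would fit into a single byte. */
    static boolean isOverlongTwoByte(int lead, int continuation) {
        return naiveDecodeTwoByte(lead, continuation) < 0x80;
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", naiveDecodeTwoByte(0xC0, 0xA0)); // prints U+0020
        System.out.println(isOverlongTwoByte(0xC0, 0xA0));             // prints true
        System.out.println(isOverlongTwoByte(0xC3, 0xA9));             // prints false (U+00E9)
    }
}
```

A decoder that includes the shortest-form check rejects every two-byte sequence whose lead byte is 0xC0 or 0xC1, because those can only encode code points below U+0080.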

The concern is that, if software operates directly on UTF-8-encoded strings, such encodings could potentially be used to bypass validation checks. In the above case of the space character, a validation that checks if a certain input does not contain whitespace may naively look only for the byte 0x20, which can cause it to miss certain occurrences if input is not normalized beforehand.
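To make the concern concrete, here is a minimal sketch that contrasts a naive byte-level check with a check that first decodes the input strictly. All names are hypothetical and do not come from the edc-ce or Core EDC code. The strict variant uses the standard CharsetDecoder with CodingErrorAction.REPORT, so malformed input is rejected instead of being replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Sketch: a byte-level whitespace check versus a check on strictly decoded text. */
final class WhitespaceValidationSketch {

    /** Naive check: looks only for the raw byte 0x20 and so misses an overlong space. */
    static boolean containsSpaceByte(byte[] raw) {
        for (byte b : raw) {
            if (b == 0x20) {
                return true;
            }
        }
        return false;
    }

    /** Decodes strictly first; input that is not well-formed UTF-8 is rejected. */
    static boolean containsWhitespace(byte[] raw) {
        try {
            String text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
            return text.chars().anyMatch(Character::isWhitespace);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not well-formed UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = {'a', (byte) 0xC0, (byte) 0xA0, 'b'};
        System.out.println(containsSpaceByte(payload)); // prints false: the byte 0x20 never occurs
        try {
            containsWhitespace(payload);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting malformed input outright is one possible design choice; whether the upstream and custom code paths in question decode strictly, leniently, or not at all is exactly what this issue asks to investigate.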

Since this concerns input validation, I believe it is a backend issue, rather than (just) a frontend issue.

illfixit commented 6 months ago

Footnotes

  1. https://stackoverflow.com/a/7113150
  2. https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings

Thank you for the information!

SebastianOpriel commented 2 weeks ago

Is this really an issue for our repo, or should it be addressed in Core EDC? Cc @efiege

sybereal commented 4 days ago

Both, since we would have to investigate the behavior of both upstream and our custom code.