pallets / werkzeug

The comprehensive WSGI web application library.
https://werkzeug.palletsprojects.com
BSD 3-Clause "New" or "Revised" License
6.66k stars 1.74k forks source link

Allow customization of decode error handling in MultiPartParser #3009

Open glumia opened 6 days ago

glumia commented 6 days ago

Hello werkzeug team!

I would like to get some feedback on the possibility to rework a bitFormDataParser and MultiPartParser so that one can easily configure the decode error handling strategy that's currently hard-coded as "replace" here.

The reason I come with this proposal is that at work we integrate with a partner that sends us multipart/form-data requests that do not carry any encoding information for the fields of the payload. Instead, they add another field in the payload itself that contains the charset information for the others (see here for an example).[^1]

This forces us to do a first pass where we parse all the content of the request and then, when we got the charset information, we re-encode all the fields that were not encoded in utf-8 and decode them with the proper charset.[^2]

The problem is that if the error handling strategy is set to "replace", we get some data loss for any characters that utf-8 isn't able to decode.

What we need in this case is to use "surrogateescape", and to achieve that right now we're forced to reimplement the whole FormDataParser._parse_multipart and MultiPartParser.parse for what's essentially a one-line change.


Eventually, these are roughly the changes I had in my mind (the diff is with 7868bef, note that we would need to move the declaration of FormParser after MultiPartParser but I didn't do it here to not bloat the diff):

diff --git a/src/werkzeug/formparser.py b/src/werkzeug/formparser.py
index 01034149..591a9317 100644
--- a/src/werkzeug/formparser.py
+++ b/src/werkzeug/formparser.py
@@ -167,6 +167,8 @@ class FormDataParser:
     .. versionadded:: 0.8
     """

+    multipart_parser_cls = MultiPartParser
+
     def __init__(
         self,
         stream_factory: TStreamFactory | None = None,
@@ -253,7 +255,7 @@ class FormDataParser:
         content_length: int | None,
         options: dict[str, str],
     ) -> t_parse_result:
-        parser = MultiPartParser(
+        parser = self.multipart_parser_cls(
             stream_factory=self.stream_factory,
             max_form_memory_size=self.max_form_memory_size,
             max_form_parts=self.max_form_parts,
@@ -290,6 +292,8 @@ class FormDataParser:

 class MultiPartParser:
+    decode_error_handling = "replace"
+
     def __init__(
         self,
         stream_factory: TStreamFactory | None = None,
@@ -392,7 +396,8 @@ class MultiPartParser:
                     if not event.more_data:
                         if isinstance(current_part, Field):
                             value = b"".join(container).decode(
-                                self.get_part_charset(current_part.headers), "replace"
+                                self.get_part_charset(current_part.headers),
+                                self.decode_errors_handling,
                             )
                             fields.append((current_part.name, value))
                         else:

That would allow one to easily define a custom form data parser that's identic in behavior to FormDataParser besides the decode error handling for multipart/form-data requests.

For example:

from werkzeug.formparser import FormDataParser, MultiPartParser

class SurrogateEscapeMultiPartParser(MultiPartParser):
    decode_errors_handling = "surrogateescape"

class SurrogateEscapeFormDataParser(FormDataParser):
    multipart_parser_cls = SurrogateEscapeMultiPartParser

What do you think?

[^1]: That's obviously nonsense; unfortunately we have no choice but to deal with it as they'll not be fixing this anytime soon. [^2]: We could probably write a custom parser that looks only for that charset field, decodes it and then does another pass on the payload using that information, but that would be a lot of added complexity to maintain for little benefit (the perf hit of re-encoding/decoding isn't an issue in this case).

glumia commented 6 days ago

I've seen now that this was possible before v3 with the encoding_errors parameter on the Request class, but that it was then removed as part of #2602.