jjlee / mechanize

Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize .
http://wwwsearch.sourceforge.net/mechanize/

Incorrect textarea CRLF normalization when parsing chunks. #112

Open simpoir opened 8 years ago

simpoir commented 8 years ago

The current behavior is to normalize an isolated CR or LF to CRLF when it appears inside a textarea. However, the chunked parser normalizes each chunk of data separately, so a perfectly valid CRLF that straddles a chunk boundary is treated as a lone CR followed by a lone LF and becomes two CRLFs. Here is sample code illustrating the issue by aligning the CRLF with the CHUNK limit of 1024.

>>> from StringIO import StringIO  # Python 2
>>> import mechanize
>>> # erroneous normalization of CRLF in textarea on CHUNK size boundary
>>> # (padding to width 1023 puts "\r" at the end of the first 1024-character chunk)
>>> doc = "{:>1023}\r\nbar</textarea></form></html>".format("<html><form><textarea>foo")
>>> forms = mechanize.ParseFile(StringIO(doc), "http://localhost/")
>>> forms[0].controls[0].value
'foo\r\n\r\nbar'
>>> # standard and expected parsing
>>> doc = "{}\r\nbar</textarea></form></html>".format("<html><form><textarea>foo")
>>> forms = mechanize.ParseFile(StringIO(doc), "http://localhost/")
>>> forms[0].controls[0].value
'foo\r\nbar'

The issue can easily be fixed by doing the normalization once the end tag is reached, when the whole textarea value is available, instead of on each incomplete chunk of data.

@@ -533,7 +533,10 @@
             raise ParseError("end of TEXTAREA before start")
         controls = self._current_form[2]
         name = self._textarea.get("name")
+        value = self._textarea.get("value")
+        if value:
+            self._textarea["value"] = normalize_line_endings(value)
         controls.append(("textarea", name, self._textarea))
         self._textarea = None

     def start_label(self, attrs):
@@ -580,7 +583,6 @@
         elif self._textarea is not None:
             map = self._textarea
             key = "value"
-            data = normalize_line_endings(data)
         # not if within option or textarea
         elif self._current_label is not None:
             map = self._current_label
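
For anyone who wants to see the effect without mechanize installed, here is a minimal standalone sketch of the two orderings. It is not the library's code: `normalize_line_endings` below is a stand-in written for this example (lone CR or lone LF becomes CRLF), and `chunks`, `normalize_per_chunk` and `normalize_at_end` are hypothetical helper names. The chunk size of 1024 is taken from the report above.

```python
import re

CHUNK = 1024  # assumed chunk size, taken from the report above


def normalize_line_endings(text):
    # Stand-in for the library helper: a lone CR or a lone LF becomes CRLF,
    # an existing CRLF pair is left alone.
    return re.sub(r"(?:(?<!\r)\n)|(?:\r(?!\n))", "\r\n", text)


def chunks(text, size=CHUNK):
    # Yield the text in fixed-size pieces, as a chunked reader would.
    for start in range(0, len(text), size):
        yield text[start:start + size]


def normalize_per_chunk(text):
    # Problematic order: each piece is normalized before the pieces are joined,
    # so a CRLF split across two pieces is seen as a lone CR and a lone LF.
    return "".join(normalize_line_endings(piece) for piece in chunks(text))


def normalize_at_end(text):
    # Proposed order: join the pieces first, then normalize once.
    return normalize_line_endings("".join(chunks(text)))


# Pad so that "\r" is the last character of the first chunk
# and "\n" is the first character of the second.
value = "{:>1023}\r\nbar".format("foo")

print(repr(normalize_per_chunk(value).lstrip()))  # 'foo\r\n\r\nbar'
print(repr(normalize_at_end(value).lstrip()))     # 'foo\r\nbar'
```

The second ordering is what the patch above aims for: the data handler only accumulates text, and normalization runs once in the textarea end-tag handler.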