papnkukn / eml-format

RFC 822 EML file format parser and builder
MIT License
88 stars, 53 forks

Extend `unquotePrintable` function to support 4-byte Unicode characters and concatenated sequences #33

Open joaoaugustogrobe opened 1 year ago

joaoaugustogrobe commented 1 year ago

The current `unquotePrintable` function does not correctly decode 4-byte UTF-8 sequences, and it mis-parses input in which several multi-byte characters appear back to back. For example, `=C9=91=E2=8D=BA` encodes ɑ⍺ (a 2-byte sequence followed by a 3-byte sequence), but the function returns ɑ��� instead.

To resolve this issue, we need to:

  1. Extend the function to support 4-byte UTF-8 sequences (code points above U+FFFF).
  2. Enable the function to correctly handle multiple concatenated multi-byte characters.

A possible solution is to use the first byte of each UTF-8 sequence to determine how many bytes the character occupies, as described in the IBM documentation. In UTF-8, a lead byte of the form 0xxxxxxx starts a 1-byte sequence, 110xxxxx a 2-byte sequence, 1110xxxx a 3-byte sequence, and 11110xxx a 4-byte sequence. We can implement a recursive helper function that takes the entire byte sequence, determines the length of the next character from its first byte, decodes that character, and then calls itself on the remaining bytes.
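To make the idea concrete, here is a minimal sketch of lead-byte-driven decoding. The function names (`utf8SequenceLength`, `quotedPrintableToBytes`, `decodeUtf8Bytes`, `unquotePrintableUtf8`) are hypothetical and are not part of eml-format's API. It also uses a loop rather than the recursion described above, on the assumption that very long bodies could otherwise exhaust the call stack. It does not reject overlong encodings, so treat it as an illustration rather than a drop-in patch.

```javascript
// Hypothetical helpers for illustration; not eml-format's actual code.

// Length of a UTF-8 sequence judged from its lead byte (0 = not a lead byte).
function utf8SequenceLength(leadByte) {
  if ((leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((leadByte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((leadByte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((leadByte & 0xf8) === 0xf0) return 4; // 11110xxx
  return 0; // continuation byte or invalid lead byte
}

// Turn quoted-printable text into an array of byte values.
function quotedPrintableToBytes(input) {
  const text = input.replace(/=\r?\n/g, ""); // drop soft line breaks
  const bytes = [];
  for (let i = 0; i < text.length; i++) {
    const hex = text.slice(i + 1, i + 3);
    if (text[i] === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(text.charCodeAt(i) & 0xff); // assumes literal characters are ASCII
    }
  }
  return bytes;
}

// Decode bytes as UTF-8, one sequence at a time.
function decodeUtf8Bytes(bytes) {
  let out = "";
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    const length = utf8SequenceLength(lead);
    // Keep only the payload bits of the lead byte.
    let codePoint = length === 1 ? lead : lead & (0xff >> (length + 1));
    let valid = length > 0 && i + length <= bytes.length;
    for (let k = 1; valid && k < length; k++) {
      const next = bytes[i + k];
      if ((next & 0xc0) !== 0x80) valid = false; // must be 10xxxxxx
      else codePoint = (codePoint << 6) | (next & 0x3f);
    }
    if (valid && codePoint <= 0x10ffff) {
      out += String.fromCodePoint(codePoint);
      i += length;
    } else {
      out += "\uFFFD"; // replacement character, then resync on the next byte
      i += 1;
    }
  }
  return out;
}

function unquotePrintableUtf8(input) {
  return decodeUtf8Bytes(quotedPrintableToBytes(input));
}

console.log(unquotePrintableUtf8("=C9=91=E2=8D=BA")); // ɑ⍺
console.log(unquotePrintableUtf8("=F0=9F=98=80"));    // 😀 (4-byte sequence)
```

An alternative that avoids hand-written decoding is to collect the bytes first and pass them to the built-in decoder, for example `new TextDecoder("utf-8").decode(Uint8Array.from(bytes))`, which handles sequence lengths and malformed input itself.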

This enhancement will ensure that the unquotePrintable function properly handles various Unicode character sequences, allowing for more accurate parsing and processing of text data.