How do I skip over invalid UTF-16 string?

jengelh commented 3 years ago

I have a somewhat broken PST file here where a 0x3001001F property has a byte sequence \x3D\xD8 in it, which gets rightfully rejected by libuna (>4932) just as it is by /usr/bin/iconv. But unlike iconv where I can pass -t UTF-8//IGNORE, how can I tell libuna to skip over unconvertible sequences rather than aborting?

(gdb) bt
#0  libuna_unicode_character_copy_from_utf16_stream (unicode_character=0x7fffffffd8ac, utf16_stream=<optimized out>, 
    utf16_stream_size=<optimized out>, utf16_stream_index=0x7fffffffd8b0, byte_order=<optimized out>, error=0x0)
    at libuna_unicode_character.c:4932
#1  0x00007ffff734bfe1 in libuna_utf8_string_size_from_utf16_stream (utf16_stream=utf16_stream@entry=0x485c20 "=\330P", 
    utf16_stream_size=utf16_stream_size@entry=40, byte_order=byte_order@entry=108, 
    utf8_string_size=utf8_string_size@entry=0x7fffffffd988, error=error@entry=0x0) at libuna_utf8_string.c:1871
#2  0x00007ffff7f2fbab in libpff_mapi_value_get_data_as_utf8_string_size (error=0x0, utf8_string_size=0x7fffffffd988, 
    ascii_codepage=<optimized out>, value_data_size=40, value_data=0x485c20 "=\330P", value_type=<optimized out>)
    at libpff_mapi_value.c:155
#3  libpff_mapi_value_get_data_as_utf8_string_size (value_type=<optimized out>, value_data=0x485c20 "=\330P.........", value_data_size=40, 
    ascii_codepage=<optimized out>, utf8_string_size=0x7fffffffd988, error=0x0) at libpff_mapi_value.c:90
#4  0x00007ffff7f414b2 in libpff_record_entry_get_data_as_utf8_string_size (record_entry=<optimized out>, 
    utf8_string_size=<optimized out>, error=0x0) at libpff_record_entry.c:1868

4927                    /* Determine if the UTF-16 character is within the low surrogate range
4928                     */
4929                    if( ( utf16_surrogate < LIBUNA_UNICODE_SURROGATE_LOW_RANGE_START )
4930                     || ( utf16_surrogate > LIBUNA_UNICODE_SURROGATE_LOW_RANGE_END ) )
4931                    {
4932>                           libcerror_error_set(
4933                             error,
4934                             LIBCERROR_ERROR_DOMAIN_RUNTIME,
4935                             LIBCERROR_RUNTIME_ERROR_UNSUPPORTED_VALUE,
4936                             "%s: unsupported low surrogate UTF-16 character.",

joachimmetz commented 3 years ago

You currently cannot tell libuna to skip over nonconvertible sequences. The reason for this is that from a data format analysis perspective you don't want to silently skip such errors.

Also, can you tell me more about this PST file, to rule out some older version of the format that may be using UCS-2 instead of UTF-16?

jengelh commented 3 years ago

> The reason for this is that from a data format analysis perspective you don't want to silently skip such errors.

Certainly; however, after the “investigative part”, when truncation is OK, it feels odd to have to spin up iconv/ICU and do a UTF-16 -> UTF-8//IGNORE conversion; I'd prefer to just reuse the conversion from pff/una, since it's already a dependency.

> Also can you tell me more about this PST file to rule out some older version of the format maybe using UCS-2 instead of UTF-16

All I know is that this PST was generated with Outlook 2010 (SP3?). \xD8\x3D is not a good codepoint even in UCS-2. It is possible that the data store already has had those bytes and Outlook just passed it on when it exported to PST.
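For readers who only need the lossy conversion, here is a minimal Python sketch of the iconv `//IGNORE` behaviour using the standard codec error handlers. It is not libuna or libpff functionality, and it assumes the raw UTF-16 bytes are already in hand. The sample bytes are modelled on the backtrace (`"=\330P"`); the trailing `0x00` after `P` is an assumption, and little-endian is assumed because the backtrace shows `byte_order=108`, which is ASCII `l`. The helper name `decode_utf16le_lenient` is hypothetical.

```python
# 0x3D 0xD8 is the UTF-16LE code unit 0xD83D, a high surrogate. It is
# followed by "P" (0x0050, trailing 0x00 assumed), which is not a low
# surrogate, so the pair is invalid.
raw = b"\x3d\xd8P\x00"


def decode_utf16le_lenient(data, errors="ignore"):
    """Decode UTF-16LE, dropping ("ignore") or marking ("replace") bad code units."""
    return data.decode("utf-16-le", errors=errors)


try:
    raw.decode("utf-16-le")  # strict mode rejects the lone surrogate
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = decode_utf16le_lenient(raw)               # bad code unit is dropped
replaced = decode_utf16le_lenient(raw, "replace")   # bad code unit becomes U+FFFD
```

Using `"replace"` rather than `"ignore"` keeps a visible marker where data was lost, which may be preferable when the output is later reviewed.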

NigelPearson commented 1 year ago

I also have some nasty .pst files that are triggering a similar issue:

$ anaconda3/bin/python ./extract.py
Writing messages to /Users/markmail/Desktop/Outlook Files/Backup
Number of messages: 4647
Number of messages: 3296
Traceback (most recent call last):
  File "/Users/markmail/./extract.py", line 25, in <module>
    Subj = message.subject
OSError: pypff_message_get_subject: unable to retrieve subject size.
libuna_unicode_character_copy_from_utf16_stream: unsupported UTF-16 character.
libuna_utf8_string_size_from_utf16_stream: unable to copy Unicode character from UTF-16 stream.
libpff_mapi_value_get_data_as_utf8_string_size: unable to determine size of value data as UTF-8 string.
libpff_record_entry_get_data_as_utf8_string_size_with_codepage: unable to determine size of value data as UTF-8 string.
libpff_internal_item_get_entry_value_utf8_string_size: unable to retrieve UTF-8 string size.
libpff_message_get_entry_value_utf8_string_size: unable to retrieve UTF-8 string size.

I will try to split the (probably corrupt) .pst and attach a sample, but as spammers and hackers increasingly try to exploit anything they can, this sort of thing will only show up more often.

joachimmetz commented 1 year ago

Are you sure the string is UTF-16? Or could it be Windows UCS-2?
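One way to gather evidence for that question is to scan the raw value for surrogate code units that are not part of a valid high/low pair. The sketch below assumes the raw bytes of the property are available and little-endian; `find_unpaired_surrogates` is a hypothetical helper, not part of libpff or pypff. Surrogate code units (0xD800 to 0xDFFF) are not assigned characters in UCS-2 either, so an unpaired one points to damaged data rather than to a different encoding, although this check alone does not prove it.

```python
import struct
from typing import List


def find_unpaired_surrogates(data: bytes) -> List[int]:
    """Return byte offsets of UTF-16LE code units that are unpaired surrogates."""
    count = len(data) // 2  # a trailing odd byte is ignored
    units = struct.unpack("<%dH" % count, data[: count * 2])
    offsets = []
    i = 0
    while i < count:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < count and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # valid pair, skip both code units
                continue
            offsets.append(i * 2)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate without a preceding high surrogate.
            offsets.append(i * 2)
        i += 1
    return offsets
```

An empty result means every surrogate is correctly paired; any offsets returned mark where a strict UTF-16 decoder would stop.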

deajan commented 1 year ago

I too have that error when trying to read some PST files.

Traceback (most recent call last):
  File "/stor/user/ext/extractor.py", line 38, in <module>
    extract_eml(file)
  File "/stor/user/ext/extractor.py", line 26, in extract_eml
    print(message.plain_text_body)
OSError: pypff_message_get_plain_text_body: unable to retrieve plain text body size.
libuna_unicode_character_copy_from_utf16_stream: unsupported UTF-16 character.
libuna_utf8_string_size_from_utf16_stream: unable to copy Unicode character from UTF-16 stream.
libpff_mapi_value_get_data_as_utf8_string_size: unable to determine size of value data as UTF-8 string.
libpff_record_entry_get_data_as_utf8_string_size_with_codepage: unable to determine size of value data as UTF-8 string.
libpff_message_get_plain_text_body_size: unable to determine message body size.

I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with the libpff Python bindings. I've loaded the PST file in Outlook and extracted the offending message as both .msg and .txt files, but file -bi under Linux gives me application/vnd.ms-outlook; charset=binary and text/plain; charset=unknown-8bit.

@joachimmetz What do I need to do to get you the encoding?

joachimmetz commented 1 year ago

> I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with libpff python bindings.

There is https://github.com/libyal/libpff/wiki/Troubleshooting#format-or-behavioral-errors. You don't need to use the Python bindings, given that the traceback hints that the issue surfaces in libuna_utf8_string_size_from_utf16_stream.

> I've loaded the PST file in outlook and extracted the offending message as both .msg and .txt files, but file -bi under linux gives me application/vnd.ms-outlook; charset=binary and text/plain; charset=unknown-8bit.

Please also read up on what a PST is (a MAPI database) and how information is stored. Libpff is intended to provide you low-level access to the data format, but you'll need to understand how the information on top of that is organized.

deajan commented 1 year ago

As far as I understand the error, only one message affects the extraction. If I skip this message, everything works fine, so it doesn't seem like a PST data format problem but rather an encoding issue in the affected mail. I'd love to help debug, but I can only do this via Python; I'm not a C guy. Is there perhaps a --debug parameter in the Python bindings?
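The "skip this message" workaround can be sketched in Python as below. Both tracebacks in this thread show the failure surfacing as an `OSError` when an attribute such as `message.subject` or `message.plain_text_body` is read, so the sketch only catches that exception around the attribute access. `safe_get` and `FakeBrokenMessage` are hypothetical names; the fake class stands in for a real message object so that the example runs without a PST file.

```python
def safe_get(obj, attribute, default=None):
    """Return obj.<attribute>, or default if reading it raises OSError."""
    try:
        return getattr(obj, attribute)
    except OSError as exc:
        # Report the failure instead of aborting the whole extraction.
        print("skipping %s: %s" % (attribute, exc))
        return default


class FakeBrokenMessage:
    """Stand-in for a message whose subject cannot be converted."""

    @property
    def subject(self):
        raise OSError("pypff_message_get_subject: unable to retrieve subject size.")

    @property
    def plain_text_body(self):
        return "hello"


message = FakeBrokenMessage()
subject = safe_get(message, "subject", default="")  # falls back to ""
body = safe_get(message, "plain_text_body")         # read normally
```

This keeps the extraction loop running, but the affected value is lost rather than partially converted, so logging which message was skipped is worthwhile.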

libyal / libpff

How do I skip over invalid UTF-16 string? #99