Open jengelh opened 3 years ago
You currently cannot tell libuna to skip over nonconvertible sequences. The reason for this is that from a data format analysis perspective you don't want to silently skip such errors.
Also can you tell me more about this PST file to rule out some older version of the format maybe using UCS-2 instead of UTF-16
The reason for this is that from a data format analysis perspective you don't want to silently skip such errors.
Certainly; however, after the “investigative part”, when truncation is ok, it comes as unusual having to spin up iconv/icu and do a UTF-16 -> UTF-8//IGNORE conversion; I'd prefer just reusing the conversion from pff/una, since it's already a dependency.
Also can you tell me more about this PST file to rule out some older version of the format maybe using UCS-2 instead of UTF-16
All I know is that this PST was generated with Outlook 2010 (SP3?). \xD8\x3D is not a good codepoint even in UCS-2. It is possible that the data store already has had those bytes and Outlook just passed it on when it exported to PST.
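For the "investigative part", a quick way to see whether a raw property value holds proper UTF-16 surrogate pairs or stray surrogate code units is to scan it in plain Python. This is only a sketch and does not use libpff or libuna: the function name `find_unpaired_surrogates` is made up for illustration, and it assumes you already have the raw little-endian bytes of the string value in hand.

```python
import struct


def find_unpaired_surrogates(data: bytes, byteorder: str = "little"):
    """Return (byte_offset, code_unit) for every unpaired surrogate.

    Hypothetical helper for illustration; not part of libpff or libuna.
    """
    fmt = "<H" if byteorder == "little" else ">H"
    # Split the buffer into 16-bit code units (a trailing odd byte is ignored).
    units = [struct.unpack_from(fmt, data, i)[0]
             for i in range(0, len(data) - 1, 2)]
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate is only valid when a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append((i * 2, unit))
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            bad.append((i * 2, unit))
        i += 1
    return bad


# The bytes 3D D8 read as little-endian give the code unit 0xD83D,
# a high surrogate; here it is followed by "B" instead of a low surrogate.
sample = "A".encode("utf-16-le") + b"\x3d\xd8" + "B".encode("utf-16-le")
print(find_unpaired_surrogates(sample))
```

If the scan finds valid pairs elsewhere in the same value, the data is UTF-16 rather than UCS-2; an unpaired surrogate is not a valid character under either reading.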
I also have some nasty .pst that are triggering a similar issue:
$ anaconda3/bin/python ./extract.py
Writing messages to /Users/markmail/Desktop/Outlook Files/Backup
Number of messages: 4647
Number of messages: 3296
Traceback (most recent call last):
File "/Users/markmail/./extract.py", line 25, in <module>
Subj = message.subject
OSError: pypff_message_get_subject: unable to retrieve subject size. libuna_unicode_character_copy_from_utf16_stream: unsupported UTF-16 character. libuna_utf8_string_size_from_utf16_stream: unable to copy Unicode character from UTF-16 stream. libpff_mapi_value_get_data_as_utf8_string_size: unable to determine size of value data as UTF-8 string. libpff_record_entry_get_data_as_utf8_string_size_with_codepage: unable to determine size of value data as UTF-8 string. libpff_internal_item_get_entry_value_utf8_string_size: unable to retrieve UTF-8 string size. libpff_message_get_entry_value_utf8_string_size: unable to retrieve UTF-8 string size.
Will try to split the (probably corrupt) .pst and attach a sample, but as spammers & hackers are increasingly trying to exploit anything they can, this sort of thing will only manifest more
are you sure the string is UTF-16? or could it be Windows UCS-2?
I too have that error when trying to read some PST files.
Traceback (most recent call last):
File "/stor/user/ext/extractor.py", line 38, in <module>
extract_eml(file)
File "/stor/user/ext/extractor.py", line 26, in extract_eml
print(message.plain_text_body)
OSError: pypff_message_get_plain_text_body: unable to retrieve plain text body size. libuna_unicode_character_copy_from_utf16_stream: unsupported UTF-16 character. libuna_utf8_string_size_from_utf16_stream: unable to copy Unicode character from UTF-16 stream. libpff_mapi_value_get_data_as_utf8_string_size: unable to determine size of value data as UTF-8 string. libpff_record_entry_get_data_as_utf8_string_size_with_codepage: unable to determine size of value data as UTF-8 string. libpff_message_get_plain_text_body_size: unable to determine message body size.
I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with libpff python bindings.
I've loaded the PST file in outlook and extracted the offending message as both .msg and .txt files, but file -bi under linux gives me application/vnd.ms-outlook; charset=binary and text/plain; charset=unknown-8bit.
@joachimmetz What do I need to do to get you the encoding ?
I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with libpff python bindings.
There is https://github.com/libyal/libpff/wiki/Troubleshooting#format-or-behavioral-errors; you don't need to use the Python bindings, given that the traceback hints that the issue surfaces in libuna_utf8_string_size_from_utf16_stream.
I've loaded the PST file in outlook and extracted the offending message as both .msg and .txt files, but file -bi under linux gives me application/vnd.ms-outlook; charset=binary and text/plain; charset=unknown-8bit.
Please also read up on what a PST is (a MAPI database) and how information is stored. Libpff is intended to provide you low-level access to the data format, but you'll need to understand how the information on top of that is organized.
As far as I understood the error, there's only one message that affects the extraction. If I skip this message, everything works fine so it doesn't seem like a PST data format problem, but rather an encoding issue in the affected mail. I'd love to help debug, but I can only do this via python, I'm not a C guy. Any --debug parameter that exists in python bindings perhaps ?
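Until there is a way to relax the conversion, skipping the one bad message from Python could look roughly like the sketch below. It relies only on the fact, visible in the tracebacks in this thread, that the failure surfaces as an OSError when the attribute is read. The helper names `safe_attr` and `collect_subjects` are made up, and the `sub_messages` / `sub_folders` attribute names are my assumption about the pypff folder objects, so adjust them to whatever your extraction script already uses.

```python
def safe_attr(item, name, default=None, errors=None):
    """Read item.<name>, returning `default` if the binding raises OSError.

    Hypothetical helper; the tracebacks in this thread show the conversion
    failure arriving as an OSError on attribute access.
    """
    try:
        return getattr(item, name)
    except OSError as exc:
        if errors is not None:
            # Keep the error text so the bad message can be inspected later.
            errors.append((name, str(exc)))
        return default


def collect_subjects(folder, errors):
    """Yield the subject of every message below `folder`, None on failure.

    Assumes folder objects expose `sub_messages` and `sub_folders`
    (an assumption about the bindings, not confirmed in this thread).
    """
    for message in folder.sub_messages:
        yield safe_attr(message, "subject", errors=errors)
    for sub_folder in folder.sub_folders:
        yield from collect_subjects(sub_folder, errors)
```

The `errors` list keeps the libuna/libpff message for each skipped value, so the offending item can still be tracked down afterwards instead of being dropped silently.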
I have a somewhat broken PST file here where a 0x3001001F property has a byte sequence \x3D\xD8 in it, which gets rightfully rejected by libuna (>4932) just as it is by /usr/bin/iconv. But unlike iconv where I can pass -t UTF-8//IGNORE, how can I tell libuna to skip over unconvertible sequences rather than aborting?
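To make the requested behaviour concrete, here is what the strict and the //IGNORE-style conversions look like with Python's built-in codecs. This is only an illustration of the semantics being asked for, not a libuna feature; the function name `utf16le_to_utf8` is made up, and it assumes the raw UTF-16LE bytes of the property are already available.

```python
def utf16le_to_utf8(data: bytes, errors: str = "ignore") -> bytes:
    """Convert UTF-16LE bytes to UTF-8 with a chosen error policy.

    errors="strict"  raises on the first bad sequence (what libuna does now),
    errors="ignore"  drops it (comparable to iconv -t UTF-8//IGNORE),
    errors="replace" substitutes U+FFFD so the damage stays visible.
    """
    return data.decode("utf-16-le", errors=errors).encode("utf-8")


# "A", then the lone code unit 0xD83D (bytes 3D D8), then "B".
raw = "A".encode("utf-16-le") + b"\x3d\xd8" + "B".encode("utf-16-le")

try:
    utf16le_to_utf8(raw, errors="strict")
except UnicodeDecodeError as exc:
    print("strict conversion failed:", exc.reason)

print(utf16le_to_utf8(raw, errors="ignore"))
print(utf16le_to_utf8(raw, errors="replace"))
```

For anything beyond a quick export, the "replace" policy may be the safer default, since it leaves a marker where data was lost rather than skipping it silently.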