ironfede / openmcdf

Microsoft Compound File .net component - pure C# - netstandard 2.0
Mozilla Public License 2.0

System.ArgumentOutOfRangeException when trying to read property streams with a UTF-8 code page #33

Closed Numpsy closed 5 years ago

Numpsy commented 5 years ago

Hi,

I was trying to do a test of reading properties from a Word document using the openmcdf extensions rather than native functions, and got a

System.ArgumentOutOfRangeException : Valid values are between 0 and 65535, inclusive.
Parameter name: codepage

with the callstack

   at System.Text.Encoding.GetEncoding(Int32 codepage)
   at OpenMcdf.Extensions.OLEProperties.PropertyFactory.VT_LPSTR_Property.ReadScalarValue(BinaryReader br)
   at OpenMcdf.Extensions.OLEProperties.PropertyFactory.VT_LPSTR_Property.Read(BinaryReader br)
   at OpenMcdf.Extensions.OLEProperties.PropertyReader.ReadProperty(UInt32 propertyIdentifier, BinaryReader br)
   at OpenMcdf.Extensions.OLEProperties.PropertySetStream.Read(BinaryReader br)
   at OpenMcdf.Extensions.CFStreamExtension.AsOLEProperties(CFStream cfStream)

when calling AsOLEProperties on a SummaryInformation stream which I believe has a codepage of UTF-8. I haven't looked at it too deeply, but I've seen situations in the past where the codepage of 65001 gets interpreted as a negative number, so I'm wondering if that's what is happening here?
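The sign suspicion can be checked in isolation. The sketch below is not openmcdf code: the class and method names are hypothetical, and it assumes the code page is stored as a 2-byte little-endian value that is read back as a signed 16-bit integer.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only (not part of openmcdf).
public static class CodePageSignDemo
{
    // Reads a 2-byte little-endian code page value as a signed short,
    // which is the assumed way the value ends up negative.
    public static short ReadCodePageAsSigned(byte[] raw)
    {
        using (var br = new BinaryReader(new MemoryStream(raw)))
        {
            return br.ReadInt16();
        }
    }

    // Returns true when Encoding.GetEncoding rejects the value with the
    // ArgumentOutOfRangeException seen in the call stack.
    public static bool GetEncodingRejects(int codePage)
    {
        try
        {
            Encoding.GetEncoding(codePage);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }
}
```

65001 is 0xFDE9, so passing the bytes `{ 0xE9, 0xFD }` to `ReadCodePageAsSigned` gives -535, and `GetEncodingRejects(-535)` returns true, which would match the exception above if the value is indeed read as signed.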

Thanks.

english.presets.zip

ironfede commented 5 years ago

@Numpsy, I've fixed this issue. It's a very strange behaviour of the CodePage value (a signed short). Its max value should be 32767, but casting to an int we find 65001. I haven't yet found specific documentation for this issue but I'm looking for it. Best Regards, Federico
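One way to get from the signed short to 65001 is to reinterpret its 16 bits as unsigned before widening to an int. This is only a minimal sketch of that idea, not necessarily the change that was committed to openmcdf; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only (not the actual openmcdf fix).
public static class CodePageFixSketch
{
    // Reinterprets the 16 bits of a signed code page value as unsigned,
    // so that -535 becomes 65001 while values up to 32767 are unchanged.
    public static int ToUnsignedCodePage(short rawCodePage)
    {
        return unchecked((ushort)rawCodePage);
    }

    // Looks up the encoding using the reinterpreted value.
    public static Encoding GetEncodingForRawCodePage(short rawCodePage)
    {
        return Encoding.GetEncoding(ToUnsignedCodePage(rawCodePage));
    }
}
```

With this, `ToUnsignedCodePage(-535)` yields 65001 and `GetEncodingForRawCodePage(-535)` returns the UTF-8 encoding instead of throwing. Note that a plain cast from short to int sign-extends and would keep the value at -535, which is why the intermediate `ushort` is used here.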

Numpsy commented 5 years ago

It's a bit of an unfortunate situation. I've seen similar in C++ code that reads the property sets using the Windows native compound document APIs, and that just casts it and/or interprets it as a different type to get the correct value.

All I can really say is that Windows Explorer (in Win10 at least) seems to set the code page to UTF-8 when you change the file properties through it, and I assume it knows what it's doing.