ironfede / openmcdf

Microsoft Compound File .net component - pure C# - netstandard 2.0
Mozilla Public License 2.0
308 stars 76 forks source link

System.ArgumentOutOfRangeException when reading properties from a document that contains LPWSTR/Vector properties #98

Closed Numpsy closed 1 month ago

Numpsy commented 1 year ago

Hi, I was using OpenMcdf.Extensions to read some office documents of various sorts, and was getting a System.ArgumentOutOfRangeException for a certain test Excel spreadsheet.

Debugging throught it, the property it was trying to read at the time was of type VT_VECTOR | VT_LPWSTR, and it seemed to be because it was getting the lengths of the strings in the Vector after the first wrong and trying to read much more data than needed.

I saw https://github.com/ironfede/openmcdf/pull/74 that fixed an issue similar to this, but - I don't think that's quite right, as the MS documentation at https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/9660cb24-953a-4e60-adf2-37cc0e779d19 says that each string should be padded to a multiple of 4 bytes length, and the existing vector code at https://github.com/ironfede/openmcdf/blob/69cc938045fa69599c1717db8eb6f0568a74386f/sources/OpenMcdf.Extensions/OLEProperties/TypedPropertyValue.cs#L107 will only handle padding for the whole Vector.

So - looks to me like it should consume any padding bytes per-string entry, rather than per-vector?

Numpsy commented 1 year ago

Ok, I see the issue -

The MS doc at https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/3ef02e83-afef-4b6c-9585-c109edd24e07 says that the PIDDSI_DOCPARTS property:

MUST be a [VtVecUnalignedLpstr](https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/523af67d-d69a-4eb1-848b-a299674ac9b8) (section 2.3.3.1.10) or [VtVecLpwstr](https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/2dc59d31-a903-49bb-a1f1-45fc01c8b8b0) property (section 2.3.3.1.8).

So, if the strings are LPWSTR then they must be padded, but if they're LPSTR then they shouldn't be - so the current code works for LPSTR where there is no padding, but breaks for LPWSTR because not skipping the padding causes the length calculations to get messed up.

It's all so simple :-(

ironfede commented 1 year ago

Thank you very much @Numpsy for your precise analysis and PR #99 . I've spot this issue and honestly speaking it's a little bit difficult to fix for a VtVecUNALIGNEDSomething since these properties are slightly differents from MS-OLEPS specs I've used and are specific to some properties PIDDSI_[...] like the ones you've mentioned. Anyway You're right when you say that properties should be padded and not vectors: Is it possible for you to enahnce your PR removing the padding for the vector type and adding it also to LPTSTR as well as to LPWSTR for a tentative fix of this issue? I think that this should fix the LPWSTR and breaking the LPSTR in the PIDDSI_DOCPARTS but should be a "better" way to fail and starting point to a specific general fix for those "strange" properties in SummaryInformation and DocumentSummaryInformation Property Storage Set. Many Thanks!

Numpsy commented 1 year ago

I think that this should fix the LPWSTR and breaking the LPSTR in the PIDDSI_DOCPARTS

I tried that and it does - there are some existing unit tests for the '_Test.ppt' file that start failing in a siimilar way to how the LPWSTR ones currently do.

Looking at the MS docs I'm not sure how to differentiate between a vector of LPSTR and a vector of unligned LPSTR other than having some hard coded knowledge of the property types, although looking at the list of allowed properties in the document summary information I don't see any other vector types and the user defined properties doesn't ist vector as a supported type

Numpsy commented 1 year ago

Is it reasonable to add a VT_UnalignedLPSTR_Property class to PropertyFactory alongside the existing VT_LPSTR_Property class to handle different padding behaviours? (TBD on how to decide which one to use)

ironfede commented 1 year ago

Yes absolutely. I think that a delegation logic following ole container type should be used . This implies that PropertyFactory should be refactored out of its "Factory" pure pattern to get a parameter in the constructor indicating the parent container type. I think that this could be useful in its general form as containers dictates properties available and could be further expanded as concept.

Numpsy commented 7 months ago

Still hoping to get back to looking at this, but for now a thought I had on the

This implies that PropertyFactory should be refactored out of its "Factory" pure pattern to get a parameter in the constructor indicating the parent container type.

comment - you could possibly have different factory classes/instances for the different property set types, which contain the specific knowledge about available properties in each set?

Numpsy commented 1 month ago

@ironfede A question/observation about the lastest changes:

I was just trying to get #108 updated on top of the latest changes and was getting an exception when reading my new test excel file that wasn't there before, because the ReadBytes at https://github.com/ironfede/openmcdf/blob/2c60fa88fc7e98f918cdf8606e05f0b076c53a70/sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs#L512 was trying to read -2 bytes when reading a Vector/LPWSTR property.

Looking at the current code, I'm wondering if the change in https://github.com/ironfede/openmcdf/blob/2c60fa88fc7e98f918cdf8606e05f0b076c53a70/sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs#L512 to take 2 bytes off the length read to skip the terminating null might have broken something in the Vector case, because in the second iteration of the loop at https://github.com/ironfede/openmcdf/blob/2c60fa88fc7e98f918cdf8606e05f0b076c53a70/sources/OpenMcdf.Extensions/OLEProperties/TypedPropertyValue.cs#L99 the stream is now positioned at the null terminator such that it reads the nulls as a length and thinks the length is 0 when it shouldn't be?

Basically I'm wondering if VT_LPWSTR_Property.ReadScalarValue should skip over the two null bytes rather then ignoring them?

ironfede commented 1 month ago

This is becoming a little bit hard to follow... I think You've found really a good point here: that (-2) subtraction is absolutely wrong imho. If it is embedded in the vector loop reading, it's going to break indexing logic so I think it should be skipped to avoid presenting a null character to upper layers of the presentation and not ignored in this specific property type.

Do you have a test file to check a fix for VECTOR|VT_LPWSTR with a specific PR/Unit test introducing the skipping and removing the -2 subtraction?

Many many thanks @Numpsy !

Numpsy commented 1 month ago

I noticed it when testing with the SampleWorkBook_bug98.xls test file added in #108

This is becoming a little bit hard to follow

Yeah, there are multiple issues that could perhaps to with describing a bit more cleanly now

ironfede commented 1 month ago

I've introduced the null character skipping. Please let me know if this is ok in your opinion @Numpsy so we can close this issue. Many thanks.

Numpsy commented 1 month ago

It seems ok to me now. I'll get a test build tommorow and try to run it over some more old files as an extended test, but hopefully the OP issue is fixed now.