ironfede / openmcdf

Microsoft Compound File .net component - pure C# - netstandard 2.0
Mozilla Public License 2.0
302 stars 73 forks source link

OutOfMemoryException when parsing Excel document / endless while-loop #30

Closed henning-krause closed 5 years ago

henning-krause commented 5 years ago

I'm trying to parse an Excel document (xls) using the OpenMcdf.Extensions package.

We call AsOLEProperties extension method and eventually, the execution gets stuck in this while loop (https://github.com/ironfede/openmcdf/blob/master/sources/OpenMcdf/CompoundFile.cs#L1493-L1514) :

while (true)
{
    if (nextSecID == Sector.ENDOFCHAIN)
        break;

    Sector ms = new Sector(Sector.MINISECTOR_SIZE, sourceStream);
    byte[] temp = new byte[Sector.MINISECTOR_SIZE];

    ms.Id = nextSecID;
    ms.Type = SectorType.Mini;

    miniStreamView.Seek(nextSecID * Sector.MINISECTOR_SIZE, SeekOrigin.Begin);
    miniStreamView.Read(ms.GetData(), 0, Sector.MINISECTOR_SIZE);

    result.Add(ms);

    miniFATView.Seek(nextSecID * 4, SeekOrigin.Begin);
    nextSecID = miniFATReader.ReadInt32();
}

When the loop is entered, the nextSecID is 27. At the end of the loop the nextSecID is set to 0. And the nextSecID keeps being null as the same data is read on each loop.

Is 0 even a valid value for nextSecID?

Any idea, what to do about this?

The document in question can be opened fine in Excel. Unfortunately, we got it from a customer of ours, so we can't share it.

michaelp commented 5 years ago

I think i have a fix for this but i am in the middle of the release crunch time. @henning-krause can it wait or you need it urgently?

henning-krause commented 5 years ago

@michaelp My colleague already added a PR (https://github.com/ironfede/openmcdf/pull/31) which fixes this. Until you integrate this in the official build, we'll use our fork.

ironfede commented 5 years ago

@henning-krause , fix provided doesn't prevent exception on file loading by OpenMcdf in the described use-case. Is this the expected behaviour? Fix is ok from a specification point of view and will be integrated asap but it would really useful if you could provide an example excel file without user-data that triggers the described issue in order to understand minifat chaining in the specific case. Has the original file been produced by Excel?

ironfede commented 5 years ago

@henning-krause : pr31 merged to master. @michaelp Do you have any advice/idea on this strange minifat chaining? A corrupted file?

henning-krause commented 5 years ago

@ironfede Of course, this doesn't fix the underlying problem, but at least the load operation fails fast and does not cause an OutOfMemoryException. So it's a quick fix.

That being said, I don't even think that this PR fixed all problems - only the simple ones. In multiple places, the following check is performed:

if (next != nextSecID)
    nextSecID = next;
else
    throw new CFCorruptedFileException("Cyclic sector chain found. File is corrupted");

This certainly helps when you have a corrupt file where one sector points to itself. But what if you have something like a cyclic chain?

1 -> 2 -> 3 -> 1

This wouldn't be catched and we would be in the same mess as before. I would propose a change in that check that we keep a list of all processed ids and if we hit one we had before, we'll throw the CFCorruptedFileException.

As for the source file: It was created by one of our customers. I'll try to strip out all the content and to reproduce the error. If it still happens, I might be able to share it with you.

henning-krause commented 5 years ago

@ironfede shouldn't we implement the safeguards I wrote about in my last message? We're happy to provide a PR for that.

ironfede commented 5 years ago

Ok, thank you @henning-krause ! I've reopened issue.

Numpsy commented 5 years ago

Hi,

Not sure what the situation is with the original repro for this, but on a related note:

I've been been trying to have a play with SharpFuzz, and when Itried it against OpenMcdf, it produced the file out_of_memory.zip which causes out of memory errors if you try to load it (i only had a quick look at debugging it, but it did look like it included repeated sector ids as suggested above).

Numpsy commented 5 years ago

would something like https://github.com/Numpsy/openmcdf/commit/e9badd8d2820fb94b344d1451b01327bab66bc6d work as approach to treat repeated sector ids as an error?

henning-krause commented 5 years ago

would something like Numpsy@e9badd8 work as approach to treat repeated sector ids as an error?

I think that would work - not ideal from a performance point of view. However, this check should be implemented in all the places where this error can occur: GetMiniSectorChain, GetFatSectorChain and GetNormalSectorChain.

@ironfede Would you agree? If yes, I would implement the necessary changes.

Numpsy commented 5 years ago

Yes, there are potentially issues in multiple places (there is a file attached to #40 that makes it blow up in GetDifatSectorChain).