Open jeremy-visionaid opened 1 day ago
As I began to understand the OpenMcdf code better, and the CFB format more generally, while investigating #184, I started to think that it might be quite difficult to add the features I've previously mentioned to v2. So, at the end of last week I took the liberty of starting some proof of concept work on a new approach for version 3. I've adapted/written code to parse headers and directory entries, enumerate FAT sectors, traverse sector chains and read the contents of a stream. There's obviously quite a lot of functionality missing (substorage traversal, reading the mini FAT, any kind of writing), but it is at least enough to spin up the equivalent of the "InMemory" benchmark:

Windows Structured Storage (ILockBytes over a MemoryStream)

| Method | BufferSize | TotalStreamSize | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|
| Test | 1048576 | 1048576 | 215.4 us | 2.25 us | 2.11 us | 440 B |
| Test | 524288 | 1048576 | 215.6 us | 4.07 us | 4.36 us | 440 B |
| Test | 262144 | 1048576 | 212.6 us | 4.06 us | 4.51 us | 440 B |
| Test | 131072 | 1048576 | 211.4 us | 1.71 us | 1.42 us | 440 B |
| Test | 4096 | 1048576 | 205.6 us | 1.78 us | 1.58 us | 440 B |
| Test | 1024 | 1048576 | 237.9 us | 4.69 us | 4.39 us | 440 B |
| Test | 512 | 1048576 | 307.4 us | 6.02 us | 9.72 us | 440 B |


OpenMcdf v2.3.1

| Method | BufferSize | TotalStreamSize | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| Test | 1048576 | 1048576 | 187.9 us | 3.71 us | 8.14 us | 137.6953 | 68.6035 | - | 1.1 MB |
| Test | 524288 | 1048576 | 196.9 us | 3.85 us | 5.39 us | 140.1367 | 55.1758 | - | 1.12 MB |
| Test | 262144 | 1048576 | 219.8 us | 2.81 us | 2.76 us | 144.2871 | 72.0215 | - | 1.15 MB |
| Test | 131072 | 1048576 | 276.2 us | 5.04 us | 4.47 us | 152.8320 | 76.1719 | - | 1.22 MB |
| Test | 4096 | 1048576 | 3,899.4 us | 74.65 us | 69.83 us | 671.8750 | 335.9375 | - | 5.41 MB |
| Test | 1024 | 1048576 | 15,455.3 us | 141.46 us | 125.40 us | 2281.2500 | 375.0000 | 156.2500 | 18.37 MB |
| Test | 512 | 1048576 | 30,795.2 us | 614.23 us | 880.90 us | 4468.7500 | 437.5000 | 187.5000 | 35.66 MB |


OpenMcdf v3 Proof of Concept

| Method | BufferSize | TotalStreamSize | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|---|---|
| Test | 1048576 | 1048576 | 59.28 us | 0.150 us | 0.125 us | 0.1221 | 1.96 KB |
| Test | 524288 | 1048576 | 60.57 us | 0.117 us | 0.092 us | 0.1831 | 1.96 KB |
| Test | 262144 | 1048576 | 59.48 us | 0.099 us | 0.083 us | 0.1831 | 1.96 KB |
| Test | 131072 | 1048576 | 60.04 us | 0.790 us | 0.739 us | 0.1831 | 1.96 KB |
| Test | 4096 | 1048576 | 62.56 us | 1.119 us | 1.046 us | 0.1221 | 1.96 KB |
| Test | 1024 | 1048576 | 76.02 us | 0.265 us | 0.248 us | 0.1221 | 1.96 KB |
| Test | 512 | 1048576 | 75.58 us | 0.997 us | 0.933 us | 0.1221 | 1.96 KB |

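So, compared with v2.3.1 there are some pretty big wins to be had on reading, while also enforcing reasonably strict validation. Performance is roughly 400x faster for short reads (about 30,795 us down to about 76 us with a 512-byte buffer), and memory use drops sharply (35.66 MB down to 1.96 KB allocated for the same case; Gen0 GCs are drastically reduced, and Gen1/Gen2 GCs are eliminated).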
In the proof of concept, BinaryReader and BinaryWriter are extended to handle CFB types, and there is only one reader and one writer stored in a context (along with the header) that is shared across objects that need access to it. Sectors are lightweight structs, mostly to record their ID and map to their position within the CFB stream/file. There are a couple of enumerators that do the main work:
- FatSectorEnumerator: Enumerates the FAT sectors from the header's DIFAT array and the DIFAT chain.
- FatSectorChainEnumerator: Enumerates a chain of FAT sectors for an entry/directory/FAT.
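
To illustrate the idea behind the chain enumerator and the sector-to-position mapping, here is a minimal sketch in Python. It is not the proof-of-concept code: `sector_offset` and `iter_chain` are hypothetical names, and the FAT is assumed to be already loaded into a plain list. The special values come from the CFB specification, and the checks mirror the "throw if invalid/out of bounds" style of validation.

```python
from typing import Iterator, Sequence

# Special sector values defined by the CFB specification ([MS-CFB]).
MAXREGSECT = 0xFFFFFFFA  # largest regular sector ID
ENDOFCHAIN = 0xFFFFFFFE  # marks the end of a sector chain


def sector_offset(sector_id: int, sector_size: int) -> int:
    """Byte position of a sector; the header occupies the first sector-sized slot."""
    return (sector_id + 1) * sector_size


def iter_chain(fat: Sequence[int], start: int) -> Iterator[int]:
    """Yield the sector IDs of one chain, raising on corrupt links.

    Hypothetical helper: `fat` is assumed to be the whole FAT held in memory,
    where fat[i] is the ID of the sector that follows sector i.
    """
    current = start
    steps = 0
    while current != ENDOFCHAIN:
        if current > MAXREGSECT or current >= len(fat):
            raise ValueError(f"invalid sector ID in chain: {current:#x}")
        if steps >= len(fat):
            # A valid chain can never visit more sectors than the FAT describes.
            raise ValueError("sector chain contains a cycle")
        yield current
        current = fat[current]
        steps += 1


# Usage: a three-sector chain 0 -> 1 -> 2 in a file with 512-byte sectors.
example_fat = [1, 2, ENDOFCHAIN, ENDOFCHAIN]
example_offsets = [sector_offset(sid, 512) for sid in iter_chain(example_fat, 0)]
```

Because the chain is walked lazily, nothing is allocated per sector beyond the current ID, which is the property that keeps short reads cheap in this kind of design.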
Although I haven't done any code to write data yet, I'm thinking the enumerators might be converted to mutable iterators which should also be reasonably fast/efficient. I'll share some code when I've cleaned it up and progressed a bit further!
Now that v2.4 looks like it's coming together, I was wondering what folks' thoughts were on objectives for v3?
If I understand correctly, the values/objectives for v2 are:
- Pure dotnet implementation
- Maximized client compatibility (i.e. netstandard2.0, net4.0)
- Easy traversal of storages/streams
- Easy manipulation of stream data
There are some goals I have in mind for v3:
- Support 16 TB files (i.e. the maximum 0xFFFFFFFA sector count and therefore uint sector IDs)
- Support transactions (e.g. scratch data rather than snapshot copy)
- Support consolidation on commit (e.g. online rather than copy)
- Revised API to follow dotnet conventions (e.g. CFStream by implementing Stream directly instead of via a decorator)
- Idiomatic exception hierarchy (Review exception hierarchy for v3 #146)
- Improved performance
- Reduced memory usage
- Nullable attributes/static analysis
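
As a quick check of the 16 TB figure in the first goal, the arithmetic can be sketched as below (Python, purely illustrative). It assumes 4096-byte sectors, as used by version 4 compound files, and takes the 0xFFFFFFFA sector count as quoted above. The comparison with a signed 32-bit limit is my own illustration of why unsigned IDs are needed, not a statement about how any particular release stores them.

```python
SECTOR_SIZE = 4096        # assumed: sector size of a version 4 compound file
MAX_SECTORS = 0xFFFFFFFA  # sector count quoted in the goal above
INT32_MAX = 2**31 - 1     # largest value of a signed 32-bit integer

# Largest data area addressable with the full range of sector IDs.
max_bytes = MAX_SECTORS * SECTOR_SIZE

# What would be reachable if sector IDs were limited to a signed 32-bit int.
signed_limit_bytes = INT32_MAX * SECTOR_SIZE

print(f"full range:     {max_bytes / 2**40:.6f} TiB")
print(f"signed 32-bit:  {signed_limit_bytes / 2**40:.6f} TiB")
print(f"fits in int32:  {MAX_SECTORS <= INT32_MAX}")
```

The full range comes out just short of 16 TiB, while a signed 32-bit ID would cap out at just under half of that.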
Other thoughts:
- Multi-targeting for netstandard2.0 and net8.0
- Spans (System.Memory if targeting netstandard2.0)
- async (Currently no async BinaryReader/Writer: Add async overloads to BinaryReader/Writer dotnet/runtime#17229)
Honestly speaking, it's a perfect summary. I think the 2.4 target is almost reached. I would not introduce new features in this branch, since it has reached a certain level of maturity. v3 should take the good ideas from it and refactor them to allow better separation of logic, and to avoid the up-and-down runs needed to allocate and persist sector chains. That is where the big performance penalties lie, even if the sector chain is a fairly compact representation and working unit for CFB handling.
@ironfede Yesterday, I added a dedicated enumerator for directory entries in a FAT chain and another enumerator for the directory tree, so it can now traverse storages and streams as part of a tree (it could only traverse them as a list before). I also improved the enumerators so they're a bit easier to follow, and improved validation (enumerators and sector offsets throw if you try to access something that's invalid/out of bounds). I'll have a look at implementing support for mini FAT sectors today, and then I think I'll be at the point where I have something worth sharing for comments and feedback. But essentially, aside from some clean-up and further validation/testing, I think the POC already meets the following objectives:
@Numpsy Looks like you might have a particular interest in the OLE Property Set Data Structures? Do you have anything you'd like to see for v3? I can't say I know too much about it, so aside from some nit-pick refactoring work my only real comment is that perhaps OpenMcdf.Extensions should be renamed to OpenMcdf.Ole (i.e. explicitly about OLE only). Especially since we probably won't require a decorator for streams in v3.
My current use case is reading and writing metadata (summary information etc) in Office documents, so things like really massive files aren't really an issue. The recent changes have shaved a nice amount of memory allocations from reading said properties, but as the time taken is sub-millisecond it's hard to measure changes (though any gain is nice).
As far as the API goes, I think it would be nice to review how it presents the different property sets. As well as SummaryInformation/DocumentSummaryInformation, there is some amount of support for others, but it's not clear how far the intended support goes. @farfilli has raised some issues in that area, so maybe he has some thoughts on possible improvements?
There's also scope for a more complete set of functions for adding/updating/deleting properties (e.g. https://github.com/ironfede/openmcdf/pull/190) - that's a higher level API than the changes to the storage part.