sergey-tihon / Clippit

Fresh PowerTools for OpenXml
https://sergey-tihon.github.io/Clippit/
MIT License
47 stars, 18 forks

feat: PublishSlides: reduced memory consumption #53

Closed f1nzer closed 2 years ago

f1nzer commented 2 years ago

This PR addresses memory consumption on pptx PublishSlides.

The PptxContent copy routine checks for duplicate media. Previously, every image/media part stayed in memory for the whole operation (the cache held a byte-array copy of its content), and each cache lookup compared the candidate against all cached media byte by byte. Now a SHA256 hash of the media content is computed and kept in memory instead (256 bits per media item), so a cache lookup is keyed on content type + hash. This removes the media bytes from memory and speeds up cache lookups, at the cost of computing a hash for each media item.
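For readers who want to see the idea in isolation, here is a minimal Python sketch of a duplicate-media cache keyed on content type + SHA-256 digest. It is not the Clippit code (which is C#): the `MediaCache` class, the `get_or_add` method, and the `rId…` identifiers are hypothetical names chosen for illustration.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Tuple


class MediaCache:
    """Hypothetical cache: (content type, SHA-256 digest) -> id of a copied media part."""

    def __init__(self) -> None:
        # Only the 32-byte digest is retained per media item, never the media bytes.
        self._seen: Dict[Tuple[str, bytes], str] = {}

    @staticmethod
    def _digest(stream: BinaryIO, chunk_size: int = 64 * 1024) -> bytes:
        # Hash in chunks so a large media part is never fully loaded into memory.
        hasher = hashlib.sha256()
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
        return hasher.digest()

    def get_or_add(self, content_type: str, stream: BinaryIO, new_id: str) -> Tuple[str, bool]:
        """Return (id, was_added). An existing id is reused when the content matches."""
        key = (content_type, self._digest(stream))
        existing = self._seen.get(key)
        if existing is not None:
            return existing, False  # duplicate: reuse the part copied earlier
        self._seen[key] = new_id
        return new_id, True


cache = MediaCache()
first = cache.get_or_add("image/png", io.BytesIO(b"same bytes"), "rId1")
second = cache.get_or_add("image/png", io.BytesIO(b"same bytes"), "rId2")
other_type = cache.get_or_add("image/jpeg", io.BytesIO(b"same bytes"), "rId3")
print(first, second, other_type)
```

The dictionary lookup replaces the pairwise byte-by-byte comparison, so each lookup costs one hash computation plus a constant-time probe. The sketch treats equal digests as equal content, which relies on SHA-256 collisions being practically impossible.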

This improvement also affects all operations where CopyMedia/Images are involved (not only PublishSlides).

Rough perf stats on my 1.8 GB pptx file (tested via dotMemory):

| Method | Allocated objs | Allocated bytes | PeakMemory (bytes) | Time |
| --- | --- | --- | --- | --- |
| PublishSlides_Original | 4,911,955 | 21,079,000,567 | 4,120,592,384 | 00:01:16.5428780 |
| PublishSlides_Optimized | 4,909,955 | 13,030,037,546 | 1,503,330,304 | 00:01:20.7666303 |

Peak memory was read from Process.GetCurrentProcess().PeakWorkingSet64.
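To put the figures above in relative terms, the following small Python sketch computes the percentage change for each metric. The numbers are copied from the stats above. The `percent_change` helper and the dictionary layout are illustrative only.

```python
from datetime import timedelta

# Figures copied from the perf stats above (single run on a 1.8 GB pptx file).
original = {
    "allocated_bytes": 21_079_000_567,
    "peak_memory_bytes": 4_120_592_384,
    "seconds": timedelta(minutes=1, seconds=16.5428780).total_seconds(),
}
optimized = {
    "allocated_bytes": 13_030_037_546,
    "peak_memory_bytes": 1_503_330_304,
    "seconds": timedelta(minutes=1, seconds=20.7666303).total_seconds(),
}


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative means a reduction)."""
    return (after - before) / before * 100.0


changes = {name: percent_change(original[name], optimized[name]) for name in original}
for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
```

On these numbers the peak working set drops by roughly 63.5% and allocated bytes by roughly 38.2%, while wall-clock time rises by roughly 5.5%, which fits the extra hashing work. These are rough single-run figures, so the small time difference in particular should be read with caution.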

sergey-tihon commented 2 years ago

Thank you @f1nzer!