This PR reduces memory consumption of PublishSlides on pptx files.
The PptxContent copy routine checks for duplicate media. Previously, images/media were kept in memory the whole time (as a cache, with the content copied to a byte array), and each cache lookup compared the candidate against all cached media byte by byte. Now a SHA256 hash is computed over the media content and only the hash is stored in memory (256 bits, i.e. 32 bytes, per media item), so cache lookup is based on content type + hash. This removes the media byte content from memory and speeds up cache lookup, at the cost of computing a hash for each media item.
This improvement also affects all operations where CopyMedia/Images are involved (not only PublishSlides).
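A minimal sketch of the content type + hash lookup described above. `MediaCache`, `GetOrAdd`, and the relationship-id strings are hypothetical names for illustration, not the types or members used in this PR; the hash is stored here as a Base64 string for simplicity rather than as raw 32 bytes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical illustration: a duplicate-media cache keyed by (content type, SHA256 hash).
// Only the hash is retained, never the media bytes themselves.
public sealed class MediaCache
{
    private readonly Dictionary<(string ContentType, string Hash), string> _entries =
        new Dictionary<(string ContentType, string Hash), string>();

    public int Count => _entries.Count;

    // Hashes the stream from its current position to the end.
    public static string ComputeHash(Stream content)
    {
        using (var sha = SHA256.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(content));
        }
    }

    // Returns the id of an identical media item seen earlier,
    // or registers newId for this content and returns it.
    public string GetOrAdd(string contentType, Stream content, string newId)
    {
        var key = (contentType, ComputeHash(content));
        if (_entries.TryGetValue(key, out var existingId))
        {
            return existingId;
        }
        _entries.Add(key, newId);
        return newId;
    }
}
```

With this shape, a lookup is a single dictionary probe instead of a byte-by-byte comparison against every cached item, and two items with the same bytes but different content types are still treated as distinct.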
Rough perf stats on my 1.8 GB pptx file (tested via dotMemory):
| Method | Allocated objs | Allocated bytes | PeakMemory (bytes) | Time |
| --- | --- | --- | --- | --- |
| PublishSlides_Original | 4,911,955 | 21,079,000,567 | 4,120,592,384 | 00:01:16.5428780 |
| PublishSlides_Optimized | 4,909,955 | 13,030,037,546 | 1,503,330,304 | 00:01:20.7666303 |
Peak memory is retrieved via Process.GetCurrentProcess().PeakWorkingSet64
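For reference, a rough sketch of how peak memory and elapsed time could be captured around a single run. `PeakMemoryProbe` and `Measure` are hypothetical helper names, not part of this PR, and the allocation counts in the table come from dotMemory rather than from code like this.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: times an action and reads the process-wide peak working set.
public static class PeakMemoryProbe
{
    public static (TimeSpan Elapsed, long PeakWorkingSetBytes) Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();

        // PeakWorkingSet64 covers the whole process lifetime, not just this call.
        using (var process = Process.GetCurrentProcess())
        {
            return (stopwatch.Elapsed, process.PeakWorkingSet64);
        }
    }
}
```

Because the peak working set is tracked for the whole process, each variant would need to run in its own process for the two numbers to be comparable.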