statiqdev / Statiq.Web

Statiq Web is a flexible static site generator written in .NET.
https://statiq.dev/web

Use streams for document content #42

Closed · daveaglick closed this issue 9 years ago

daveaglick commented 9 years ago

Instead of strings, use streams for document content (see discussion in #25). This will allow documents to contain either string or binary content. It should also yield better performance in cases where transformations are optimized for streaming data (instead of having to read everything into a string first). Will need to convert all existing modules over to streams. Should also add convenience getters and setters to convert the stream from/to byte arrays and strings (for getters, handle reading the stream into the array; vice versa for setters).
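A minimal sketch of what those convenience members might look like (member names like `ContentStream`, `GetBytes`, and `SetString` are hypothetical, not a committed API, and UTF-8 is assumed for the string conversions):

```csharp
using System.IO;
using System.Text;

public class Document
{
    // Canonical content is a seekable stream, so it can hold text or binary data.
    public Stream ContentStream { get; private set; } = new MemoryStream();

    // Convenience getter: buffer the stream into a byte array.
    public byte[] GetBytes()
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            ContentStream.Position = 0;
            ContentStream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Convenience getter: decode the buffered bytes as text.
    public string GetString() => Encoding.UTF8.GetString(GetBytes());

    // Convenience setters: wrap a byte array or string back into a stream.
    public void SetBytes(byte[] bytes) => ContentStream = new MemoryStream(bytes);
    public void SetString(string content) => SetBytes(Encoding.UTF8.GetBytes(content));
}
```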

daveaglick commented 9 years ago

I'm toying with the idea of using BlockingCollection&lt;T&gt; for module I/O instead of Stream. The idea is that this would let us run the modules in parallel: as data comes through the first module, it gets added to a BlockingCollection&lt;T&gt; shared with the second, which processes items as they become available and blocks while waiting on additional data from the first, and so on down the pipeline.

See http://blogs.msdn.com/b/pfxteam/archive/2010/04/14/9995613.aspx and https://github.com/slashdotdash/ParallelExtensionsExtras/blob/4df9a0843901d6449ee519a6cad828eb5a54a602/src/CoordinationDataStructures/Pipeline.cs
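A rough sketch of that hand-off between two stages (the bounded capacity and the two-stage shape are just illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;
using System.Threading.Tasks;

class PipelineSketch
{
    static void Main()
    {
        var handoff = new BlockingCollection<byte[]>(boundedCapacity: 16);

        // First module: adds chunks as they become available, then signals completion.
        Task producer = Task.Run(() =>
        {
            foreach (string chunk in new[] { "first", "second", "third" })
            {
                handoff.Add(Encoding.UTF8.GetBytes(chunk));
            }
            handoff.CompleteAdding();
        });

        // Second module: blocks waiting on additional data and drains the collection
        // until the producer calls CompleteAdding.
        Task consumer = Task.Run(() =>
        {
            foreach (byte[] chunk in handoff.GetConsumingEnumerable())
            {
                Console.WriteLine(Encoding.UTF8.GetString(chunk));
            }
        });

        Task.WaitAll(producer, consumer);
    }
}
```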

daveaglick commented 9 years ago

The more I think about this, the more I keep coming back to just using byte arrays (or plain old strings) and buffering the data in one big block from module to module.

Consider this scenario: you have an image that needs to be resized to two different sizes. With streams (or some sort of stream-like collection) you'll have to read the entire stream to perform the first resize, then re-read it to perform the second. That means you're either going back to disk (slow!) or buffering the stream, and if you're buffering anyway it would be more efficient to just store and pass the byte array in the first place. Consider also operations like string manipulation, find and replace, etc.; they're all easier to code against a single primitive object.
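For example, buffering once makes the second pass trivial (a sketch only; `ResizeImage` stands in for whatever the image module actually does):

```csharp
using System.IO;

class ResizeSketch
{
    // Hypothetical stand-in for the real resize operation.
    static void ResizeImage(Stream source, int width) { /* ... */ }

    static void Main()
    {
        // Read the file once into a single buffer...
        byte[] imageData = File.ReadAllBytes("photo.jpg");

        // ...then each resize gets its own cheap, independently seekable view of
        // the same bytes: no second trip to disk, no coordinating stream rewinds.
        using (var first = new MemoryStream(imageData))
        using (var second = new MemoryStream(imageData))
        {
            ResizeImage(first, 800);
            ResizeImage(second, 200);
        }
    }
}
```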

Of course, this isn't without problems (hence the consideration of streams in the first place).

dodyg commented 9 years ago

My questions are:

daveaglick commented 9 years ago

I've been giving this a lot of thought over the last couple of days and have finally decided on a way forward (thanks, as always, for the input @JimBobSquarePants and @dodyg). Normally I wouldn't go through so much hand-wringing and would just ship, but this is a pretty fundamental aspect of a young project, so I want to make sure to get it right. This is also going to be another long comment because I want to document the decision for my future self.

Wyam was created first and foremost because I saw a lack of static generators that could be used in more sophisticated scenarios with the ability to easily customize the content flow. Other generators are either narrowly focused on a specific use case (like blogs) or require too much complicated up-front work. The concept of easily manipulating string content is fundamental to this design goal, so I'm going to make sure that stays. I don't want users to have to worry about manipulating streams if all they want to do is a search and replace or some other simple mutation.

That said, there are also very good reasons why using strings under the hood won't be the best long-term solution. There are memory issues to contend with. There's also the matter of sending binary content through the pipeline. And it's been pointed out that encoding will become a factor too. The system has to accommodate streaming data in order to make sure we address all these potential pitfalls.

So, here's what I'm going to do:

dodyg commented 9 years ago

Great.

daveaglick commented 9 years ago

@dodyg - I still have some work to do to get all the in-built modules to use streams instead of strings, but both ReadFiles and WriteFiles now use the stream directly. The latest on the develop branch should be enough for you to start using with the ImageProcessor module (#25). You shouldn't need the ProcessFiles module any more: reading with ReadFiles should populate the IDocument.Stream property, which the image module can then read from directly to get the binary data. Just make sure that when cloning the return document with modifications, you use one of the IDocument.Clone(...) overrides that takes a Stream.
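Roughly, a binary module then looks something like this (a sketch only; the module name and exact signatures are approximations, so check the develop branch for the real shapes):

```csharp
using System.Collections.Generic;
using System.IO;

// IModule, IDocument, and IExecutionContext come from the Wyam assemblies;
// the Execute signature here is approximated.
public class ProcessImage : IModule
{
    public IEnumerable<IDocument> Execute(IReadOnlyList<IDocument> inputs, IExecutionContext context)
    {
        foreach (IDocument input in inputs)
        {
            // ReadFiles has already populated IDocument.Stream with the file bytes.
            Stream resized = Resize(input.Stream); // hypothetical image operation

            // Clone with a Stream-taking override so binary content flows onward.
            yield return input.Clone(resized);
        }
    }

    private static Stream Resize(Stream source) { /* ... */ return source; }
}
```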

daveaglick commented 9 years ago

Going to go ahead and close this out.

I decided against converting the other modules over to stream use. The reasoning is that they're either control flow (in which case they don't access content directly anyway) or string-based (Append, Concat, Replace, etc.). For the string-based modules, even if we performed the manipulation in a streaming fashion, the results would still have to be read into memory when passed to the next module to support seeking. It's faster to just read into a string in the first place, do the manipulation on IDocument.Content, and pass that down the line until a stream is needed again (see the sketch below). The only issue would be manipulating really, really large blocks of text that might overflow the available contiguous memory.
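For instance, the string path stays string-shaped end to end (again a sketch with approximated signatures, not actual module source):

```csharp
using System.Collections.Generic;

// Same approximated Wyam interfaces as in the previous sketch.
public class ReplaceToken : IModule
{
    public IEnumerable<IDocument> Execute(IReadOnlyList<IDocument> inputs, IExecutionContext context)
    {
        foreach (IDocument input in inputs)
        {
            // Manipulate IDocument.Content directly; a streaming implementation
            // would have to buffer the result anyway to keep the output seekable.
            yield return input.Clone(input.Content.Replace("@@version", "1.0.0")); // illustrative values
        }
    }
}
```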

Even if I did want to address contiguous memory issues, it's unclear how to do so. Either the seekable stream would have to be chunked (since a simple MemoryStream backing buffer has the same contiguous-memory problem), some alternate chunked string representation (like StringBuilder) would have to be used behind the scenes with string avoided everywhere, or a temporary file would have to be used as a buffer (at the expense of performance). I'm not too worried about this right now; it can be addressed later if folks start reporting OutOfMemoryException errors.

The important thing is that both ReadFiles and WriteFiles now operate on streams so that modules that deal with binary data can now do so without issue.