statiqdev / Statiq.Web

Statiq Web is a flexible static site generator written in .NET.
https://statiq.dev/web

Use streams for document content #42

Closed · daveaglick closed this issue 9 years ago

daveaglick commented 9 years ago

Instead of strings, use streams for document content (see discussion in #25). This will allow documents to contain either string or binary content. It should also yield better performance in cases where transformations are optimized for streaming data (instead of having to read everything into a string first). Will need to convert all existing modules over to streams. Should also add convenience getters and setters to convert the stream from/to byte arrays and strings (for getters, handle reading the stream into the array; vice versa for setters).
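A minimal sketch of what those convenience members might look like (member names like `ContentStream`, `GetBytes`, and `SetString` are hypothetical, not a committed API, and UTF-8 is assumed for the string conversions):

```csharp
using System.IO;
using System.Text;

public class Document
{
    // Canonical content is a seekable stream, so it can hold text or binary data.
    public Stream ContentStream { get; private set; } = new MemoryStream();

    // Convenience getter: buffer the stream into a byte array.
    public byte[] GetBytes()
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            ContentStream.Position = 0;
            ContentStream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Convenience getter: decode the buffered bytes as text.
    public string GetString() => Encoding.UTF8.GetString(GetBytes());

    // Convenience setters: wrap a byte array or string back into a stream.
    public void SetBytes(byte[] bytes) => ContentStream = new MemoryStream(bytes);
    public void SetString(string content) => SetBytes(Encoding.UTF8.GetBytes(content));
}
```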

daveaglick commented 9 years ago

I'm toying with the idea of using BlockingCollection&lt;T&gt; for module I/O instead of Stream. The idea is that this would let us run the modules in parallel: as data comes through the first module, it gets added to a BlockingCollection&lt;T&gt; shared with the second, which processes items as they become available and blocks while waiting on additional data from the first, and so on down the pipeline.

See http://blogs.msdn.com/b/pfxteam/archive/2010/04/14/9995613.aspx and https://github.com/slashdotdash/ParallelExtensionsExtras/blob/4df9a0843901d6449ee519a6cad828eb5a54a602/src/CoordinationDataStructures/Pipeline.cs
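A rough sketch of that hand-off between two stages (the bounded capacity and the two-stage shape are just illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;
using System.Threading.Tasks;

class PipelineSketch
{
    static void Main()
    {
        var handoff = new BlockingCollection<byte[]>(boundedCapacity: 16);

        // First module: adds chunks as they become available, then signals completion.
        Task producer = Task.Run(() =>
        {
            foreach (string chunk in new[] { "first", "second", "third" })
            {
                handoff.Add(Encoding.UTF8.GetBytes(chunk));
            }
            handoff.CompleteAdding();
        });

        // Second module: blocks waiting on additional data and drains the collection
        // until the producer calls CompleteAdding.
        Task consumer = Task.Run(() =>
        {
            foreach (byte[] chunk in handoff.GetConsumingEnumerable())
            {
                Console.WriteLine(Encoding.UTF8.GetString(chunk));
            }
        });

        Task.WaitAll(producer, consumer);
    }
}
```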

daveaglick commented 9 years ago

The more I think about this, the more I keep coming back to just using byte arrays (or plain old strings) and buffering the data in one big block from module to module.

Consider this scenario: you have an image that needs to be resized to two different sizes. With streams (or some sort of stream-like collection) you'll have to read the entire stream to perform the first resize, then re-read it to perform the second. That means you're either going back to disk (slow!) or buffering the stream, and if you're buffering anyway it would be more efficient to just store and pass the byte array in the first place. Consider also operations like string manipulation, find and replace, etc.; they're all easier to code against a single primitive object.
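For example, buffering once makes the second pass trivial (a sketch only; `ResizeImage` stands in for whatever the image module actually does):

```csharp
using System.IO;

class ResizeSketch
{
    // Hypothetical stand-in for the real resize operation.
    static void ResizeImage(Stream source, int width) { /* ... */ }

    static void Main()
    {
        // Read the file once into a single buffer...
        byte[] imageData = File.ReadAllBytes("photo.jpg");

        // ...then each resize gets its own cheap, independently seekable view of
        // the same bytes: no second trip to disk, no coordinating stream rewinds.
        using (var first = new MemoryStream(imageData))
        using (var second = new MemoryStream(imageData))
        {
            ResizeImage(first, 800);
            ResizeImage(second, 200);
        }
    }
}
```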

Of course, this isn't without problems (hence the consideration of streams in the first place).

dodyg commented 9 years ago

My questions are:

daveaglick commented 9 years ago

I've been giving this a lot of thought over the last couple of days and have finally decided on a way forward (thanks, as always, for the input @JimBobSquarePants and @dodyg). Normally I wouldn't go through so much hand-wringing and would just ship, but this is a pretty fundamental aspect of a young project, so I want to make sure to get it right. This is also going to be another long comment because I want to document the decision for my future self.

Wyam was created first and foremost because I saw a lack of static generators that could be used in more sophisticated scenarios with the ability to easily customize the content flow. Other generators are either narrowly focused on a specific use case (like blogs) or require too much complicated up-front work. The concept of easily manipulating string content is fundamental to this design goal, so I'm going to make sure that stays. I don't want users to have to worry about manipulating streams if all they want to do is a search and replace or some other simple mutation.

That said, there are also very good reasons why using strings under the hood won't be the best long-term solution. There are memory issues to contend with. There's also the matter of sending binary content through the pipeline. And it's been pointed out that encoding will become a factor too. The system has to accommodate streaming data in order to make sure we address all these potential pitfalls.

So, here's what I'm going to do:

dodyg commented 9 years ago

Great.

daveaglick commented 9 years ago

@dodyg - I still have some work to do to get all the in-built modules to use streams instead of strings, but both ReadFiles and WriteFiles now use the stream directly. The latest on the develop branch should be enough for you to start using with the ImageProcessor module (#25). You shouldn't need the ProcessFiles module any more: reading with ReadFiles should populate the IDocument.Stream property, which the image module can then read from directly to get the binary data. Just make sure that when cloning the return document with modifications, you use one of the IDocument.Clone(...) overrides that takes a Stream.
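Roughly, a binary module then looks something like this (a sketch only; the module name and exact signatures are approximations, so check the develop branch for the real shapes):

```csharp
using System.Collections.Generic;
using System.IO;

// IModule, IDocument, and IExecutionContext come from the Wyam assemblies;
// the Execute signature here is approximated.
public class ProcessImage : IModule
{
    public IEnumerable<IDocument> Execute(IReadOnlyList<IDocument> inputs, IExecutionContext context)
    {
        foreach (IDocument input in inputs)
        {
            // ReadFiles has already populated IDocument.Stream with the file bytes.
            Stream resized = Resize(input.Stream); // hypothetical image operation

            // Clone with a Stream-taking override so binary content flows onward.
            yield return input.Clone(resized);
        }
    }

    private static Stream Resize(Stream source) { /* ... */ return source; }
}
```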

daveaglick commented 9 years ago

Going to go ahead and close this out.

I decided against converting the other modules over to stream use. The reasoning is that they're either control flow (in which case they don't access content directly anyway) or string-based (Append, Concat, Replace, etc.). For the string-based modules, even if we performed the manipulation in a streaming fashion, the results would still have to be read into memory when passed to the next module to support seeking. It's faster to just read into a string in the first place, do the manipulation on IDocument.Content, and pass that down the line until a stream is needed again (see the sketch below). The only issue would be manipulating really, really large blocks of text that might overflow the available contiguous memory.
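For instance, the string path stays string-shaped end to end (again a sketch with approximated signatures, not actual module source):

```csharp
using System.Collections.Generic;

// Same approximated Wyam interfaces as in the previous sketch.
public class ReplaceToken : IModule
{
    public IEnumerable<IDocument> Execute(IReadOnlyList<IDocument> inputs, IExecutionContext context)
    {
        foreach (IDocument input in inputs)
        {
            // Manipulate IDocument.Content directly; a streaming implementation
            // would have to buffer the result anyway to keep the output seekable.
            yield return input.Clone(input.Content.Replace("@@version", "1.0.0")); // illustrative values
        }
    }
}
```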

Even if I did want to address contiguous memory issues, it's unclear how to do so. Either the seekable stream would have to be chunked (since a simple MemoryStream backing buffer has the same contiguous-memory problem), some alternate chunked string representation (like StringBuilder) would have to be used behind the scenes with string avoided everywhere, or a temporary file would have to be used as a buffer (at the expense of performance). I'm not too worried about this right now; it can be addressed later if folks start reporting OutOfMemoryException errors.

The important thing is that both ReadFiles and WriteFiles now operate on streams so that modules that deal with binary data can now do so without issue.