sliekens / Txt

A text parsing framework for .NET.

Internal buffering and support for variable width character encodings #9

Open sliekens opened 9 years ago

sliekens commented 9 years ago

The initial implementation only handles single-byte character encodings. For version 2.0, it would be nice to support multibyte character encodings.

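Supporting fixed-width multibyte encodings is trivial (UTF-32, or UTF-16 when each 2-byte code unit is treated as one character, as .NET's `char` does). Supporting variable-width encodings isn't (UTF-8, or UTF-16 once surrogate pairs have to be handled as single characters).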

The blocking issue is the need for an internal buffering mechanism that can be "unbuffered" (de-buffered?).

The ability to buffer bytes is important for variable-width encodings, where we don't know in advance how many bytes make up the next character. It's generally a good idea to size the buffer for the maximum number of bytes per character, multiplied by the number of characters (n) to be read:

buffer length = max byte count per character * n
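To make the sizing rule and the "we don't know how many bytes" problem concrete, here is a minimal sketch. It is written in Python for brevity rather than in the library's C#, and the names `MAX_BYTES_PER_CHAR`, `buffer_length` and `decode_one_char` are made up for illustration. It feeds a decoder one byte at a time until a full character comes out, which is roughly what a scanner has to do for UTF-8.

```python
import codecs

# Worst-case encoded size of one character, per encoding (illustrative table).
MAX_BYTES_PER_CHAR = {"ascii": 1, "utf-8": 4, "utf-32-le": 4}


def buffer_length(encoding: str, n: int) -> int:
    """buffer length = max byte count per character * n"""
    return MAX_BYTES_PER_CHAR[encoding] * n


def decode_one_char(data: bytes, encoding: str = "utf-8"):
    """Feed bytes one at a time until the decoder emits a character.

    Returns (character, number_of_bytes_consumed).
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for consumed, byte in enumerate(data, start=1):
        char = decoder.decode(bytes([byte]))
        if char:  # the decoder returns "" while the character is incomplete
            return char, consumed
    raise ValueError("input ended in the middle of a character")


if __name__ == "__main__":
    print(buffer_length("utf-8", 16))
    print(decode_one_char("a€".encode("utf-8")))
    print(decode_one_char("€a".encode("utf-8")))
```

The number of bytes consumed is only known after decoding, which is why the buffer has to be sized for the worst case up front.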

The ability to unbuffer bytes is important for programs that accept mixed text input and binary input. That is, we don't want to read past the end of the text and consume binary data without offering a way to release those bytes back to the calling program. This is where System.IO.StreamReader fails dramatically: every StreamReader object has an internal buffer that is not publicly visible, so bytes it has read ahead cannot be handed back to the caller.
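The same failure mode is easy to reproduce outside .NET. As an analogy only (this is Python's `io.TextIOWrapper`, not `StreamReader`), the sketch below reads a single text line from a stream that also carries binary data and then checks how far the underlying stream has advanced. The payload is made up for the demonstration.

```python
import io

# One line of text followed by three bytes of "binary" data.
text_part = b"hello\n"
binary_part = b"\x00\x01\x02"

raw = io.BytesIO(text_part + binary_part)
reader = io.TextIOWrapper(raw, encoding="utf-8", newline="")

line = reader.readline()   # we only asked for the text line
consumed = raw.tell()      # position of the underlying byte stream

if __name__ == "__main__":
    print(repr(line))
    # The text wrapper reads in chunks, so the underlying stream has moved
    # past the end of the line and into the binary data.
    print(consumed, "bytes consumed for a", len(text_part), "byte line")
```

The caller asked for 6 bytes of text, but the wrapper's private buffer now also holds the binary bytes, and there is no public way to get them back.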

With the addition of the PushbackInputStream class and the ITextScanner.Reset() method, I think everything is in place to implement reusable buffers.
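A rough sketch of the pushback idea, again in Python rather than the library's C#. The `PushbackStream` class and `read_line` helper are hypothetical and are not the library's `PushbackInputStream` or `ITextScanner`. They only illustrate reading text in chunks and returning the surplus bytes to the stream.

```python
import io


class PushbackStream:
    """Hypothetical byte stream wrapper that lets a reader give bytes back."""

    def __init__(self, inner):
        self._inner = inner
        self._pushed = b""

    def read(self, size: int) -> bytes:
        # Serve previously pushed-back bytes first, then the inner stream.
        head = self._pushed[:size]
        self._pushed = self._pushed[size:]
        if len(head) < size:
            head += self._inner.read(size - len(head))
        return head

    def unread(self, data: bytes) -> None:
        # Pushed-back bytes will be the next ones returned by read().
        self._pushed = data + self._pushed


def read_line(stream: PushbackStream, chunk_size: int = 8) -> str:
    """Read one LF-terminated UTF-8 line and push back any surplus bytes."""
    collected = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream without a line terminator
            return collected.decode("utf-8")
        collected += chunk
        # 0x0A never occurs inside a multi-byte UTF-8 sequence,
        # so searching the raw bytes is safe.
        end = collected.find(b"\n")
        if end != -1:
            stream.unread(collected[end + 1:])
            return collected[:end + 1].decode("utf-8")


if __name__ == "__main__":
    stream = PushbackStream(io.BytesIO(b"hello\n\x00\x01\x02"))
    print(repr(read_line(stream)))
    print(stream.read(3))
```

The text reader is free to over-read for efficiency, because whatever it did not use goes back to the stream and is still available to the code that handles the binary part.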

sliekens commented 9 years ago

Note to self: keep it al dente

Here's how not to do it: https://github.com/Microsoft/referencesource/blob/master/mscorlib/system/io/streamreader.cs