tahonermann / text_view

A C++ concepts and range based character encoding and code point enumeration library
MIT License

Support streaming code unit sequences by saving incomplete code unit sequences as encoding state #15

Open tahonermann opened 8 years ago

tahonermann commented 8 years ago

Consuming code unit sequences from a streaming source may result in attempts to decode a partial code unit sequence. At present, an exception is thrown when such underflow occurs. An alternative would be to store the partial code unit sequence in the iterator state and then have the iterator compare equal to the end iterator. This would enable code like the following to work correctly even if the end of a buffer does not fall on a code unit sequence boundary.

using encoding = utf8_encoding;
auto state = encoding::initial_state();
std::string b;  // Declared outside the loop so the loop condition can see it.
do {
   b = get_more_data();
   auto tv = make_text_view<encoding>(state, begin(b), end(b));
   auto tv_it = begin(tv);
   while (tv_it != end(tv))
     ...;
   state = tv_it;  // Trailing state is in tv_it, preserve it
                   // to seed state for the next iteration.
} while(!b.empty());

A problem with this approach is that it leaves open the possibility for trailing code units (e.g., garbage at the end of the encoded text) to go unnoticed. Because of this, the behavior above probably shouldn't be the default behavior, but it should be possible for code to opt in to it; perhaps via a policy class as suggested in #14.
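
To make the opt-in behavior described above concrete, here is a minimal, self-contained sketch that does not use text_view at all. The names `utf8_stream_state`, `decode_chunk`, and `finish` are hypothetical and exist only for this illustration. The sketch keeps an incomplete trailing sequence in the state object instead of throwing, and leaves it to an explicit end-of-stream check to report trailing garbage. It assumes UTF-8 input and checks only lead and continuation byte patterns; it does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical illustration; none of these names are part of text_view.
struct utf8_stream_state {
    char32_t partial = 0;  // bits gathered from an incomplete sequence
    int remaining = 0;     // continuation bytes still expected
    bool pending() const { return remaining != 0; }
};

// Decode every complete sequence in `bytes` and append it to `out`.
// An incomplete sequence at the end of `bytes` is saved in `st`
// so that the next chunk can complete it.
inline void decode_chunk(utf8_stream_state& st, std::string_view bytes,
                         std::u32string& out) {
    assert(st.remaining >= 0 && st.remaining <= 3);
    for (unsigned char b : bytes) {
        if (st.remaining > 0) {
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid continuation byte");
            st.partial = (st.partial << 6) | (b & 0x3F);
            if (--st.remaining == 0) out.push_back(st.partial);
        } else if (b < 0x80) {
            out.push_back(b);
        } else if ((b & 0xE0) == 0xC0) {
            st.partial = b & 0x1F; st.remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {
            st.partial = b & 0x0F; st.remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {
            st.partial = b & 0x07; st.remaining = 3;
        } else {
            throw std::runtime_error("invalid lead byte");
        }
    }
}

// Call once when the stream is known to be finished. This is where
// trailing garbage (a sequence that never completed) gets reported.
inline void finish(const utf8_stream_state& st) {
    if (st.pending())
        throw std::runtime_error("truncated code unit sequence at end of stream");
}
```

A caller would invoke `decode_chunk` once per buffer with the same state object and call `finish` after the last buffer. Skipping the `finish` call reproduces exactly the problem noted above: a dangling partial sequence goes unnoticed.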

ruoso commented 7 years ago

I have been thinking about this topic and wrote these two documents: https://github.com/ruoso/u5e/blob/master/StreamVsIterators.md https://github.com/ruoso/u5e/blob/master/StreamVsFormat.md

I believe it's best if there is a clearer "firewall" between raw data and text. The code handling the specific streamed protocol (such as HTTP or IRC, for instance) is in a much better position to validate the data before 'declaring' it to be text. Doing that in the iterator itself places an undue burden on everyone handling that type of code.

tahonermann commented 7 years ago

I agree that ensuring proper data boundaries in packet oriented protocols is best practice. I think there will always be cases where that isn't possible though. In those cases, the only solutions I've found so far are for the iterator to throw an exception, block (on advancement of the underlying code unit iterator), or the approach described in the first comment of this issue.

The initial email thread where I requested feedback on text_view talked about some of these options. You can find it at: https://groups.google.com/a/isocpp.org/d/msg/std-proposals/Tu84_TQOlhc/lV0MdIq1HQAJ

ruoso commented 7 years ago

My point is that introducing that support is counterproductive. It is a use case that only makes sense from a theoretical standpoint.

In practice, the industry consensus is that the only reasonable way to handle the distinction between what is semantically considered "text" and what is a "sequence of bytes" is by creating a strict type-safe firewall between code that handles text and code that handles bytes.

Any library support that weakens that firewall is not only useless (since the network layer does need to be byte-by-byte precise) but actually harmful (because it leads developers into thinking it's possible to send "text" over a socket, when reality is way more complicated than that).
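
As a rough sketch of what such a type-safe firewall could look like, the following uses a hypothetical `utf8_text` type that can only be obtained by passing bytes through a validating factory. Neither `utf8_text` nor `is_structurally_valid_utf8` comes from text_view or u5e; both are assumptions made for this illustration. The check is structural only (lead byte, length, continuation bytes) and does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper: true if `bytes` consists solely of complete,
// structurally valid UTF-8 sequences (no truncated tail).
inline bool is_structurally_valid_utf8(std::string_view bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        const std::size_t len = b < 0x80           ? 1
                              : (b & 0xE0) == 0xC0 ? 2
                              : (b & 0xF0) == 0xE0 ? 3
                              : (b & 0xF8) == 0xF0 ? 4
                                                   : 0;
        if (len == 0 || i + len > bytes.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

// Hypothetical "text" type: the only way to obtain one is from_bytes,
// so code that receives a utf8_text never sees unvalidated bytes.
class utf8_text {
public:
    static std::optional<utf8_text> from_bytes(std::string bytes) {
        if (!is_structurally_valid_utf8(bytes)) return std::nullopt;
        return utf8_text(std::move(bytes));
    }
    const std::string& str() const { return data_; }

private:
    explicit utf8_text(std::string bytes) : data_(std::move(bytes)) {
        assert(is_structurally_valid_utf8(data_));
    }
    std::string data_;
};
```

Under this design the protocol layer is responsible for framing: it assembles a complete message from the byte stream and only then calls `utf8_text::from_bytes`, treating an empty result as a protocol error.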

tahonermann commented 7 years ago

I think there are legitimate use cases. People stream text across command line pipes all the time. Granted, blocking and data loss tend not to be issues in those cases.

At any rate, addressing this issue is not high on my priority list. This issue was opened due to concerns raised in the email thread mentioned in https://github.com/tahonermann/text_view/issues/15#issuecomment-249398160