Implement parsing of UTF-8 characters when using streams for tokenization.
[self unread] is used by many of the PKTokenizerState subclasses, so without
unread support, tokenization of streams is basically useless. This change adds
a circular buffer that holds the data read from the stream, and rewinds through
this buffer to handle unreads. This limits how far back a rewind can go (256
unichars by default), but that should be enough for practical purposes.
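
As an illustration of the rewind scheme described above, here is a minimal
sketch in plain C. It is not the code in this change: the History type, the
history_* functions and HISTORY_CAPACITY are hypothetical names, and the only
detail taken from this message is the 256-unit default limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; only the 256 default comes from the text. */
#define HISTORY_CAPACITY 256

typedef struct {
    uint16_t slots[HISTORY_CAPACITY]; /* most recent units read from the stream */
    size_t   head;    /* slot where the next freshly read unit is stored */
    size_t   count;   /* valid units held, at most HISTORY_CAPACITY */
    size_t   rewound; /* units that were unread and not yet replayed */
} History;

void history_init(History *h) {
    h->head = h->count = h->rewound = 0;
}

/* Record a unit that was just pulled from the underlying stream. */
void history_push(History *h, uint16_t unit) {
    assert(h->rewound == 0); /* fresh reads happen only once replay is drained */
    h->slots[h->head] = unit;
    h->head = (h->head + 1) % HISTORY_CAPACITY;
    if (h->count < HISTORY_CAPACITY) h->count++;
}

/* Step back n units; fails if that reaches past what the buffer still holds. */
bool history_unread(History *h, size_t n) {
    if (n > h->count - h->rewound) return false;
    h->rewound += n;
    return true;
}

/* Hand back one previously unread unit. Returns false when nothing is
 * pending, in which case the caller should read from the stream instead. */
bool history_replay(History *h, uint16_t *out) {
    if (h->rewound == 0) return false;
    size_t idx = (h->head + HISTORY_CAPACITY - h->rewound) % HISTORY_CAPACITY;
    *out = h->slots[idx];
    h->rewound--;
    return true;
}
```

In this sketch a read first tries history_replay and only touches the stream
(followed by history_push) when that returns false, so an unread never needs
the stream itself to seek backwards.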
The UTF-8 support brings stream tokenization to parity with string
tokenization. The latter uses -[NSString characterAtIndex:] to get UTF-16 code
units, and returns those from [self read]. For streams the parsing is not as
simple, but the result is now the same.
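
To show why the stream case takes more work, here is a rough sketch in plain C
of turning UTF-8 bytes into the UTF-16 code units that the string path returns.
It is written from the UTF-8 and UTF-16 encoding rules, not copied from this
change; utf8_decode and utf16_encode are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from buf[0..len). On success, store the code
 * point in *cp and return the number of bytes consumed (1 to 4). Return 0
 * if the sequence is malformed or truncated. A stream reader would treat
 * the truncated case as "fetch more bytes and retry". */
size_t utf8_decode(const uint8_t *buf, size_t len, uint32_t *cp) {
    size_t need = 0;
    uint32_t value = 0, min = 0;
    if (len == 0) return 0;
    uint8_t b0 = buf[0];
    if (b0 < 0x80) { *cp = b0; return 1; }
    if ((b0 & 0xE0) == 0xC0)      { need = 2; value = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; value = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; value = b0 & 0x07; min = 0x10000; }
    else return 0;                /* stray continuation or invalid lead byte */
    if (len < need) return 0;
    for (size_t i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        value = (value << 6) | (uint32_t)(buf[i] & 0x3F);
    }
    if (value < min) return 0;                       /* overlong encoding */
    if (value >= 0xD800 && value <= 0xDFFF) return 0; /* encoded surrogate */
    if (value > 0x10FFFF) return 0;                  /* beyond Unicode range */
    *cp = value;
    return need;
}

/* Encode a valid code point as UTF-16. Returns 1, or 2 for a surrogate pair. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

A code point above U+FFFF yields two units, so a reader built this way has to
return them from two consecutive reads, matching what characterAtIndex: does
for the same text.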
This adds a new field called isStreamInUTF8, which enables the UTF-8 parsing
for streams. When it is not set, the code behaves as before (returning data
byte-by-byte) for backwards compatibility.
This includes code derived from http://opensource.apple.com/source/JavaScriptCore/JavaScriptCore-7534.57.3/wtf/unicode/UTF8.cpp
That code has a BSD-style license and is marked as follows: