itod / pegkit

'Parsing Expression Grammar' toolkit for Cocoa/Objective-C
MIT License

Implement [self unread] when using streams for tokenization. #25

Open ewanmellor opened 9 years ago

ewanmellor commented 9 years ago

Implement parsing of UTF-8 characters when using streams for tokenization.

[self unread] is used by many of the PKTokenizerState subclasses, so without it, tokenization of streams is basically useless. This change adds a circular buffer that records the data read from the stream, and rewinds through that buffer to handle unreads. This limits how far back a rewind can go (256 unichars by default), but that should be OK for practical purposes.
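As a rough illustration of the circular-buffer idea (not the code in this PR), here is a minimal C sketch. The names RewindReader, reader_init, reader_read, and reader_unread are hypothetical, the "stream" is just an in-memory array, and uint16_t stands in for unichar. Only the 256-unit history size is taken from the description above.

```c
#include <stddef.h>
#include <stdint.h>

#define HISTORY_CAPACITY 256   /* rewind limit, mirroring the default mentioned above */
#define READER_EOF (-1)

typedef uint16_t unit_t;       /* stand-in for unichar (assumption) */

typedef struct {
    const unit_t *src;         /* hypothetical in-memory source standing in for a stream */
    size_t src_len;
    size_t src_pos;
    unit_t history[HISTORY_CAPACITY]; /* circular buffer of the most recent units */
    size_t total;              /* units pulled from the source so far */
    size_t rewound;            /* how many units we are currently rewound by */
} RewindReader;

void reader_init(RewindReader *r, const unit_t *src, size_t len)
{
    r->src = src;
    r->src_len = len;
    r->src_pos = 0;
    r->total = 0;
    r->rewound = 0;
}

/* Returns the next unit, replaying from the history if we are rewound. */
int reader_read(RewindReader *r)
{
    if (r->rewound > 0) {
        size_t idx = (r->total - r->rewound) % HISTORY_CAPACITY;
        r->rewound--;
        return r->history[idx];
    }
    if (r->src_pos >= r->src_len) {
        return READER_EOF;
    }
    unit_t u = r->src[r->src_pos++];
    r->history[r->total % HISTORY_CAPACITY] = u;  /* overwrite the oldest slot */
    r->total++;
    return u;
}

/* Steps back one unit. Returns 1 on success, or 0 if the rewind would go
   past what the history still holds. */
int reader_unread(RewindReader *r)
{
    size_t available = r->total < HISTORY_CAPACITY ? r->total : HISTORY_CAPACITY;
    if (r->rewound >= available) {
        return 0;
    }
    r->rewound++;
    return 1;
}
```

In this sketch, reading "a" and "b", calling reader_unread twice, and reading again yields "a" and "b" a second time before the reader continues with fresh data from the source. A third reader_unread at that point fails, because nothing older is held in the history.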

The UTF-8 support brings stream tokenization up to the same level of support as strings. The latter uses NSString.characterAtIndex to get UTF-16 code units, and returns those from [self read]. For streams the parsing is not as simple, but the result is now the same.

This adds a new field, isStreamInUTF8, which enables the UTF-8 parsing for streams. When it is not set, the code behaves as before (returning data byte by byte) for backwards compatibility.
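To make the two modes concrete, here is a small C sketch of a decoder that either passes bytes through unchanged or decodes UTF-8 into UTF-16 code units. It is an illustration only, not the code in this PR and not the JavaScriptCore routine it derives from. The function name next_units and its signature are hypothetical, and the is_utf8 parameter merely plays the role of isStreamInUTF8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reads one character starting at bytes[*pos] and writes it to out as one or
   two UTF-16 code units. Returns the number of units written, or 0 at end of
   input or on malformed/truncated UTF-8 (in which case *pos is unchanged).
   With is_utf8 == false, each byte is returned as-is (the legacy behaviour). */
size_t next_units(const uint8_t *bytes, size_t len, size_t *pos,
                  bool is_utf8, uint16_t out[2])
{
    if (*pos >= len) {
        return 0;
    }
    uint8_t b0 = bytes[*pos];
    if (!is_utf8 || b0 < 0x80) {
        out[0] = b0;
        *pos += 1;
        return 1;
    }

    size_t extra;    /* number of continuation bytes expected */
    uint32_t cp;     /* code point being assembled */
    uint32_t min;    /* smallest code point allowed for this length */
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min = 0x10000; }
    else return 0;   /* stray continuation byte or invalid lead byte */

    if (len - *pos <= extra) {
        return 0;    /* sequence is cut off */
    }
    for (size_t i = 1; i <= extra; i++) {
        uint8_t b = bytes[*pos + i];
        if ((b & 0xC0) != 0x80) {
            return 0;
        }
        cp = (cp << 6) | (uint32_t)(b & 0x3F);
    }
    /* Reject overlong forms, out-of-range values, and encoded surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    *pos += extra + 1;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;   /* split into a surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

For the bytes E2 82 AC (the euro sign), the sketch returns the single unit 0x20AC in UTF-8 mode, while byte mode returns 0xE2 and advances by one byte. A real stream reader would also need to carry a partial sequence across buffer boundaries, which this sketch does not attempt.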

This includes code derived from http://opensource.apple.com/source/JavaScriptCore/JavaScriptCore-7534.57.3/wtf/unicode/UTF8.cpp

That code has a BSD-style license and is marked as follows: