larcenists / larceny

Larceny Scheme implementation

Incorrect UTF-8 invalid code point replacement handling #726

Closed ktakashi closed 9 years ago

ktakashi commented 9 years ago

I think the following script should print #\g twice, but it prints #\newline twice.

(import (rnrs))

(define buf-size 10)
(define bv (make-bytevector buf-size (char->integer #\a)))
(define (bytevector-append . bvs)
  (let* ((len (fold-left (lambda (sum bv) 
                           (+ (bytevector-length bv) sum)) 0 bvs))
         (r (make-bytevector len)))
    (fold-left (lambda (off bv)
                 (let ((len (bytevector-length bv)))
                   (bytevector-copy! bv 0 r off len)
                   (+ off len)))
               0 bvs)
    r))

(let ((bv2 (bytevector-append bv #vu8(#xe0 #x67 #x0a))))
  (call-with-port (transcoded-port 
                   (open-bytevector-input-port bv2) 
                   (make-transcoder (utf-8-codec)
                                    (eol-style lf)
                                    (error-handling-mode replace)))
    (lambda (in)
      (get-string-n in (+ 1 buf-size)) ;; read until invalid code point
      (write (get-char in)) (newline)))
  (call-with-port (transcoded-port 
                   (open-bytevector-input-port #vu8(#xe0 #x67 #x0a))
                   (make-transcoder (utf-8-codec)
                                    (eol-style lf)
                                    (error-handling-mode replace)))
    (lambda (in)
      (get-char in)
      (write (get-char in)) (newline))))
(flush-output-port (current-output-port))

Version: Larceny v0.98 "General Ripper" (Mar 7 2015 01:06:26, precise:Linux:unified)

WillClinger commented 9 years ago

Thank you for your report.

The specification of error-handling-mode in R6RS library section 8.2.4 says

If a textual input operation encounters an invalid or incomplete character encoding, and the error-handling mode is ignore, an appropriate number of bytes of the invalid encoding are ignored and decoding continues with the following bytes. If the error-handling mode is replace, the replacement character U+FFFD is injected into the data stream, an appropriate number of bytes are ignored, and decoding continues with the following bytes.

What does "an appropriate number of bytes" mean here? For UTF-8, there are at least three plausible interpretations of that phrase:

  1. Ignore just one byte, and resume reading with the byte that follows the first byte of the illegal sequence of bytes.
  2. Ignore the number of bytes implied by the first byte of the illegal sequence.
  3. Ignore all bytes of the illegal sequence up to and including the byte at which the illegality of the sequence can be detected, and resume reading with the following byte. (This interpretation corresponds to the viable prefix property of LL and LR parsers.)

All of those interpretations seem to be allowed by the R6RS standard. With the first interpretation, the test program writes #\g twice. With the second interpretation, the test program writes an end-of-file object twice. With the third interpretation, the test program writes #\newline twice.

So this is not a bug. It's just an example of a program whose behavior is not fully specified by the R6RS standard.
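
To make the three interpretations concrete, here is a minimal sketch that models them on the bytes from the report (#xe0 #x67 #x0a). It is not Larceny's decoder. The names implied-length, continuation?, bytes-to-ignore and char-after-replacement, and the policy symbols one-byte, implied and detected, are hypothetical and introduced only for illustration. The sketch assumes that the only kind of error is a missing continuation byte and that the byte following the ignored ones is ASCII.

```scheme
(import (rnrs))

;; Length of the sequence that a UTF-8 lead byte announces.
(define (implied-length lead)
  (cond ((< lead #x80) 1)
        ((= (bitwise-and lead #xe0) #xc0) 2)
        ((= (bitwise-and lead #xf0) #xe0) 3)
        ((= (bitwise-and lead #xf8) #xf0) 4)
        (else 1)))                       ; stray continuation or invalid lead

;; True if b has the form 10xxxxxx.
(define (continuation? b)
  (= (bitwise-and b #xc0) #x80))

;; Number of bytes to ignore for an ill-formed sequence starting at index i.
;; policy is one of the hypothetical symbols one-byte, implied, detected.
(define (bytes-to-ignore policy bv i)
  (let ((n (bytevector-length bv))
        (want (implied-length (bytevector-u8-ref bv i))))
    (case policy
      ((one-byte) 1)                     ; interpretation 1
      ((implied) (min want (- n i)))     ; interpretation 2
      ((detected)                        ; interpretation 3
       (let loop ((k 1))
         (cond ((>= (+ i k) n) k)        ; input ended inside the sequence
               ((>= k want) k)           ; all continuation bytes were present
               ((continuation? (bytevector-u8-ref bv (+ i k)))
                (loop (+ k 1)))
               (else (+ k 1)))))         ; include the byte that exposed the error
      (else (error 'bytes-to-ignore "unknown policy" policy)))))

;; What get-char would return right after the replacement character,
;; assuming the next byte (if any) is ASCII.
(define (char-after-replacement policy bv)
  (let ((j (bytes-to-ignore policy bv 0)))
    (if (< j (bytevector-length bv))
        (integer->char (bytevector-u8-ref bv j))
        (eof-object))))

(for-each (lambda (policy)
            (write (char-after-replacement policy #vu8(#xe0 #x67 #x0a)))
            (newline))
          '(one-byte implied detected))
```

Under these assumptions the three policies yield, in order, the character g, an end-of-file object, and a newline character, which are the three outcomes described above. The skip count is the only thing that differs between the policies, so the choice made by an implementation fully determines which of the three results the test program prints.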