Closed kouvas closed 2 months ago
Thanks for the report!
I wonder if this happens because of buffering (done by Racket Mode); maybe the byte sequence is valid UTF-8, but it's falling over a boundary?
If in Emacs you (setq racket-pretty-print nil) and re-run, does the problem still occur?
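To make the boundary hypothesis concrete, here is a minimal sketch, separate from Racket Mode's actual code. It assumes a chunk read from the port happens to end in the middle of a multi-byte character. The helper name decodes-as-utf-8? is made up for illustration.

```racket
#lang racket/base

;; "λ" is two bytes in UTF-8: #xCE #xBB.
(define encoded (string->bytes/utf-8 "λ"))

;; Hypothetical helper: #t if `bstr` decodes as UTF-8,
;; #f if bytes->string/utf-8 raises exn:fail:contract.
(define (decodes-as-utf-8? bstr)
  (with-handlers ([exn:fail:contract? (lambda (e) #f)])
    (bytes->string/utf-8 bstr)
    #t))

(decodes-as-utf-8? encoded)                ; the whole sequence
(decodes-as-utf-8? (subbytes encoded 0 1)) ; only the first byte of the pair
```

The full sequence decodes, while the one-byte prefix does not, even though the stream as a whole is valid UTF-8.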
I wonder if I could get a more-minimal example, one that doesn't require having all those packages installed. How you could do this, I think -- if it's not too inconvenient:
page
to (require racket/pretty) (pretty-print page)
.racket
.```
code fence blocks. But if that's a PITA, or the resulting value is too large or too sensitive, then no worries.
I've also encountered this issue recently, and I agree that it looks like some kind of boundary issue. Here's a file that reproduces the problem:
Generated by:
(require net/http-easy)
(response-xexpr (get "https://daringfireball.net/thetalkshow/rss"))
Running
(call-with-input-file "test.txt" read)
in a brand new racket-mode repl reproduces the issue for me. Wrapping the result in (void ...) avoids the problem. The problem does seem to occur even when racket-pretty-print is nil.
Awesome, thanks!
I think the fix may be as simple as:
modified racket/print.rkt
@@ -52,7 +52,7 @@
(let loop ()
(match (read-bytes-avail! buffer in)
[(? exact-nonnegative-integer? len)
- (define v (bytes->string/utf-8 (subbytes buffer 0 len)))
+ (define v (bytes->string/latin-1 buffer #f 0 len))
(repl-output-value v)
(loop)]
[(? procedure? read-special)
I'm not sure what I was thinking, because bytes->string/utf-8 isn't even needed here -- Racket print has already done any UTF-8 conversion. So not only is it unnecessary, it can cause this boundary problem.
I haven't looked into how the print module works, so I might be totally wrong, but assuming the results of pretty-print or print are being written into the byte string, then the contents of the byte string are going to be utf-8 encoded, so it seems right to want to decode them. Trying to decode them as latin-1 probably avoids the exception but decodes the wrong data at a boundary[1]. Ideally, the contents of buf would get written to the other end directly without any decoding, but if repl-output-value has to take a string, then it might be better to check if (bytes-ref buffer (sub1 len)) is a continuation byte (i.e. its most-significant bit is 1) and wait for more data before performing the conversion.
[1]:
> (define buf (string->bytes/utf-8 "λ"))
> (for ([i (in-range 2)]) ;; assuming read-bytes-avail! returns one byte at a time
(displayln (bytes->string/latin-1 buf #f i (add1 i))))
Î
»
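The "wait for more data" idea could be sketched roughly as follows. This is not the code in racket/print.rkt, and the names decode-complete-prefix and decode-chunks are hypothetical. Instead of inspecting the high bit of the last byte, it uses bytes-utf-8-length to find the longest prefix that is complete UTF-8 and carries at most 3 trailing bytes over to the next read. As a labeled assumption, bytes that are invalid (not just incomplete) are replaced with U+FFFD.

```racket
#lang racket/base

;; Decode as much of `chunk` as forms complete UTF-8 characters.
;; Returns two values: the decoded string and the leftover bytes
;; (an incomplete trailing sequence, at most 3 bytes).
(define (decode-complete-prefix chunk)
  (define len (bytes-length chunk))
  ;; Try dropping 0..3 trailing bytes until the prefix is valid UTF-8.
  (define end
    (for/first ([drop (in-range 0 (min 4 (add1 len)))]
                #:when (bytes-utf-8-length chunk #f 0 (- len drop)))
      (- len drop)))
  (cond
    [end (values (bytes->string/utf-8 chunk #f 0 end)
                 (subbytes chunk end))]
    ;; Assumption: the data is invalid, not just split; substitute U+FFFD.
    [else (values (bytes->string/utf-8 chunk #\uFFFD) #"")]))

;; Feed a list of chunks through the decoder, prepending leftovers
;; from the previous chunk. Returns the decoded text and any final leftover.
(define (decode-chunks chunks)
  (define out (open-output-string))
  (define leftover
    (for/fold ([pending #""]) ([c (in-list chunks)])
      (define-values (str remaining)
        (decode-complete-prefix (bytes-append pending c)))
      (write-string str out)
      remaining))
  (values (get-output-string out) leftover))

;; Simulate read-bytes-avail! returning "λ" one byte at a time.
(define-values (text pending-bytes)
  (decode-chunks (list (bytes #xCE) (bytes #xBB))))
(displayln text)
```

With the one-byte-at-a-time input from the footnote, the first chunk yields an empty string and the second completes the character, so nothing is decoded as the wrong data.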
OK, there are a few levels of buffering going on here, and I need to reload my brain with some of the details. For example it's possible the right answer is to preserve these as bytes, at this stage, and attempt conversion only later. I'll give it a think...
p.s. A quick hack would be to supply a non-false error-char to bytes->string/utf-8. That would prevent the crash, but it would leave "unknown" characters in the output, gratuitously IIUC. So I need to think through the whole thing.
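For illustration, here is a small sketch of what the error-char hack would do to a character split across two reads. The one-byte-at-a-time split and the choice of #\? as the error character are assumptions made for the example.

```racket
#lang racket/base

(define encoded (string->bytes/utf-8 "λ"))

;; Decode each byte separately, as if read-bytes-avail! had returned
;; them one at a time, substituting #\? for bytes that do not decode.
(define pieces
  (for/list ([i (in-range (bytes-length encoded))])
    (bytes->string/utf-8 encoded #\? i (add1 i))))

(define result (apply string-append pieces))
(displayln result)
```

No exception is raised, but the output holds substitute characters in place of the original λ, even though the underlying byte stream was valid.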
Steps to reproduce:
The code
only when sending page to repl, it produces this error