Closed jurgenvinju closed 14 hours ago
The first application of this method would be util::Webserver
and util::REPL
of the Rascal standard library. These serve, possibly enormous, IString values to Inputstreams
of the webserver Response fields. There is still the matter of mapping the Reader
to the respective Inputstream
to be resolved.
I like the fact that we return a different way to consume an IString (instead of just a iterator). It does allow people to introduce unicode bugs again. (The IString methods are all safe with regards to surrogate pairs, while Reader
is not).
how about a different approach: InputStream IString::serialize(Charset)
? that still gives you a stream of the encoded string, but avoids people doing string stuff with it.
If someone passes in a different charset than DefaultCharsets.UTF8
that's completly up to them right?
We could add a empty version that always picks UTF8, but that is quite arbitrary.
Regarding the duality: the flipping of the direction in the stream is always a bit strange to me. Instead of the REPL and the Webserver exposing the Writer/OutputStream to write to, they require an InputStream to read from, which will just cause yet another buffered read party. I actually do not like that model, so we could wonder if we want to even add that to IString. (I think the Writer
one is the cleanest api, this InputStream
/Reader
feels inverse)
It's the difference between push and pull. It depends on which end of a pipe Rascal code ends up.. so supporting both streaming from input and to output seems reasonable.
I don't have a choice to support Inputstream as well because most of the web containers wrap Inputstream in their response wrappers.
It's the difference between push and pull. It depends on which end of a pipe Rascal code ends up.. so supporting both streaming from input and to output seems reasonable.
In my opinion, it's not streaming from input since that would be a static IString IString::from(InputStream)
. The current discussed feature is not the dual of the Writer, it's more a different interface style. Maybe you could say that the character iterator is the dual of the writer?
I don't have a choice to support Inputstream as well because most of the web containers wrap Inputstream in their response wrappers.
Okay, to review:
I think we misunderstand eachother. This is a different dualism than reading vs writing. This is the difference between push and pull, for reading. Either the IString pushes to a writer, or the IString offers a reader to pull from. Depending on which side of a pipe our code ends up, we need one or the other.
IString.asReader
/IString.asInputStream
, because we can not predict what the context needs. Java also offers ByteArrayInputStreams and StringReaders in its standard library, and there is good reason for that. For the same reason, we also need InputStreams (or Readers) for IString. Okay, I would say that the character iterator is the pull, but if we really need to let's add a asInputStream
that requires a Charset argument.
Summary of why:
understood. and then we change write
to take an OutputStream and a CharSet as a parameter too right?
maybe, although writers are a lot safer to expose, you need to work quite hard to get a string out of it again.. (like there is no such thing as a readline)
and writers are quite common in a lot of our infrastructure
Ok. One more question. Why would a reader need an encoding parameter again? Our IStrings are all encoded the same way, and so wouldn't the read
method of IString::asReader always produce always the Java encoding?
Is it because the read method takes the maximum number of 16bit chars to read as parameter, and so might accidentally split a 24-bit character in half?
Is this why you propose an inputstream? How does that solve the above issue, and wouldn't the encoding again be the same? I mean our IStrings are always encoded the same, so if you read from them you get Java encoding with the surrogate pairs, not utf8 or any other encoding unless we inject some on-the-fly recoders, which seems superfluous..
First to clarify, let me recap some details about IString, java Strings and Charsets.
abc
is 3 codepoints long, 🍝
is 1 codepoint. lang.java.String
: internally represented as a UTF16 encoding of the string literal. Instead of using 2 to 4 bytes, they expose every codepoint as 1 or 2 16-bit chars. abc
is 3 chars, 🍝
is 2 chars.IString
: to avoid all issues with java chars & strings, we model our own strings and operations on them. internally we use java string, but externally we only expose full code points. Most unicode bugs that are still happening are caused by someone calling toString
and just calling method on that regular java string.now to answer your questions:
Ok. One more question. Why would a reader need an encoding parameter again? Our IStrings are all encoded the same way, and so wouldn't the read method of IString::asReader always produce always the Java encoding? && StringReader does not get a CharSet parameter either.
As long as strings remain in Java memory, encoding does not make sense. So that is why reader has no charset parameter, it doesn't make sense, a reader is all pure java strings and chars, no bytes involved.
Is it because the read method takes the maximum number of 16bit chars to read as parameter, and so might accidentally split a 24-bit character in half?
For example if you call int Reader::read()
it only returns a single java char (the int type is to be able to return -1 as a EOF signal). This will break around full-width codepoints. The other methods that take char buffers have similar issues, your code now has to handle messy stuff like the last char in a buffer being a surrogate pair.
Is this why you propose an inputstream? How does that solve the above issue, and wouldn't the encoding again be the same?
My reason to propose InputStreams are:
IString
, as that has always been a source of bugs. the Reader
interface gives you a way to easy way to get ranges of chars and strings.IString
asInputStream
function to IString, it's a bit leaky, but it would allow for faster implementations.I mean our IStrings are always encoded the same, so if you read from them you get Java encoding with the surrogate pairs, not utf8 or any other encoding unless we inject some on-the-fly recoders, which seems superfluous..
Since I don't really understand the question (sorry). I'll try to answer the question about why the inputstream need an encoding. As a string gets converted to a stream of bytes, the consumer has to choose which kind of encoding is needed. The protocol that the contents need to be send to might enforce UTF8, or UTF32, it all depends, IString cannot make a choice for all consumers of bytes.
This has been implemented.
IString::write(Writer w)
method.Writer
.Writer
in Java is aReader
IString::asReader
Implementation considerations:
asReader
method lifts on this thing, or it duplicates some of its design to replicate a readerbufferSize
to its clients usingSystem.arraycopy
, while the character iterator always goes one-by-one.