add IString::asReader to stream _from_ strings

jurgenvinju commented 4 years ago

IString's can stream to a writer using the IString::write(Writer w) method.
This is beneficial because IString's internally can be lazily concatenated and indented, and you don't want to duplicate the entire contents just to stream the contents
But, IStrings can not be streamed from yet, only to a Writer.
The dual of a Writer in Java is a Reader
So, let's add IString::asReader

Implementation considerations:

IString has a lazy UTF32 character iterator implementation already
Either the asReader method lifts on this thing, or it duplicates some of its design to replicate a reader
The trade-off is between code-reuse and buffering efficiency; a specialized Reader could dump entire lines or blocks of characters of bufferSize to its clients using System.arraycopy, while the character iterator always goes one-by-one.
Perhaps just wrapping the reader with a BufferedReader would have the same effect, and then we might reuse the (pretty complex and distributed) character iterator implementation?
Ask @DavyLandman about this.

jurgenvinju commented 4 years ago

The first application of this method would be util::Webserver and util::REPL of the Rascal standard library. These serve, possibly enormous, IString values to Inputstreams of the webserver Response fields. There is still the matter of mapping the Reader to the respective Inputstream to be resolved.

DavyLandman commented 4 years ago

I like the fact that we return a different way to consume an IString (instead of just a iterator). It does allow people to introduce unicode bugs again. (The IString methods are all safe with regards to surrogate pairs, while Reader is not).

how about a different approach: InputStream IString::serialize(Charset)? that still gives you a stream of the encoded string, but avoids people doing string stuff with it.

jurgenvinju commented 4 years ago

even better, an InputStream.. it's just a bit weird that we'd have a Writer on the on side and an InputStream on the other. Perhaps we should change the Writer stuff to OutputStream then? I like a little consistency with those dualities.
I propose we fix the CharSet at UTF8, otherwise we'd have to wire in different encoders ourselves. Or is that easy?

DavyLandman commented 4 years ago

If someone passes in a different charset than DefaultCharsets.UTF8 that's completly up to them right?

We could add a empty version that always picks UTF8, but that is quite arbitrary.

DavyLandman commented 4 years ago

Regarding the duality: the flipping of the direction in the stream is always a bit strange to me. Instead of the REPL and the Webserver exposing the Writer/OutputStream to write to, they require an InputStream to read from, which will just cause yet another buffered read party. I actually do not like that model, so we could wonder if we want to even add that to IString. (I think the Writer one is the cleanest api, this InputStream/Reader feels inverse)

jurgenvinju commented 4 years ago

It's the difference between push and pull. It depends on which end of a pipe Rascal code ends up.. so supporting both streaming from input and to output seems reasonable.

I don't have a choice to support Inputstream as well because most of the web containers wrap Inputstream in their response wrappers.

DavyLandman commented 4 years ago

It's the difference between push and pull. It depends on which end of a pipe Rascal code ends up.. so supporting both streaming from input and to output seems reasonable.

In my opinion, it's not streaming from input since that would be a static IString IString::from(InputStream). The current discussed feature is not the dual of the Writer, it's more a different interface style. Maybe you could say that the character iterator is the dual of the writer?

I don't have a choice to support Inputstream as well because most of the web containers wrap Inputstream in their response wrappers.

Okay, to review:

originally in the REPL I didn't add this InputStream abstraction (due to performance problems), it got added for the Notebook code. So for REPLs it's not needed to add this interface, we might have to think about a nicer way to abstract for the notebooks.
For webservers: NanoHttp is quite simplistic in it's API, I've recently been using Jetty (yes, even managed to keep it in a single class without any xml), and that offers an outputstream to write the response to. We could consider switching to Jetty, that would also bring us better websocket support and all kinds of updates.

jurgenvinju commented 4 years ago

I think we misunderstand eachother. This is a different dualism than reading vs writing. This is the difference between push and pull, for reading. Either the IString pushes to a writer, or the IString offers a reader to pull from. Depending on which side of a pipe our code ends up, we need one or the other.

For the REPL we eventually also depend on NanoHttp and the interfaces for Jupyter which also use InputStreams, which is the reason why we added the InputStreams.
Switching to Jetty is an option, but not for now because it's a big impact change, and it would not help us with Jupyter, or with any other context which reads from IStrings.
We should add a IString.asReader/IString.asInputStream, because we can not predict what the context needs. Java also offers ByteArrayInputStreams and StringReaders in its standard library, and there is good reason for that. For the same reason, we also need InputStreams (or Readers) for IString.

DavyLandman commented 4 years ago

Okay, I would say that the character iterator is the pull, but if we really need to let's add a asInputStream that requires a Charset argument.

Summary of why:

The API of IString is made such that users never go back to regular java strings to do operations, adding a Reader would make it to easy to mess up again.
When you want bytes out of a string, you have to think about encoding, there is no right default to make in IString.

jurgenvinju commented 4 years ago

understood. and then we change write to take an OutputStream and a CharSet as a parameter too right?

DavyLandman commented 4 years ago

maybe, although writers are a lot safer to expose, you need to work quite hard to get a string out of it again.. (like there is no such thing as a readline)

DavyLandman commented 4 years ago

and writers are quite common in a lot of our infrastructure

jurgenvinju commented 4 years ago

Ok. One more question. Why would a reader need an encoding parameter again? Our IStrings are all encoded the same way, and so wouldn't the read method of IString::asReader always produce always the Java encoding?

Is it because the read method takes the maximum number of 16bit chars to read as parameter, and so might accidentally split a 24-bit character in half?

Is this why you propose an inputstream? How does that solve the above issue, and wouldn't the encoding again be the same? I mean our IStrings are always encoded the same, so if you read from them you get Java encoding with the surrogate pairs, not utf8 or any other encoding unless we inject some on-the-fly recoders, which seems superfluous..

jurgenvinju commented 4 years ago

StringReader does not get a CharSet parameter either.

DavyLandman commented 4 years ago

First to clarify, let me recap some details about IString, java Strings and Charsets.

String: to simplify it a bit let's keep it a list of unicode codepoints. So abc is 3 codepoints long, 🍝 is 1 codepoint.
Charset / Encoding: is a way to define how a string is represented as a sequence of bytes. For example UTF8 is a variable length encoding between 1 to 6 bytes per codepoint, UTF16 is also variable, either 2 or 4 bytes, and UTF32 is always 4 bytes per character.
lang.java.String: internally represented as a UTF16 encoding of the string literal. Instead of using 2 to 4 bytes, they expose every codepoint as 1 or 2 16-bit chars. abc is 3 chars, 🍝 is 2 chars.
Java char: 16-bit char that maps onto utf16 encoding. when the code point doesn't fit inside the first 16-bit, it is split in a high and low surrogate pair. This is the root of MANY unicode related bugs in Java, most libraries/loops assume 1 char = 1 codepoint. Emoji's are a good example of codepoints that fall outside the 1 char range.
IString: to avoid all issues with java chars & strings, we model our own strings and operations on them. internally we use java string, but externally we only expose full code points. Most unicode bugs that are still happening are caused by someone calling toString and just calling method on that regular java string.

now to answer your questions:

Ok. One more question. Why would a reader need an encoding parameter again? Our IStrings are all encoded the same way, and so wouldn't the read method of IString::asReader always produce always the Java encoding? && StringReader does not get a CharSet parameter either.

As long as strings remain in Java memory, encoding does not make sense. So that is why reader has no charset parameter, it doesn't make sense, a reader is all pure java strings and chars, no bytes involved.

Is it because the read method takes the maximum number of 16bit chars to read as parameter, and so might accidentally split a 24-bit character in half?

For example if you call int Reader::read() it only returns a single java char (the int type is to be able to return -1 as a EOF signal). This will break around full-width codepoints. The other methods that take char buffers have similar issues, your code now has to handle messy stuff like the last char in a buffer being a surrogate pair.

Is this why you propose an inputstream? How does that solve the above issue, and wouldn't the encoding again be the same?

My reason to propose InputStreams are:

I do not want to make it easier to get Java strings out of the IString, as that has always been a source of bugs. the Reader interface gives you a way to easy way to get ranges of chars and strings.
Your question seems based on APIs that require an InputStream (a stream of bytes), since we already have pull based iterator, we can do things:
1. Make a InputStream wrapper around the codepoint iterator of IString
2. Add a asInputStream function to IString, it's a bit leaky, but it would allow for faster implementations.

I mean our IStrings are always encoded the same, so if you read from them you get Java encoding with the surrogate pairs, not utf8 or any other encoding unless we inject some on-the-fly recoders, which seems superfluous..

Since I don't really understand the question (sorry). I'll try to answer the question about why the inputstream need an encoding. As a string gets converted to a stream of bytes, the consumer has to choose which kind of encoding is needed. The protocol that the contents need to be send to might enforce UTF8, or UTF32, it all depends, IString cannot make a choice for all consumers of bytes.

DavyLandman commented 14 hours ago

This has been implemented.

usethesource / vallang

add IString::asReader to stream _from_ strings #71