usethesource / vallang

Generic immutable recursive data representation API targeted at source code models and more.
Other
36 stars 12 forks source link

add IString::asReader to stream _from_ strings #71

Closed jurgenvinju closed 14 hours ago

jurgenvinju commented 4 years ago

Implementation considerations:

jurgenvinju commented 4 years ago

The first application of this method would be util::Webserver and util::REPL of the Rascal standard library. These serve, possibly enormous, IString values to Inputstreams of the webserver Response fields. There is still the matter of mapping the Reader to the respective Inputstream to be resolved.

DavyLandman commented 4 years ago

I like the fact that we return a different way to consume an IString (instead of just a iterator). It does allow people to introduce unicode bugs again. (The IString methods are all safe with regards to surrogate pairs, while Reader is not).

how about a different approach: InputStream IString::serialize(Charset)? that still gives you a stream of the encoded string, but avoids people doing string stuff with it.

jurgenvinju commented 4 years ago
DavyLandman commented 4 years ago

If someone passes in a different charset than DefaultCharsets.UTF8 that's completly up to them right?

We could add a empty version that always picks UTF8, but that is quite arbitrary.

DavyLandman commented 4 years ago

Regarding the duality: the flipping of the direction in the stream is always a bit strange to me. Instead of the REPL and the Webserver exposing the Writer/OutputStream to write to, they require an InputStream to read from, which will just cause yet another buffered read party. I actually do not like that model, so we could wonder if we want to even add that to IString. (I think the Writer one is the cleanest api, this InputStream/Reader feels inverse)

jurgenvinju commented 4 years ago

It's the difference between push and pull. It depends on which end of a pipe Rascal code ends up.. so supporting both streaming from input and to output seems reasonable.

I don't have a choice to support Inputstream as well because most of the web containers wrap Inputstream in their response wrappers.

DavyLandman commented 4 years ago

It's the difference between push and pull. It depends on which end of a pipe Rascal code ends up.. so supporting both streaming from input and to output seems reasonable.

In my opinion, it's not streaming from input since that would be a static IString IString::from(InputStream). The current discussed feature is not the dual of the Writer, it's more a different interface style. Maybe you could say that the character iterator is the dual of the writer?

I don't have a choice to support Inputstream as well because most of the web containers wrap Inputstream in their response wrappers.

Okay, to review:

  1. originally in the REPL I didn't add this InputStream abstraction (due to performance problems), it got added for the Notebook code. So for REPLs it's not needed to add this interface, we might have to think about a nicer way to abstract for the notebooks.
  2. For webservers: NanoHttp is quite simplistic in it's API, I've recently been using Jetty (yes, even managed to keep it in a single class without any xml), and that offers an outputstream to write the response to. We could consider switching to Jetty, that would also bring us better websocket support and all kinds of updates.
jurgenvinju commented 4 years ago

I think we misunderstand eachother. This is a different dualism than reading vs writing. This is the difference between push and pull, for reading. Either the IString pushes to a writer, or the IString offers a reader to pull from. Depending on which side of a pipe our code ends up, we need one or the other.

  1. For the REPL we eventually also depend on NanoHttp and the interfaces for Jupyter which also use InputStreams, which is the reason why we added the InputStreams.
  2. Switching to Jetty is an option, but not for now because it's a big impact change, and it would not help us with Jupyter, or with any other context which reads from IStrings.
  3. We should add a IString.asReader/IString.asInputStream, because we can not predict what the context needs. Java also offers ByteArrayInputStreams and StringReaders in its standard library, and there is good reason for that. For the same reason, we also need InputStreams (or Readers) for IString.
DavyLandman commented 4 years ago

Okay, I would say that the character iterator is the pull, but if we really need to let's add a asInputStream that requires a Charset argument.

Summary of why:

  1. The API of IString is made such that users never go back to regular java strings to do operations, adding a Reader would make it to easy to mess up again.
  2. When you want bytes out of a string, you have to think about encoding, there is no right default to make in IString.
jurgenvinju commented 4 years ago

understood. and then we change write to take an OutputStream and a CharSet as a parameter too right?

DavyLandman commented 4 years ago

maybe, although writers are a lot safer to expose, you need to work quite hard to get a string out of it again.. (like there is no such thing as a readline)

DavyLandman commented 4 years ago

and writers are quite common in a lot of our infrastructure

jurgenvinju commented 4 years ago

Ok. One more question. Why would a reader need an encoding parameter again? Our IStrings are all encoded the same way, and so wouldn't the read method of IString::asReader always produce always the Java encoding?

Is it because the read method takes the maximum number of 16bit chars to read as parameter, and so might accidentally split a 24-bit character in half?

Is this why you propose an inputstream? How does that solve the above issue, and wouldn't the encoding again be the same? I mean our IStrings are always encoded the same, so if you read from them you get Java encoding with the surrogate pairs, not utf8 or any other encoding unless we inject some on-the-fly recoders, which seems superfluous..

jurgenvinju commented 4 years ago
DavyLandman commented 4 years ago

First to clarify, let me recap some details about IString, java Strings and Charsets.

now to answer your questions:

Ok. One more question. Why would a reader need an encoding parameter again? Our IStrings are all encoded the same way, and so wouldn't the read method of IString::asReader always produce always the Java encoding? && StringReader does not get a CharSet parameter either.

As long as strings remain in Java memory, encoding does not make sense. So that is why reader has no charset parameter, it doesn't make sense, a reader is all pure java strings and chars, no bytes involved.

Is it because the read method takes the maximum number of 16bit chars to read as parameter, and so might accidentally split a 24-bit character in half?

For example if you call int Reader::read() it only returns a single java char (the int type is to be able to return -1 as a EOF signal). This will break around full-width codepoints. The other methods that take char buffers have similar issues, your code now has to handle messy stuff like the last char in a buffer being a surrogate pair.

Is this why you propose an inputstream? How does that solve the above issue, and wouldn't the encoding again be the same?

My reason to propose InputStreams are:

  1. I do not want to make it easier to get Java strings out of the IString, as that has always been a source of bugs. the Reader interface gives you a way to easy way to get ranges of chars and strings.
  2. Your question seems based on APIs that require an InputStream (a stream of bytes), since we already have pull based iterator, we can do things:
    1. Make a InputStream wrapper around the codepoint iterator of IString
    2. Add a asInputStream function to IString, it's a bit leaky, but it would allow for faster implementations.

I mean our IStrings are always encoded the same, so if you read from them you get Java encoding with the surrogate pairs, not utf8 or any other encoding unless we inject some on-the-fly recoders, which seems superfluous..

Since I don't really understand the question (sorry). I'll try to answer the question about why the inputstream need an encoding. As a string gets converted to a stream of bytes, the consumer has to choose which kind of encoding is needed. The protocol that the contents need to be send to might enforce UTF8, or UTF32, it all depends, IString cannot make a choice for all consumers of bytes.

DavyLandman commented 14 hours ago

This has been implemented.