Add a text asynchronous reader

szymonwieloch commented 3 years ago

Tokio has the io::BufReader struct for reading buffered bytes and the io::Lines struct for reading text lines asynchronously. Quite surprisingly, it does not have any asynchronous reader that returns pieces of UTF-8 text as soon as they are available.

My code currently reads data from a byte stream. There is no guarantee that this stream will always contain valid UTF-8, and because some UTF-8 characters consist of several bytes, a read can end with only part of a character. My code cannot wait for the end of a line; I would like to obtain parts of the text as soon as they are available. It seems that io::Lines does exactly that, but it also adds line splitting. It would possibly be best to split Lines into two layers: one that just parses text and a second that splits the text into lines.
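To make the partial-character problem concrete, here is a minimal sketch using only the standard library. The `Chunk` type and the `inspect` function are hypothetical names chosen for illustration; they are not part of Tokio or of any crate mentioned in this thread. The sketch relies on `Utf8Error::error_len`, which returns `None` when the input ended in the middle of a character and `Some(_)` when the bytes can never be valid.

```rust
use std::str;

/// How one chunk of bytes looks when viewed as UTF-8 (illustrative type).
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    /// The whole chunk is valid text.
    Complete(&'a str),
    /// Valid text followed by the start of a character that was cut off.
    Truncated { text: &'a str, pending: &'a [u8] },
    /// Valid text followed by bytes that can never be valid UTF-8.
    Invalid { text: &'a str, rest: &'a [u8] },
}

/// Classifies a chunk as complete, truncated mid-character, or invalid.
fn inspect(chunk: &[u8]) -> Chunk<'_> {
    match str::from_utf8(chunk) {
        Ok(text) => Chunk::Complete(text),
        Err(err) => {
            let (valid, rest) = chunk.split_at(err.valid_up_to());
            // Everything before `valid_up_to` was already validated,
            // so this second conversion cannot fail.
            let text = str::from_utf8(valid).expect("prefix was validated");
            match err.error_len() {
                // `None` means the input ended inside a character.
                None => Chunk::Truncated { text, pending: rest },
                // `Some(_)` means an invalid sequence was found.
                Some(_) => Chunk::Invalid { text, rest },
            }
        }
    }
}
```

For example, the bytes of "café" cut off after the first byte of "é" (0xC3) would be reported as `Truncated`, while a stray 0xFF would be reported as `Invalid`.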

Please note that exactly this approach is used by this crate:

https://docs.rs/textstream/0.1.1/textstream/

Unfortunately this implementation is synchronous and will not work in my case.

eb-64-64 commented 3 years ago

I'm interested in working on this. I'll post a suggested solution soon. I will note, however, that neither encoding (which textstream uses) nor encoding-rs are at 1.* yet, which means we can't expose them in a public API, and although we could use one of them in the implementation, that would require adding an extra dependency to tokio, which may not be desired.

Darksonn commented 3 years ago

I don't think we want to add new dependencies for this feature. It's relatively simple to implement. I would probably put it in tokio_util::io.

eb-64-64 commented 3 years ago

That was my thought, since a new dependency adds more overhead and UTF-8 is not too complex to parse. One thing I'm hung up on is whether such a solution should be conceptually (although probably not literally, since Stream is still unstable) a Stream of chars (or Option<char>s, for invalid UTF-8), or a Stream of Strings, where each String is text that is yielded as soon as it is available. Although if this was in tokio_util, would it be fine to implement it as a futures::Stream? My other question is whether this should be implemented using AsyncRead or AsyncBufRead. I'm unsure of the exact properties, benefits, and drawbacks of both, though I do understand that buffered reading can be more efficient because it uses fewer syscalls. However, if the goal is to obtain bits of text as soon as possible, should it be implemented with AsyncRead, so it doesn't spend time collecting more data into a buffer?

Darksonn commented 3 years ago

I don't think a Stream is a good idea. I was imagining that you have an AsyncRead or AsyncBufRead that withholds a few bytes if it got half of a UTF-8 character.
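The "withhold a few bytes" idea could be sketched as a pure function over a buffer, independent of any async machinery. The names `sequence_len` and `forwardable_len` are hypothetical, and this is only one possible reading of the suggestion: it looks at the last three bytes at most, holds back an incomplete trailing sequence, and forwards everything else (including invalid bytes) untouched. It does not fully validate UTF-8, and a real reader would still have to flush any withheld bytes at end of input.

```rust
/// Expected total length of a multi-byte sequence starting with `byte`,
/// or `None` if `byte` is not a multi-byte lead (ASCII, continuation, invalid).
fn sequence_len(byte: u8) -> Option<usize> {
    match byte {
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Number of leading bytes of `buf` that can be forwarded without
/// splitting a character that may still be completed by later input.
fn forwardable_len(buf: &[u8]) -> usize {
    let len = buf.len();
    // An incomplete sequence is at most 3 bytes long, so only the
    // last 3 bytes need to be examined.
    for back in 1..=len.min(3) {
        let byte = buf[len - back];
        if byte & 0b1100_0000 == 0b1000_0000 {
            // Continuation byte: keep walking back to find its lead byte.
            continue;
        }
        return match sequence_len(byte) {
            // The lead byte announces more bytes than have arrived: withhold.
            Some(need) if need > back => len - back,
            // Complete sequence, ASCII, or invalid byte: forward everything.
            _ => len,
        };
    }
    // Empty buffer, or only continuation bytes at the end: nothing to withhold.
    len
}
```

With this sketch, a buffer ending in the first two bytes of "€" (0xE2 0x82) would forward everything before 0xE2 and keep the two trailing bytes for the next read.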

eb-64-64 commented 3 years ago

Oh, so we have some struct TextReader<T: Async{Read, BufRead}> that yields all bytes it receives from T, except for incomplete UTF-8 characters?

That seems better. Do you think it should be implemented as AsyncRead or AsyncBufRead? I think AsyncBufRead would be better, because with AsyncRead, I don't know how a type could inspect the bytes read to check for UTF-8 characters without having its own buffer, which would make it more of an AsyncBufRead anyways, and would introduce more copies.

Another unresolved question is how to handle invalid UTF-8. I suppose if we go with a lower-level API like the one you suggested, it's up to the user whether to use from_utf8 or from_utf8_lossy. In that case, the best solution (in my opinion) is to yield the invalid bytes immediately. Another option is to discard invalid bytes, though that might require introducing copies, and would not be a good API in my opinion because a user might want those bytes, or at least to know that those bytes were there. If we wanted to go all the way, we could introduce a configuration option that allowed a user to specify whether to yield the bytes as-is, replace them with the replacement character, discard them, panic, or do something else. But if it were up to me, I would just yield invalid bytes as-is, as soon as they were available, and then let the user decide what to do with those bytes.

luben commented 3 years ago

> Another unresolved question is how to handle invalid UTF-8.

I think the right approach is to replace the invalid sequences with the Unicode REPLACEMENT CHARACTER (U+FFFD, which is encoded in UTF-8 as 0xEF 0xBF 0xBD).

eb-64-64 commented 3 years ago

The issue with that is that I can't mutate the bytes obtained from the underlying reader, so I'd need my own buffer, which would be expensive. If I just forward invalid UTF-8 along unchanged, the user can choose: String::from_utf8_lossy to do what you describe and replace invalid bytes with the Unicode REPLACEMENT CHARACTER; std::str::from_utf8 to get an error, or to panic via unwrap or expect; or the unsafe std::str::from_utf8_unchecked if they're absolutely certain the bytes are valid UTF-8.
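As a rough illustration of those caller-side choices, the sketch below wraps the three standard-library conversions. The wrapper names `strict`, `lossy`, and `trusted` are hypothetical and exist only for this example; note that the lossy conversion is an associated function on `String`, and that the unchecked variant is `unsafe`.

```rust
use std::borrow::Cow;
use std::str;

/// Strict decoding: invalid input is reported as an error.
fn strict(bytes: &[u8]) -> Result<&str, str::Utf8Error> {
    str::from_utf8(bytes)
}

/// Lossy decoding: invalid sequences become U+FFFD. The result borrows
/// the input when it was already valid and allocates only otherwise.
fn lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Unchecked decoding: skips validation entirely.
///
/// # Safety
/// The caller must guarantee that `bytes` is valid UTF-8.
unsafe fn trusted(bytes: &[u8]) -> &str {
    // SAFETY: validity is delegated to the caller, per the contract above.
    unsafe { str::from_utf8_unchecked(bytes) }
}
```

Because the lossy path returns a `Cow`, a caller who receives valid text pays no copying cost, which fits the goal of keeping the reader itself free of an extra buffer.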

Any thoughts, Alice?

tokio-rs / tokio

Add a text asynchronous reader #3640