lifthrasiir / rust-encoding

Character encoding support for Rust
MIT License

Readers? #91

Open marcusklaas opened 9 years ago

marcusklaas commented 9 years ago

It would be convenient to have an object that implements Read, so one could for example easily and efficiently read from a file in an encoding other than utf-8.

SimonSapin commented 9 years ago

It sounds like you want something that implements not std::io::Read (which is a stream of bytes) but another trait for a Unicode stream. But as discussed in this RFC: https://github.com/rust-lang/rfcs/pull/57, doing it for reading is tricky. Read::read takes a &mut [u8] argument, writes to it, and returns the number of bytes written. Doing the same with &mut str might require zeroing the buffer first, or something similar, because the contents of a str must always be well-formed UTF-8.

I’m experimenting with things that could help here. I’ll post again when there’s something more fully formed to show.
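To illustrate why filling a &mut str is harder than filling a &mut [u8], here is a minimal sketch using only the standard library. A chunk of bytes handed back by a byte-oriented read can end in the middle of a multi-byte character, so only a prefix of it is a valid str. The helper name `valid_prefix` is hypothetical and is not part of rust-encoding or of the RFC linked above.

```rust
use std::str;

/// Hypothetical helper: splits `bytes` into the longest prefix that is
/// well-formed UTF-8 and the remaining bytes that are not (yet) valid.
fn valid_prefix(bytes: &[u8]) -> (&str, &[u8]) {
    match str::from_utf8(bytes) {
        // The whole chunk is valid, nothing is left over.
        Ok(s) => (s, &[]),
        Err(e) => {
            // `valid_up_to` is the length of the valid prefix.
            let (good, rest) = bytes.split_at(e.valid_up_to());
            // `good` was just validated, so this second check cannot fail.
            (str::from_utf8(good).unwrap(), rest)
        }
    }
}
```

For example, "né" is three bytes in UTF-8 (0x6E 0xC3 0xA9). If a read happens to return only the first two, the caller gets the str "n" plus one dangling byte that has to be kept somewhere until the next read completes the character.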

marcusklaas commented 9 years ago

Sorry for my vague description. I meant some kind of adapter between a stream of bytes in, for example, Windows-1252 and a stream of bytes in UTF-8. The Unicode stream would be very nice, but there's a lot of code that already works with std::io::Read.
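A rough sketch of what such an adapter could look like, using only the standard library. To keep it self-contained it assumes ISO-8859-1 (Latin-1) input rather than Windows-1252, because every Latin-1 byte maps directly to the Unicode code point with the same value; a real adapter would drive one of the crate's decoders instead. The names `Latin1Reader` and `decode_latin1` are hypothetical. The sketch also takes a shortcut: it refuses output buffers smaller than 2 bytes, since one input byte can expand to two UTF-8 bytes.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: reads ISO-8859-1 bytes from `inner`
/// and yields the same text as UTF-8 bytes.
struct Latin1Reader<R> {
    inner: R,
}

impl<R: Read> Read for Latin1Reader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if buf.is_empty() {
            return Ok(0);
        }
        if buf.len() < 2 {
            // Shortcut of this sketch: no room for a 2-byte character.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "output buffer must hold at least 2 bytes",
            ));
        }
        // Each input byte becomes at most 2 output bytes, so read at most
        // half as many input bytes as there is output capacity.
        let mut scratch = [0u8; 512];
        let want = (buf.len() / 2).min(scratch.len());
        let n = self.inner.read(&mut scratch[..want])?;
        let mut written = 0;
        for &b in &scratch[..n] {
            // Latin-1 byte values coincide with the first 256 code points.
            written += char::from(b).encode_utf8(&mut buf[written..]).len();
        }
        Ok(written)
    }
}

/// Drains the adapter with a fixed 64-byte buffer.
fn decode_latin1(input: &[u8]) -> io::Result<Vec<u8>> {
    let mut reader = Latin1Reader { inner: input };
    let mut out = Vec::new();
    let mut chunk = [0u8; 64];
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        out.extend_from_slice(&chunk[..n]);
    }
    Ok(out)
}
```

Because the output is again plain bytes, existing code written against std::io::Read can consume it unchanged, as long as it never passes a 1-byte buffer.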

SimonSapin commented 9 years ago

That sounds like it could be built on top of "raw" decoders.

SimonSapin commented 9 years ago

… probably with an impl of encoding::types::StringWriter for &mut [u8], to be used with the argument to Read::read.

bbigras commented 8 years ago

Any progress? Anything changed since last time that would make it easier?

mitsuhiko commented 7 years ago

I just came across the same myself. Would this be something that is in the scope of the crate?

BurntSushi commented 7 years ago

I have to write these impls for a project of mine and would also like to hear whether @lifthrasiir thinks they might be in scope for this crate.

I've also started a conversation on the encoding_rs crate: https://github.com/hsivonen/encoding_rs/issues/8

BurntSushi commented 7 years ago

To cross-pollinate a bit here from the encoding_rs crate... @SimonSapin and I worked on our own versions of Read trait implementations (except @SimonSapin did quite a bit more!). @SimonSapin's work is in this PR: https://github.com/hsivonen/encoding_rs/pull/9 My work is here: https://github.com/BurntSushi/ripgrep/blob/75f1855a91ca00b5d0e62740595b1b91bc5142a2/src/decoder.rs

The big idea here is that implementing these traits is quite tricky, and neither of our implementations is fully correct. Mine gets most of the way there, but doesn't support single-byte reads, which means the Read::bytes adapter method doesn't work at all. It's possible to make this work, but it requires a bit more book-keeping.
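As an illustration of the kind of book-keeping involved, here is a minimal standard-library sketch. It is not the ripgrep or encoding_rs code; it assumes ISO-8859-1 input, where each byte expands to at most two UTF-8 bytes, so a single leftover byte is enough state. The name `BufferedLatin1Reader` is hypothetical. A decoder for a multi-byte encoding would need to carry more state than this.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: ISO-8859-1 in, UTF-8 out, safe for 1-byte reads.
struct BufferedLatin1Reader<R> {
    inner: R,
    // Second byte of a 2-byte character that did not fit last time.
    leftover: Option<u8>,
}

impl<R: Read> BufferedLatin1Reader<R> {
    fn new(inner: R) -> Self {
        BufferedLatin1Reader { inner, leftover: None }
    }
}

impl<R: Read> Read for BufferedLatin1Reader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if buf.is_empty() {
            return Ok(0);
        }
        // Hand out the stored byte first; short reads are allowed by `Read`.
        if let Some(b) = self.leftover.take() {
            buf[0] = b;
            return Ok(1);
        }
        // Reading at most ceil(len / 2) input bytes means the output
        // overshoots `buf` by at most one byte, which goes into `leftover`.
        let mut scratch = [0u8; 512];
        let want = ((buf.len() + 1) / 2).min(scratch.len());
        let n = self.inner.read(&mut scratch[..want])?;
        let mut written = 0;
        for &b in &scratch[..n] {
            let mut utf8 = [0u8; 2];
            for &out in char::from(b).encode_utf8(&mut utf8).as_bytes() {
                if written < buf.len() {
                    buf[written] = out;
                    written += 1;
                } else {
                    // Only the final output byte can end up here.
                    self.leftover = Some(out);
                }
            }
        }
        Ok(written)
    }
}
```

With the leftover byte tracked, `BufferedLatin1Reader::new(input).bytes()` yields the UTF-8 bytes one at a time, and `read_to_string` works regardless of the buffer sizes the standard library chooses internally.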

mitsuhiko commented 7 years ago

I wonder if the traits are misdesigned for non-UTF-8 usage. It's weird that they work with both strings and bytes.

BurntSushi commented 7 years ago

In my case, I very much wanted to avoid ever materializing a &str and the costs associated with it. So operating on &[u8] is perfect.