vlmutolo opened 3 years ago
I'll take a closer look at this ASAP, but for now I very much agree with leaving AsyncRead/AsyncWrite as an aside.
AEAD example
This looks very much like the high-level API to me at this point (except of course for the traits). Would we just add support for Read/Write so that the AEAD API simply supports these traits? Don't we want to consider things like the streaming API for AEAD, which could provide the same benefits when encrypting files as the hash example does?
Hash example
The approach for using the streaming interface seems good to me. I am wondering, though, if you intend to have update() take a Read argument, or if you want to only provide a from_reader method, which automatically sets the buffer size, BLAKE2b output size, etc.? In the latter case, incremental hashing of files, for example, would happen transparently to the user. However, some users might want to decide the buffer size themselves, which is only possible with the former.
Don't we want to consider things like the streaming API for AEAD, which could provide the same benefits when encrypting files as the hash example does?
Yes, I think this would be good to support. I'm unclear, though, how to directly convert between the streaming API and the Read and Write traits. Read and Write both work on streams of bytes broken up into chunks of indeterminate size, while it seems like the AEAD streaming API relies on having precisely sized and separated messages to encode/decode. Is there a standard libsodium serialization of these messages into a byte stream?
I am wondering … if you want to only provide a from_reader method, which automatically sets the buffer size, BLAKE2b output size etc?… users might want to decide buffer size themselves
For the output size, this isn't something that Orion currently exposes in the safe API. And if users want to set it, then they can grab the Blake2b type directly and use its Write implementation. The from_reader function would basically just be a safe wrapper around creating a Blake2b type with sane parameters, calling Blake2b's Write impl, and calling the std::io::copy function.
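A minimal sketch of that shape, using a stand-in hasher type since the exact Blake2b constructor and finalization aren't pinned down here:

```rust
use std::io::{self, Read, Write};

// Stand-in for the hazardous Blake2b state; a real impl would forward to update().
struct Hasher;

impl Write for Hasher {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Feed buf into the hashing state here.
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

// The wrapper itself: pick sane defaults, then let std::io::copy pump the reader
// into the hasher's Write impl.
fn from_reader(mut reader: impl Read) -> io::Result<Vec<u8>> {
    let mut hasher = Hasher;
    io::copy(&mut reader, &mut hasher)?;
    Ok(Vec::new()) // a real impl would return the finalized digest instead
}
```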
For the "buffer", I'm not sure which buffer you mean. If users want to buffer the reader, they can use a BufReader
and then pass that to the from_reader
function.
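For example, buffering is then just a matter of wrapping the reader before handing it over (the file name and buffer size are only illustrative here):

```rust
let file = File::open("file.txt")?;
let reader = BufReader::with_capacity(64 * 1024, file); // caller-chosen buffer size
let digest = from_reader(reader)?;
```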
For the "buffer", I'm not sure which buffer you mean. If users want to buffer the reader, they can use a BufReader and then pass that to the from_reader function.
I think I understand now; I initially misunderstood the purpose of from_reader().
Is there a standard libsodium serialization of these messages into a byte stream?
I see your point, in that this can be tricky. Especially since streaming writes directly to an output and has no internal state. Perhaps, in that case, each call to seal_chunk()/open_chunk() would just need a reader and a writer, as the current API suggestion does for non-streaming AEAD.
I'll have to think about this one a bit more. If we don't find a solution to this, I definitely think that we should start with adding the integration to the hashing primitives and non-streaming AEAD, and then see how that turns out. WDYT?
I was also thinking that perhaps the naming of the functions for non-streaming AEAD could be made a bit less generic. How about sealing_read()/opening_write() or seal_read()/seal_write()?
In any case, I really like the idea behind the hashing primitives so far. This should definitely go in. Regarding another io feature, I don't think this is necessary. If we just gate it behind safe-api, that should be fine, just like we do with the std::error::Error trait.
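To make the gating concrete, a small sketch (from_reader is just the name under discussion, not an existing function):

```rust
// Sits behind the existing safe-api feature, just like the std::error::Error impls.
#[cfg(feature = "safe-api")]
pub fn from_reader(reader: impl std::io::Read) -> std::io::Result<()> {
    let _ = reader; // placeholder body; the real function would return the digest
    Ok(())
}
```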
each call to seal_chunk()/open_chunk() would just need a reader and writer
Oh, interesting. Yeah, this probably would work. Instead of having a single call to e.g. sealing_read() handle all the individual streaming chunks, the user would call sealing_read once per chunk. That sounds good to me.
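A hypothetical sketch of that per-chunk shape; sealing_read here is a placeholder with an assumed signature, not an existing orion function:

```rust
use std::io::{self, Read, Write};

// Placeholder for the per-chunk call being discussed: encrypt one chunk from the
// reader and write the sealed output to the writer.
fn sealing_read(plaintext_chunk: impl Read, sealed_out: impl Write) -> io::Result<()> {
    let _ = (plaintext_chunk, sealed_out); // real encryption would happen here
    Ok(())
}

// The caller drives the loop: read a chunk, seal it, repeat until EOF.
fn seal_in_chunks(mut input: impl Read, mut output: impl Write) -> io::Result<()> {
    let mut chunk = vec![0u8; 64 * 1024]; // chunk size picked by the caller
    loop {
        let n = input.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        sealing_read(&chunk[..n], &mut output)?;
    }
    Ok(())
}
```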
If we don't find a solution to this, I definitely think that we should start with adding the integration to the hashing primitives and non-streaming AEAD
Yes, agreed. Hashing is pretty straightforward here. Basically just consume any reader and output the digest. Though I do see where the user may want to control the size of the buffer we read into (which I see now is probably what you meant above). Maybe we could do something like internally allocate a Vec<u8> to read into and grow it dynamically until reads don't fill the whole buffer. Kind of like TCP slow-start.
How about sealing_read()/opening_write() or seal_read()/seal_write()?
I like these names. Not sure about sealing_ vs seal_. They're both good. I think I like sealing_read and sealing_write better, but not by much. That's what I'll go with unless you want to switch them.
For the hash names, are you still good with from_reader or should we switch to something again less generic? Maybe digest_from_reader or consume_to_digest? I think digesting_read sounds a bit odd haha.
Though I do see where the user may want to control the size of the buffer we read into (which I see now is probably what you meant above). Maybe we could do something like internally allocate a Vec<u8> to read into and grow it dynamically until reads don't fill the whole buffer. Kind of like TCP slow-start.
Right now I'm kind of lost in all these buffers and read/write directions/sources tbh. Do you perhaps have a small example?
That's what I'll go with unless you want to switch them.
Fine by me. The only downside I see is that seal_read/open_write would be slightly more consistent with the existing seal/open method naming. I'm open to both and will let you decide.
For the hash names, are you still good with from_reader or should we switch to something again less generic?
Good point. I very much like the digest_from_reader name. Plays well with the current digest function IMO.
I think digesting_read sounds a bit odd haha.
Almost tempting haha.
Do you perhaps have a small example?
Here's what the API might look like.
```rust
let reader = File::open("file.txt")?;
let digest = orion::hash::digest_from_reader(reader)?;
```
Nice and easy. Internal to the digest_from_reader function, we could do something like the following to achieve somewhat of an "ideal" buffer size.
```rust
fn digest_from_reader(mut reader: impl Read) -> Result<Digest> {
    // Start with a buffer of 1KB (not sure what this initial value should be).
    let mut read_buffer = vec![0; 1024];

    loop {
        let bytes_read: usize = reader.read(&mut read_buffer)?;
        if bytes_read == 0 {
            break;
        }
        if bytes_read == read_buffer.len() {
            // Double the size of the buffer if we filled it up last time.
            read_buffer.extend(std::iter::repeat(0).take(read_buffer.len()));
        }
        // Hash the slice given by: &read_buffer[..bytes_read]
    }

    // Finalize and return the digest here.
}
```
That is a very straightforward API indeed. Thanks for the example! I think this is a good approach. We can figure out a good buffer size and document it clearly. Assuming users of digest_from_reader() typically pass readers with large amounts of data (an assumption which should probably also be documented if taken), we could take a buffer size of, say, 128 or 256 KiB? The reason behind this suggestion stems mostly from the fact that these are the two largest sizes used in benchmarks.
Edit: If we take such large buffers, we probably don't want to resize them dynamically though.
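In that case the earlier sketch would collapse to a single fixed allocation, something like the following (256 KiB used purely as the figure floated above):

```rust
use std::io::Read;

fn digest_from_reader(mut reader: impl Read) -> std::io::Result<()> {
    // Fixed 256 KiB buffer, never resized.
    let mut read_buffer = vec![0u8; 256 * 1024];
    loop {
        let bytes_read = reader.read(&mut read_buffer)?;
        if bytes_read == 0 {
            break;
        }
        // Hash the slice given by: &read_buffer[..bytes_read]
    }
    // Finalize and return the digest here.
    Ok(())
}
```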
Allocating a buffer of 256 KiB seems reasonable. Very few applications would be affected by that use of memory, and those that are can use the hazardous module and Blake2b's implementation of Read. And no, we probably wouldn't want to keep growing that buffer.
implementation of Read
Just making sure, but this is the Write impl, right? I saw no Read impl in the draft PR AFAIR.
Regardless, sounds like a good plan.
Yes, exactly. I meant Write.
Summary
The Rust ecosystem heavily integrates with std's Read and Write traits. It would be helpful to interoperate with these traits, especially in the context of the aead and hash modules.
Example
Hashing
AEAD
I'm definitely open to suggestions on the API described here. Particularly, I'm not sure if we'd want convenience wrappers around the *_copy functions for the AEAD case. The copy functions are fairly general, and will allow people to allocate buffers how they want. But there's probably a case to be made to provide a simpler type that just outputs a Vec of encrypted/decrypted data.
In any case, the real advantage of implementing these APIs is that someone could do, for example, the following:
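A rough sketch of that kind of usage; the hasher's construction is left out since it depends on the final API, and it only needs to implement Write:

```rust
use std::fs::File;
use std::io::{self, Write};

// `hasher` is any hash state exposing io::Write (the integration proposed here);
// how it is constructed is intentionally left out.
fn hash_file(mut hasher: impl Write) -> io::Result<()> {
    let mut file = File::open("file.txt")?;
    io::copy(&mut file, &mut hasher)?;
    Ok(())
}
```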
And it would work as expected. The big deal here is that large files should be read and hashed in pieces, which currently requires reaching into the hazardous module and using the "streaming interface" (the update method); this looks something like the sketch below. It's longer, has a little more room for user error, and in general we probably just want to support the existing IO traits.
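A rough sketch of that streaming-interface usage; the State type and its update/finalize signatures are stand-ins here, not the actual hazardous API:

```rust
use std::fs::File;
use std::io::Read;

// Stand-in for the hazardous hashing state; update()/finalize() mirror the streaming
// interface mentioned above, with assumed signatures.
struct State;
impl State {
    fn update(&mut self, _data: &[u8]) -> std::io::Result<()> { Ok(()) }
    fn finalize(&mut self) -> std::io::Result<[u8; 32]> { Ok([0u8; 32]) }
}

fn hash_file_streaming() -> std::io::Result<[u8; 32]> {
    let mut file = File::open("file.txt")?;
    let mut state = State;
    let mut buf = [0u8; 8192];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // Feed each chunk into the streaming interface.
        state.update(&buf[..n])?;
    }
    state.finalize()
}
```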
I already started the implementation, and I'll post a draft PR soon.
Also, I considered making AsyncRead and AsyncWrite part of this issue/PR, but that seems like it should be its own thing. There's talk about integrating those traits into the standard library some time soon, so maybe we should hold off until then.