rust-lang / git2-rs

libgit2 bindings for Rust
https://docs.rs/git2
Apache License 2.0
1.67k stars 384 forks source link

Support creating commits etc. with invalid utf-8 #371

Open mystor opened 5 years ago

mystor commented 5 years ago

Currently there are a few object types which provide getters for retrieving data as &[u8], as the data may be invalid utf-8, but don't provide the equivalent setters or constructors, for example:

  1. Commits have message_bytes(&self) -> &[u8], but Repository::commit(...) only takes the message as &str
  2. Signatures have name_bytes and email_bytes, but Signature::new and Signature::now only accept &str
  3. Config has get_bytes, but no set_bytes
    • In addition, the keys cannot be bytes for get_* or set_* on Config, making non-utf8 keys non-settable from git2-rs.

This seems to apply to most objects. For example, it also affects References, Reflogs, etc.

The main reason why this seems good to handle is to support round-tripping an object through git2 and back into git. If an object can exist in the repository, it seems like there should be some mechanism for creating that object through the git2 API.

For a small project I'm working on I'm mostly impacted by commits and signatures, in that while I can support reading invalid utf-8, once I try to modify one of these objects, I cannot recreate certain properties faithfully without using libgit2-sys directly, so I'm currently falling back to doing lossy utf-8 conversions.

I imagine the easiest/best(?) solution here might be to follow CString's API, and change these methods to take S: Into<Vec<u8>> rather than &str, but I think that would be a breaking change due to type inference(?)

alexcrichton commented 5 years ago

An excellent point here! I agree that generics are likely the best way to go here, and I think the next version of this crate will be breaking anyway so seems fine to me to switch to generics!

mystor commented 5 years ago

Ok - I started poking at doing this. There are some unfortunate properties of doing generics. For most functions, we can just use something like this:

impl<'a> Signature<'a> {
    fn now(name: impl IntoCString, email: impl IntoCString) -> Result<Signature<'a>, Error> { ... }
}

Which is still fairly clear. Unfortunately, this breaks down with many of the functions which take an optional string. For example:

impl<'repo> Commit<'repo> {
    pub fn amend(
        &self, 
        update_ref: Option<impl IntoCString>, 
        author: Option<&Signature>, 
        committer: Option<&Signature>, 
        message_encoding: Option<impl IntoCString>, 
        message: Option<impl IntoCString>, 
        tree: Option<&Tree<'repo>>
    ) -> Result<Oid, Error> { ... }
}

If we do the above, we break type inference for code passing None, which will now have to pass None as Option<String> or something like that. There are a few other options here:

We could use an IntoOptCString trait which is implemented like:

impl<T: IntoCString> IntoOptCString for Option<T> {
    fn into_opt_c_string(self) -> Result<Option<CString>, Error> { ... }
}

impl IntoOptCString for () {
    fn into_opt_c_string(self) -> Result<Option<CString>, Error> { Ok(None) }
}

Which would require callers to change None-passing callsites to pass () instead. We could also make a NoCString marker struct if that's cleaner.

Alternatively, we could directly accept &[u8] (or &CStr) as an argument type everywhere for consistency, requiring callsites with &str values to add a .as_bytes() (or CString::new(...)). Adding explicit _bytes versions of everything is also an option, albeit an unfortunate one.

I'm personally inclined to changing the default input & output types for these strings to &[u8]. it's definitely a more breaking change, but I think it's more accurate to the actual data.

alexcrichton commented 5 years ago

Ah right, that sounds like a familiar problem!

I'm a little worried about taking bytes everywhere though because it hurts the ergonomics of the library quite a bit (I feel at least) with only very niche benefit. I think I'd prefer to keep going the generic route but figure out a different strategy for optional arguments. Perhaps multiple methods with different argument lists or a builder-style pattern for these functions?

ijackson commented 3 years ago

Is there some reason not to do something like this:

impl<'repo> Commit<'repo> {
    pub fn amend(
        &self, 
        update_ref: Option<&dyn AsRef<[u8]>>, 

The only downside I can see is that this forces the library to always re-allocate and copy the string, even when it was an owned CString, but I doubt we care too much about that?

RalfJung commented 1 year ago

Looks like currently it is possible to create unsigned non-UTF8-objects via Odb::write, but creating a signed commit via commit_signed requires a str and therefore valid UTF-8.