oll3 / bita

Differential file synchronization over http
https://crates.io/crates/bita
MIT License
265 stars 10 forks source link

Add custom kvp metadata to chunk dictionary #51

Closed caesay closed 4 weeks ago

caesay commented 1 month ago

This PR contains the following changes:

Possible improvements?

oll3 commented 1 month ago

Great improvements, and good fix to the info command too! Will go through it in more detail later but here are some initial notes.

the metadata dictionary is not deterministic (ideally we should sort either by CLI order, or alphabetically. the latter would be easier).

Yes, I do think the map of kvp should have deterministic order, preferably ordered as when inserted. And since the protobuf map is really just a repeated field of kvp's the order in the dictionary should be stable so it's just a question of how to map it on the Rust side. Think you can configure prost to map map fields into a BTreeMap, or possibly just a plain Vec of pairs. Vec is probably sufficient until someone adds hundreds/thousands of pairs.

not sure how I feel about separate --metadata-file and --metadata-value arguments.

Personally I would find it to be enough with compress --metadata-value and one could provide files or values using the shell. But I'm ok with having a separate argument for reading directly from a file too.

I should probably put this in a contribution description somewhere... But if/when you do modifications to a PR please just modify each commit which is affected. Like switching to a BTreeMap would belong in the Add metadata commit etc.

caesay commented 1 month ago

Regarding --metadata-file <key> <path> and --metadata-value <key> <value>, out of the two, the more important one is --metadata-file, because that is the only way you can reliably provide arbitrary binary data. I doubt that you'll be able to pass a random binary file as an inline command line argument. (certainly not going to happen on Windows)

I tried switching the HashMap to a BTreeMap as suggested:

       prost_build::Config::new()
            .btree_map(&["."])
            .compile_protos(&["proto/chunk_dictionary.proto"], &["proto/"])
            .unwrap();

It kind of worked, in the sense that all of our code after clap did preserve the order in which we initially constructed the first BTreeMap. However, clap itself seems to have no consistency in the order in which it gives the command line arguments. Additionally, even if clap gave us the arguments in the order in which they were specified (it doesn't) I don't know how we'll preserve the order between the two different arguments, because clap doesn't give us the "position" of the argument. Hope that makes sense?

oll3 commented 1 month ago

Regarding --metadata-file <key> <path> and --metadata-value <key> <value>, out of the two, the more important one is --metadata-file, because that is the only way you can reliably provide arbitrary binary data. I doubt that you'll be able to pass a random binary file as an inline command line argument. (certainly not going to happen on Windows)

I can see that it may provide some value so I'm ok with providing both.

It kind of worked, in the sense that all of our code after clap did preserve the order in which we initially constructed the first BTreeMap. However, clap itself seems to have no consistency in the order in which it gives the command line arguments.

I see. I would've expected clap to give us the values in order of occurrence but maybe it doesn't. I didn't really find anything in the docs mentioning the order of multiple values and haven't tested for myself. About the order between -file and -value I can see that the order between them will be lost, as you did mention.

But then, for now at least, I think it's enough if the bitar api for building and reading/iterating the kvps keeps the build time order. E.g one should be able to iterate the kvps in the same order as they were provided when building the archive.

Cli argument order could potentially be fixed later. For now, maybe just order the pairs on key name on the cli side?

Other than that I think I would prefer a regular Vec (like Vec<(String, Vec<u8>)>) to hold the kvps after decoding the dictionary. Mainly since I reads well that there is a order, it's performant and simple compared to a BTreeMap/HashMap. If a user has lots of values to look up one can collect them into a HashMap outside of the library. So basically I think the metadata fn on the dictionary could just be something like:

  fn metadata(&self) -> &[(&str, &[u8])]

And the simplest way to achieve this is probably to make a repeated field in protobuf. Since a protobuf map is really stored the same way it's just a question of how prost will decode it.


message ChunkDictionary {
  ...
  repeated KeyValuePair metadata = 8;
}

message KeyValuePair {
  string key = 1;
  bytes value = 2;
}

``
Does this make any sense to you?
caesay commented 1 month ago

I think that sorting the keys alphabetically would be the best way to make sure they are deterministic. I can make that change.

Other than that I think I would prefer a regular Vec (like Vec<(String, Vec)>) to hold the kvps after decoding the dictionary.

I don't think performance is really a concern here (it's going to be fast enough either way) but switching to a Vec of key value pairs is going to be slower and will make consuming code more complex. We need at least one extra mapping in the library (to map to/from the protobuf types) and all the use cases require a hashmap anyways, so to support the info command we're going to need to collect the Vec into a hashmap anyway. I'll need to do the same in my own usecases: there's no value in having a Vec when you need to retrieve a value by key.

oll3 commented 1 month ago

I'll need to do the same in my own usecases: there's no value in having a Vec when you need to retrieve a value by key.

Fair enough. So would storing in a BTreeMap and have an api something like below be sufficient you think?

    fn metadata_iter(&self) -> impl Iterator<Item=(&str, &[u8])>
    fn metadata_value(&self, key: &str) -> Option<&[u8]>
caesay commented 4 weeks ago

I have made the discussed changes, please let me know if there's anything further!

oll3 commented 4 weeks ago

Think it looks good!

Was thinking that the metadata length value in the info command should use a unit (eg use human_size!) but realized it looked a bit busy when unit it's repeated for every kvp when we have multiple kvps, so I think this is fine. Also a metadata_len might have been good to check number of kvps easily. But it's nothing that can't be added when needed.

Thank you, will merge!