luizirber / niffler

Simple and transparent support for compressed files.
Apache License 2.0

API improvements #1

Closed luizirber closed 4 years ago

luizirber commented 5 years ago

@natir mentioned API improvements in a twitter discussion, moving it here =]

natir commented 4 years ago

Ok, I think we need a very simple API, with only three public functions and one structure.

Something like this:

pub struct CompressionFormat;

// `dyn io::Read + io::Seek` is not a valid trait object, so a combined
// helper trait is needed to express "readable and seekable":
pub trait ReadSeek: io::Read + io::Seek {}
impl<T: io::Read + io::Seek> ReadSeek for T {}

pub fn read_compression(in_stream: &mut dyn ReadSeek) -> CompressionFormat;

pub fn read(input_name: &str) -> Result<(Box<dyn ReadSeek>, CompressionFormat), Error>;
pub fn write(output_name: &str, format: CompressionFormat) -> Result<Box<dyn io::Write>, Error>;

With this API we only work on std::io::Read + std::io::Seek. Working with non-seekable readers is harder, and I think we can move that functionality to a later version.

Maybe we could also build a configuration system to let users choose which library they want to use for each type of file?

luizirber commented 4 years ago

What do you think about releasing 1.0.0 with the current content, and then iterating on an improved API for a 2.0.0 release? Rust crates have a tendency to stay in 0.x waiting for the perfect API, but just bumping the major version works well with cargo (and won't break usage of 1.x).

natir commented 4 years ago

I think we can release version 1.0.0 after just cleaning up the current source code.

bovee commented 4 years ago

I did a little perf checking of the chaining back when I implemented it in Needletail, but I just got a little paranoid about it again, so I threw together a small benchmark comparing it (and another implementation where you wrap with BufReader) against the base case:

https://github.com/bovee/inline_gzip

It still looks like any perf hit is basically negligible compared to buffer-sizing effects? Even if there were a small hit, I think the convenience of being able to read out of stdin is a huge win (and since flate2 isn't as performant as pigz, you can swap upwards as you need more speed in your pipeline).
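For illustration, a minimal sketch of the chaining trick under discussion: read the magic bytes off a non-seekable stream, then splice them back in front with Read::chain so downstream readers still see the full stream. The function name and the gzip-only check are assumptions for the example, not Needletail's or niffler's actual code.

use std::io::{self, Read};

fn sniff(mut stream: Box<dyn Read>) -> io::Result<(Box<dyn Read>, bool)> {
    let mut magic = [0u8; 2];
    stream.read_exact(&mut magic)?;
    let is_gzip = magic == [0x1f, 0x8b]; // gzip magic bytes
    // Chain the consumed bytes back in front of the rest of the stream.
    let rebuilt: Box<dyn Read> = Box::new(io::Cursor::new(magic.to_vec()).chain(stream));
    Ok((rebuilt, is_gzip))
}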

natir commented 4 years ago

Many thanks @bovee !!

Indeed, the differences in performance are negligible.

I am now convinced that we can use your iterator chain method.

@luizirber I'll create a pull request tomorrow, and we can sort out the implementation details there?

bovee commented 4 years ago

Any time! It'd be great to tear these functions out of needletail so they could be used more widely, so I'm looking forward to seeing where this repo goes. :)

And a couple of ideas about the API above (ordered from useful to bikeshed-y); they are quoted inline in the reply below:

luizirber commented 4 years ago

> @luizirber I'll create a pull request tomorrow, and we can sort out the implementation details there?

Sounds great! I set up a bunch of CI during the weekend (and I just noticed you created #3 \o/)

> Any time! It'd be great to tear these functions out of needletail so they could be used more widely, so I'm looking forward to seeing where this repo goes. :)

I don't think this crate needs to cover every single use case out there, but there seem to be many reimplementations of similar ideas in many places. It would be great to be useful to needletail too!

> Exposing read_stream/write_stream methods would be really handy if you already have the Read/Write object and just want transparent (de)compression.

yesssss... Especially for places that don't really have files/paths (like wasm32 =])

> Rather than using a sentinel value like "-" in read(...), maybe input_name should take an Option<&str>? That requires boilerplate from the caller, though, so maybe just implementing the above *_stream methods is enough.

I'm OK with - because many CLI programs take - as meaning stdin or stdout

> It might not be the best idea, but write could guess the compression type from the extension on output_name? This wouldn't work for writing to stdout, though, so maybe not the best idea.

Yeah, I think this one can lead to many surprising corner cases (like the stdout one).

> For naming, including compress or decompress in the names might improve clarity? Like compress_to_path or decompress_from_path?

I like write/read better here because it doesn't necessarily compress/decompress the content (it works with uncompressed files too).
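As an aside, a minimal sketch of that "-" convention: map "-" to stdin and anything else to a regular file. The function name is hypothetical, and this is not niffler's confirmed behavior.

use std::fs::File;
use std::io::{self, Read};

fn open_input(name: &str) -> io::Result<Box<dyn Read>> {
    // "-" means stdin, following the common CLI convention.
    if name == "-" {
        Ok(Box::new(io::stdin()))
    } else {
        Ok(Box::new(File::open(name)?))
    }
}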

natir commented 4 years ago

In the end I made the pull request this evening.

About your suggestions, which are obviously welcome:

  1. For me the API just needs two public functions, get_reader/get_writer: they take a stream and return a stream, no filename. The API doesn't need anything more; I want to follow the KISS philosophy, but if you want more in the API, no problem (see the sketch after this list).
  2. This behavior comes from natir/yacrd, but I think it has no place in a lib.
  3. I don't like this idea, and if we keep just two public stream-in/stream-out functions, we don't have access to input_name anyway.
  4. I don't want to add to_path functions to the API; users can open/create files themselves and wrap the stream in a BufReader or other such thing.
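
A minimal sketch of that stream-only shape, assuming get_reader/get_writer signatures along the lines discussed in this thread (the exact types, the compression::Level argument, and the error conversion are assumptions, not the merged API):

use std::io::{self, Read, Write};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // No filenames anywhere: decompress stdin and recompress to stdout.
    let (mut reader, format) = niffler::get_reader(Box::new(io::stdin()))?;
    let mut writer =
        niffler::get_writer(Box::new(io::stdout()), format, niffler::compression::Level::One)?;

    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    writer.write_all(&buf)?;
    Ok(())
}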
natir commented 4 years ago

@luizirber I am happy to see that we agree on how the API should be.

But for the moment (in the api_1.0 branch) we do not expose any function that lets users test the compression of a file themselves. Maybe it would be a good idea to have:

read_compression(in_stream: Box<dyn io::Read>) -> Result<(CompressionFormat, Box<dyn io::Read>), NifflerError>

detect_compression(in_stream: Box<dyn io::Read + io::Seek>) -> Result<CompressionFormat, NifflerError>

If you have better ideas for function names I'll take them.
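
A minimal sketch of the seekable variant, assuming detection only needs a couple of magic bytes (gzip shown; the real format list is longer): because the stream can be rewound, only the format needs to be returned.

use std::io::{self, Read, Seek, SeekFrom};

// With a seekable stream we can rewind after sniffing, so there is no
// need to hand a reconstructed stream back to the caller.
fn detect_gzip<R: Read + Seek>(stream: &mut R) -> io::Result<bool> {
    let mut magic = [0u8; 2];
    stream.read_exact(&mut magic)?;
    // Rewind so the caller can read the stream from the beginning.
    stream.seek(SeekFrom::Start(0))?;
    Ok(magic == [0x1f, 0x8b])
}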

luizirber commented 4 years ago

From a conversation on Twitter: I think a variation of the original get_input/get_output (taking filenames) is still quite useful. What do you think about adding from_path/to_path functions:

pub fn from_path<P: AsRef<Path>>(
    path: P
) -> Result<(Box<dyn io::Read>, compression::Format), Error>;

pub fn to_path<P: AsRef<Path>>(
    path: P,
    format: compression::Format,
    level: compression::Level
) -> Result<Box<dyn io::Write>, Error>;
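
A hypothetical usage sketch of these proposed helpers; the compression::Level variant and the error type conversion are assumptions based only on the signatures above.

use std::io::{Read, Write};

fn recompress() -> Result<(), Box<dyn std::error::Error>> {
    // The input format is detected from the file content, not its name.
    let (mut reader, format) = niffler::from_path("input.fasta.gz")?;
    // Reuse the detected format for the output file.
    let mut writer = niffler::to_path("output.fasta.gz", format, niffler::compression::Level::One)?;

    let mut data = Vec::new();
    reader.read_to_end(&mut data)?;
    writer.write_all(&data)?;
    Ok(())
}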
enormandeau commented 4 years ago

If the aim is to be high level, then a simple reader function that takes a file name and returns an iterable object you can read line by line is very useful. Same with a writer function that has the standard .write, .write_all... functionality. The whole thing should be transparent to the user, as if they had used File::open or File::create.

Here is an example in Python to show what I mean by high level and being transparent to the user:

import gzip

def myopen(_file, mode="rt"):
    # Pick gzip transparently based on the file extension.
    if _file.endswith(".gz"):
        return gzip.open(_file, mode=mode)
    else:
        return open(_file, mode=mode)

After using myopen, the file handle object behaves as expected for a normal file.

Thanks for the conversation and taking my comments into account!
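
For comparison, a minimal Rust sketch of the same transparency, assuming the get_reader signature discussed earlier in this thread: wrapping the returned reader in a BufReader gives the line-by-line iteration asked for above.

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Detection is based on the file content, not the ".gz" extension.
    let (reader, _format) = niffler::get_reader(Box::new(File::open("data.txt.gz")?))?;
    for line in BufReader::new(reader).lines() {
        println!("{}", line?);
    }
    Ok(())
}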

natir commented 4 years ago

Hi @enormandeau,

In fact your point is already covered: get_reader and get_writer return Box<dyn io::Read> and Box<dyn io::Write>, and those implement io::Read and io::Write just like the handles from File::open() and File::create().

When from_path/to_path is merged your point will be fully solved, and even better solved, because we don't trust the name of the file. We should never trust a file name.

Thanks for your feedback

enormandeau commented 4 years ago

Thanks for taking the time to reply @natir. I understand your position. I have never developed a library for others to use, but your vision (about the best bugs being other people's bugs and problems) sounds right :) I like the idea of niffler and will watch it closely. I think something like this is painfully missing from Rust.

I'll try to dig in the documentation more and use get_reader and get_writer.

How stable would you say niffler is at this point? How likely is the API to change significantly?

natir commented 4 years ago

> How stable would you say niffler is at this point? How likely is the API to change significantly?

For me (@luizirber may have a different opinion), niffler is quite stable, and after the integration of from_path/to_path the public API will not change for a long time.