Closed · luizirber closed this issue 4 years ago
Ok, I think we need a very simple API, with only three public functions and one structure.
Something like this:

```rust
pub struct CompressionFormat;

pub fn read_compression(mut in_stream: Box<dyn io::Read + io::Seek>) -> CompressionFormat;
pub fn read(input_name: &str) -> Result<(Box<dyn io::Read + io::Seek>, CompressionFormat), Error>;
pub fn write(output_name: &str, format: CompressionFormat) -> Result<Box<dyn io::Write>, Error>;
```

With this API we work only on `std::io::Read + std::io::Seek`. Working with non-`Seek` readable streams is harder, and I think we can postpone that functionality to a later version.
We could maybe build a configuration system to let users choose which library they want to use for each type of file?
On `std::io::Seek`: I remember seeing this comment by @bovee in https://github.com/onecodex/needletail/pull/26:

> automatic decompression now takes `Read` instead of `Read + Seek` so we can handle e.g. gzip files piped in through stdin

Is this relevant to your proposal?
`read`/`write` sound a bit like actions. How about `reader`/`writer`, or `get_reader`/`get_writer`? Although it is a bit more verbose...
I used `failure` for errors, but `thiserror` seems less intrusive (for a library).
What do you think about releasing `1.0.0` with the current content, and then iterating on an improved API before releasing `2.0.0`? Rust crates have this tendency of staying in `0.x` waiting for the perfect API, but just bumping the major version is going to work well with cargo (and won't break usage of `1.x`).
Thanks for pointing me to needletail, it looks interesting and will probably be useful for another project :) I explored the solution they use and it's cool (chaining of streams), but I'm a little bit concerned about the effect of this method on performance (probably for no reason). We could maybe have a `_seekable` version of the read function.
`get_reader`/`get_writer` seems perfect to me, but if you really prefer `reader`/`writer` that's okay with me.
I have no opinion on this point.
I think we can release version `1.0.0` by just cleaning up the current source code.
I did a little perf checking of the chaining back when I implemented it in needletail, but I just got a little paranoid about it again so I threw together a small benchmark comparing it (and another implementation where you wrap with `BufReader`) against the base case:
https://github.com/bovee/inline_gzip
It still looks like any perf hit is basically negligible compared to buffer-sizing effects? Even if there were a small hit, I think the convenience of being able to read out of stdin is a huge win (and since flate2 isn't as performant as pigz, you can swap upwards as you need more speed in your pipeline).
Many thanks @bovee !!
Indeed, the differences in performance are negligible.
I am now convinced that we can use your iterator chain method.
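For readers unfamiliar with the chaining trick discussed above, here is a minimal std-only sketch (the `sniff_gzip` helper is hypothetical, not part of any crate): we read the magic bytes from a plain `Read` (no `Seek` required, so stdin works), then `chain` the consumed bytes back in front of the remaining stream so nothing is lost.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical helper: sniff the gzip magic bytes from any `Read`,
/// then return a stream that still yields every byte of the input.
fn sniff_gzip(mut stream: Box<dyn Read>) -> io::Result<(bool, Box<dyn Read>)> {
    let mut magic = [0u8; 2];
    let n = stream.read(&mut magic)?;
    let is_gzip = n == 2 && magic == [0x1f, 0x8b];
    // Chain the bytes we already consumed back onto the rest of the stream.
    let restored: Box<dyn Read> = Box::new(Cursor::new(magic[..n].to_vec()).chain(stream));
    Ok((is_gzip, restored))
}

fn main() -> io::Result<()> {
    let data = vec![0x1f, 0x8b, 0x08, 0x00];
    let (is_gzip, mut restored) = sniff_gzip(Box::new(Cursor::new(data.clone())))?;
    let mut buf = Vec::new();
    restored.read_to_end(&mut buf)?;
    assert!(is_gzip);
    assert_eq!(buf, data); // sniffing consumed nothing from the caller's view
    Ok(())
}
```

The extra `Cursor` holds at most two bytes, which is consistent with the benchmark result above: the cost is negligible next to buffer-sizing effects.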
@luizirber I'll create a pull request tomorrow, and we can discuss implementation details in that pull request?
Any time! It'd be great to tear these functions out of needletail so they could be used more widely so looking forward to seeing where this repo goes. :)
And a couple ideas about the API above (ordered from useful to bikeshed-y):

- Exposing `read_stream`/`write_stream` methods would be really handy if you already have the `Read`/`Write` object and just want transparent (de)compression.
- Rather than using a sentinel value like `"-"` in `read(...)`, maybe `input_name` should take an `Option<&str>`? This requires the caller to have boilerplate though, so maybe just implementing the above `*_stream` methods is enough.
- `write` could guess the compression type from the extension on `output_name`? This wouldn't work for writing to `stdout` though, so maybe not the best idea.
- For naming, including compress or decompress in the names might improve clarity? Like `compress_to_path` or `decompress_from_path`?

@luizirber I'll create a pull request tomorrow, and we can discuss implementation details in that pull request?
Sounds great! I set up a bunch of CI during the weekend (and I just noticed you created #3 \o/)
> Any time! It'd be great to tear these functions out of needletail so they could be used more widely so looking forward to seeing where this repo goes. :)

I don't think this crate needs to cover every single use case out there, but there seem to be many reimplementations of similar ideas in many places. It would be great to be useful to needletail too!
> Exposing `read_stream`/`write_stream` methods would be really handy if you already have the `Read`/`Write` object and just want transparent (de)compression.

yesssss... Especially for places that don't really have files/paths (like wasm32 =])
> Rather than using a sentinel value like `"-"` in `read(...)`, maybe `input_name` should take an `Option<&str>`? This requires the caller to have boilerplate though, so maybe just implementing the above `*_stream` methods is enough.

I'm OK with `-` because many CLI programs take `-` as meaning `stdin` or `stdout`.
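The `-` convention mentioned here can be sketched in a few lines (the `open_input` helper and its name are hypothetical, for illustration only):

```rust
use std::fs::File;
use std::io::{self, Read};

/// Hypothetical helper: map the conventional "-" sentinel to stdin,
/// and any other name to a regular file.
fn open_input(name: &str) -> io::Result<Box<dyn Read>> {
    if name == "-" {
        Ok(Box::new(io::stdin()))
    } else {
        Ok(Box::new(File::open(name)?))
    }
}

fn main() -> io::Result<()> {
    // Demonstrate the file branch with a throwaway temp file.
    let path = std::env::temp_dir().join("open_input_demo.txt");
    std::fs::write(&path, b"hello")?;
    let mut reader = open_input(path.to_str().unwrap())?;
    let mut contents = String::new();
    reader.read_to_string(&mut contents)?;
    assert_eq!(contents, "hello");
    Ok(())
}
```

Because both branches return `Box<dyn Read>`, the caller never needs to know whether the data came from stdin or a file.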
> It might not be the best idea, but `write` could guess the compression type from the extension on `output_name`? This wouldn't work for writing to `stdout` though, so maybe not the best idea.

Yeah, I think this one can lead to many surprising corner cases (like the `stdout` one).
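The extension-guessing idea, and its corner cases, can be sketched like this (the `Format` enum and `guess_from_name` function are hypothetical names, not an actual crate API):

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Format {
    Gzip,
    Bzip,
    Lzma,
    No, // plain, uncompressed output
}

/// Hypothetical sketch: guess the compression format from the file
/// extension. Note the corner cases: "-" (stdout) has no extension,
/// and the name can simply lie about the actual content.
fn guess_from_name(name: &str) -> Format {
    match Path::new(name).extension().and_then(|ext| ext.to_str()) {
        Some("gz") => Format::Gzip,
        Some("bz2") => Format::Bzip,
        Some("xz") => Format::Lzma,
        _ => Format::No,
    }
}

fn main() {
    assert_eq!(guess_from_name("reads.fq.gz"), Format::Gzip);
    assert_eq!(guess_from_name("-"), Format::No); // stdout sentinel: no guess possible
}
```

The `"-"` case falls through to `Format::No`, which is exactly the surprising behavior being objected to above: the guess silently degrades instead of failing.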
> For naming, including compress or decompress in the names might improve clarity? Like `compress_to_path` or `decompress_from_path`?

I like `write`/`read` better here because they don't necessarily `compress`/`decompress` the content (they work with uncompressed files too).
Finally, I made the pull request this evening.
About your suggestions, which are obviously welcome:

- `get_reader`/`get_writer` take a stream and provide a stream, no filename. The API doesn't need more than this; I want to follow the KISS philosophy, but if you want more things in the API, no problem.
- For a stream-in/stream-out function, we don't have access to `input_name`.
- No `to_path` function in the API: users can open/create files themselves, and they can wrap the stream in a buffer or other stuff like that.

@luizirber I am happy to see that we agree on how the API should be.
But for the moment (in branch api_1.0) we do not expose any function that lets users test the compression of their file themselves. Maybe it's a good idea to have:

```rust
read_compression(in_stream: Box<dyn io::Read>) -> Result<(CompressionFormat, Box<dyn io::Read>), NifflerError>
detect_compression(in_stream: Box<dyn io::Read + io::Seek>) -> Result<CompressionFormat, NifflerError>
```
If you have better ideas for function names I'll take them.
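The `Seek`-based variant proposed above could look something like this std-only sketch (the signature is illustrative, not the final niffler API; only gzip detection is shown): peek at the magic bytes, then rewind so the caller's stream is untouched.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom, Result};

#[derive(Debug, PartialEq)]
enum CompressionFormat {
    Gzip,
    No,
}

/// Illustrative sketch of a Seek-based detector: read the magic bytes,
/// then seek back so the stream position is unchanged for the caller.
fn detect_compression<R: Read + Seek>(stream: &mut R) -> Result<CompressionFormat> {
    let mut magic = [0u8; 2];
    let n = stream.read(&mut magic)?;
    stream.seek(SeekFrom::Current(-(n as i64)))?; // rewind what we peeked
    Ok(if n == 2 && magic == [0x1f, 0x8b] {
        CompressionFormat::Gzip
    } else {
        CompressionFormat::No
    })
}

fn main() -> Result<()> {
    let mut data = Cursor::new(vec![0x1f, 0x8b, 0x08]);
    assert_eq!(detect_compression(&mut data)?, CompressionFormat::Gzip);
    assert_eq!(data.position(), 0); // stream was rewound
    Ok(())
}
```

Compared to the chain-based `read_compression` signature, this version needs no re-wrapping of the stream, but it cannot work on stdin.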
From the conversation on Twitter: I think a variation of the original `get_input`/`get_output` (taking filenames) is still quite useful. What do you think about adding `from_path`/`to_path` functions:

```rust
pub fn from_path<P: AsRef<Path>>(path: P)
    -> Result<(Box<dyn io::Read>, compression::Format), Error>;

pub fn to_path<P: AsRef<Path>>(
    path: P,
    format: compression::Format,
    level: compression::Level
) -> Result<Box<dyn io::Write>, Error>;
```
If the aim is to be high level, then a simple reader function that takes a file name and returns an iterable object that can be read line by line is very useful. Same with a writer function that has the standard `.write`, `.write_all`... functionality. The whole thing should be transparent to the user, as if they had used `File::open` or `File::create`.
Here is an example in Python to show what I mean by high level and being transparent to the user:

```python
import gzip

def myopen(_file, mode="rt"):
    if _file.endswith(".gz"):
        return gzip.open(_file, mode=mode)
    else:
        return open(_file, mode=mode)
```

After using `myopen`, the file handle object behaves as expected for a normal file.
Thanks for the conversation and taking my comments into account!
Hi @enormandeau,
In fact your point is already addressed: `get_reader` and `get_writer` return `Box<dyn io::Read>` and `Box<dyn io::Write>`, and these implement `io::Read` and `io::Write` just like the results of `File::open()` and `File::create()` do.

When `from_path`/`to_path` is merged, your point will be completely solved, and even better resolved, because we don't trust the name of the file. We should never trust a file name.

Thanks for your feedback!
Thanks for taking the time to reply @natir. I understand your position. I never developed a library for others to use but your vision (about the best bugs being other people's bugs and problems) sounds right :) I like the idea of your library niffler and will watch it closely. I think something like this is painfully missing from Rust.
I'll try to dig into the documentation more and use `get_reader` and `get_writer`.
How stable would you say niffler is at this point? How likely is the API to change significantly?
> How stable would you say niffler is at this point? How likely is the API to change significantly?

For me (@luizirber may have a different opinion), niffler is quite stable, and after the integration of `from_path`/`to_path` the public API will not change for a long time.
@natir mentioned API improvements in a twitter discussion, moving it here =]