kbknapp opened 6 years ago
> Using `String` as an input type for arbitrary data that doesn't necessarily need to be UTF-8
I'm not sure that in the year 2018 anyone should not be expecting UTF-8 input :thinking:
@spacekookie If your tool is designed to process arbitrary files, and that processing can be done in a somewhat encoding-agnostic way (for example, one might only need the assumption that the text is ASCII compatible), then you really can't use `String`. Many of the tools I write fall into this category, so I am personally biased toward seeing these cases represented, but I wouldn't be surprised if this were a common situation when writing tools intended for use by others on Unix-like systems.
On Windows this gets a bit more complicated. If you write a CLI tool for Windows that reads files, then at some point you're likely to come across UTF-16. This can be handled automatically via BOM sniffing and transcoding, but there's no out-of-the-box convenient abstraction to handle this. (There are the nominal `String::from_utf16` methods, which work well if you're OK with reading an entire file into memory. For some tools, that's OK.)
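For illustration, here is a minimal sketch of BOM sniffing combined with the `String::from_utf16` route (not an established abstraction; `read_text` is a hypothetical helper name, and it assumes reading the whole file into memory is acceptable):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical helper: sniff a BOM and decode UTF-16 (LE or BE) via
// `String::from_utf16`, falling back to assuming UTF-8.
fn read_text(path: &Path) -> io::Result<String> {
    let bytes = fs::read(path)?;

    let utf16_le = bytes.starts_with(&[0xFF, 0xFE]);
    let utf16_be = bytes.starts_with(&[0xFE, 0xFF]);
    if utf16_le || utf16_be {
        // Collect u16 code units after the 2-byte BOM. `chunks_exact`
        // silently drops a trailing odd byte; a real implementation
        // would probably want to report that as an error.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|c| {
                if utf16_le {
                    u16::from_le_bytes([c[0], c[1]])
                } else {
                    u16::from_be_bytes([c[0], c[1]])
                }
            })
            .collect();
        String::from_utf16(&units)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
    } else {
        // No BOM: assume UTF-8.
        String::from_utf8(bytes)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
    }
}
```

A streaming version would need incremental transcoding, which is exactly the missing abstraction being discussed here.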
> This can be handled automatically via BOM sniffing and transcoding, but there's no out-of-the-box convenient abstraction to handle this.
Sounds like a good candidate for any interested party in the WG to take on.
> Sounds like a good candidate for any interested party in the WG to take on.
For streaming transcoding, see:
The `String` issue is, in my view, huge. If your CLI app takes a filename but can't handle my non-UTF-8-but-perfectly-valid argument, for example, that's a bug. And in general, non-`String`-but-text data (`&[u8]`, `OsString`) feels very second-class-citizen in Rust. Most libraries/APIs/macros require `String`... such as `std::fmt` (including `Debug` and `Display`). And putting unchecked data into a `String` is considered UB.
Ideally, an overarching trait or similar could be created to extend existing systems to the non-UTF-8 story. I'm sure that's not a 2018 possibility, though, and I would be interested to see what the WG suggests in the meantime.
(Side note, since the OP mentioned the "ideal guide" issue: the quicli docs are mentioned as a "concrete example" there. Their front-page example has `file: String`.)
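As an illustration of the alternative, here is a sketch of taking a filename argument as an `OsString`/`PathBuf` instead of a `String` (the binary name `tool` and the exact error handling are made up for the example):

```rust
use std::env;
use std::fs::File;
use std::path::PathBuf;

fn main() {
    // `args_os` yields `OsString`s, so a non-UTF-8 (but perfectly valid)
    // path argument is not rejected; it converts losslessly to `PathBuf`.
    let path: PathBuf = match env::args_os().nth(1) {
        Some(arg) => PathBuf::from(arg),
        None => {
            eprintln!("usage: tool <file>");
            std::process::exit(2);
        }
    };

    match File::open(&path) {
        // Process the file as bytes rather than forcing it into a `String`.
        Ok(_file) => { /* ... */ }
        Err(e) => eprintln!("failed to open {}: {}", path.display(), e),
    }
}
```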
@jeekobu An RFC was recently merged that adds `Pattern` support to `OsStr`, which should make it a bit nicer to use. IMO, it should also be done for `&[u8]`, but the RFC only addresses `OsStr`.
I think fixing the "programs demand UTF-8, even for file paths" bug is the realm of argv parsers. They should make it very easy to handle file paths correctly.
The `Pattern` extension is why I have some hope that quality `&[u8]` and/or `OsStr` support is a future possibility ☺️. But it's unstable (and thus currently easy to extend or modify), unlike most of `std::fmt` (which is where I think most of the pain comes from).
Anyway, after I made that comment, I found #26, which more directly addresses this hurdle.
Unfortunately I managed to shoot myself in the foot regarding reading lines from stdin; see here:
https://www.reddit.com/r/rust/comments/aaood3/go_version_of_program_is_40_quicker_than_rust/
and the follow-up gist here: https://gist.github.com/djhworld/ced7eabca8e0c06d5b65917e87fb8745
I think it would be great if the CLI book could at least document a recommended way of reading newline-delimited input from stdin to avoid falling into these traps, or at least have an example application that does something trivial like converting input to uppercase.
@djhworld very good point! I have a TODO open for stdin in #95.
I'm not entirely sure what the book should recommend precisely; I'd say using a buffered reader is uncontroversial. Reusing the line buffer, however, requires you to be very careful to always clear it and to not expect to be able to store it permanently without cloning.
Do you think that reading from stdin using a buffered reader will often be a performance issue? Or would other parts of the program usually be vastly slower?
> I'm not entirely sure what the book should recommend precisely; I'd say using a buffered reader is uncontroversial. Reusing the line buffer, however, requires you to be very careful to always clear it and to not expect to be able to store it permanently without cloning.
I think putting this in the book, in almost these exact words, would be super helpful. Coupled with an example (see the sketch below), it would probably help people a lot.
> Do you think that reading from stdin using a buffered reader will often be a performance issue?
I've seen the issue of folks unknowingly acquiring a lock on stdin over and over come up several times. It seems that, at least for trivial benchmarks / exploratory work, this is an actual issue. I suspect this might also be an actual bottleneck for people building filtering shell tools similar to `grep` or `jq`.
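A minimal sketch of what such a book example might look like, assuming the buffered-reader-plus-reused-buffer approach discussed above (the uppercase transform is just the trivial example suggested earlier):

```rust
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    // Lock stdin and stdout once, up front, instead of re-locking on
    // every read/write call.
    let stdin = io::stdin();
    let mut input = stdin.lock(); // `StdinLock` implements `BufRead`
    let stdout = io::stdout();
    let mut output = stdout.lock();

    // Reuse one heap allocation for every line.
    let mut line = String::with_capacity(1024);
    while input.read_line(&mut line)? > 0 {
        // `read_line` keeps the trailing newline, so plain `write!` is fine.
        write!(output, "{}", line.to_uppercase())?;
        // Forgetting this `clear` is the footgun: the buffer would keep
        // accumulating every line read so far.
        line.clear();
    }
    Ok(())
}
```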
> …would be interested to see what the WG suggests in the meantime.
WG is the acronym for ... ??
Working Group
Is there any recommended pattern or best-practice for writing a piping CLI that efficiently reads from and writes to stdin/stdout?
Even having a reference implementation to look at would be useful.
@yuvadm I haven't seen a write-up about it, but I'd love to see one! :)
Maybe have a look at what some projects use? Reading from and writing to std{in,out} should be easy enough, but it gets tricky if you want to handle "broken pipe" errors gracefully. This was also discussed in #21, btw.
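As a rough sketch of the "broken pipe" concern (one possible approach, not necessarily what any particular project does): a filter that copies stdin to stdout and exits quietly when the downstream consumer, e.g. `head`, closes the pipe.

```rust
use std::io::{self, BufRead, Write};

fn main() {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();

    for line in stdin.lock().lines() {
        let line = line.expect("failed to read from stdin");
        if let Err(e) = writeln!(out, "{}", line) {
            // The reader went away; for pipelines this is normal, so exit
            // with success instead of panicking like `println!` would.
            if e.kind() == io::ErrorKind::BrokenPipe {
                return;
            }
            eprintln!("write error: {}", e);
            std::process::exit(1);
        }
    }
}
```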
It seems like we're coming across a few small "footguns", some common, some less so, so it may be nice to collect them in one place so that when writing something like #6 we don't forget anything.
This could happen in #6, but I'd prefer if it were separate so this can dive a little deeper and actually make a list.
Off the top of my head, from stuff that's come up today in the Gitter and elsewhere:

- Reading `stdin` line by line is slow, because it does a heap alloc at each line. Easiest solution is `BufReader::read_line`
- Using `String` as an input type for arbitrary data that doesn't necessarily need to be UTF-8
- `println!` when writing lots of small output, which locks `stdout` at each call. Prefer locking `stdout` manually and using something like `write!` (see the sketch after this list)
- […] (calling `libc::close` manually, and checking for errors)
- `println!` can panic on a broken pipe
- `std::env::args` panics at invalid UTF-8 (the fix is to use `std::env::args_os`, but this is not very intuitive, and with issues like #26 it's far more friction than it should be)

If you've got more I'll add them. Of course, I think we should also be open to discussion about the ones I listed as to why they should, or should not, be listed and/or other fixes for them. :smile:
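To make the `stdout` locking item concrete, a small sketch (the counting loop is just a stand-in for output-heavy work; wrapping the lock in a `BufWriter` is one option among several):

```rust
use std::io::{self, Write};

fn main() -> io::Result<()> {
    // Lock stdout once; each `println!` would otherwise re-lock it
    // (and flush line by line).
    let stdout = io::stdout();
    let mut out = io::BufWriter::new(stdout.lock());

    for i in 0..1_000_000 {
        writeln!(out, "{}", i)?;
    }
    // Flush explicitly so a final write error surfaces here instead of
    // being swallowed when the writer is dropped.
    out.flush()?;
    Ok(())
}
```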
Gitter log of reading `stdin` line by line
@rcoh 18:37: Hello! I'm working on https://github.com/rcoh/angle-grinder — in the process, I noticed that reading from stdin via .lines() is pretty slow. There are ways around it, of course. ripgrep uses a custom-written stdin buffer
@BurntSushi 18:38: yeah, it creates an alloc for each new line. ripgrep's buffer is quite a bit more involved. :grinning: easiest path to something fast is BufReader::read_line
@rcoh 18:38: In any case, I was wondering if there were any plans to make reading lots of data from stdin a more generally supported pattern, or if ripgrep's pattern could be abstracted into a library in some way
@BurntSushi 18:39: ripgrep doesn't really read line-by-line, so it's a bit different
@rcoh 18:40: but I assume at some point in the internals, it's splitting things into lines?
@BurntSushi 18:40: nope
@rcoh 18:40: oh interesting!
@BurntSushi 18:40: only for output
@rcoh 18:40: doesn't grep have line-by-line semantics? Does ripgrep as well?
@BurntSushi 18:41: in the (very common) case that searching a file yields no matches, it is possible that the concept of a "line" never actually materializes at all
@rcoh 18:41: interesting.
@BurntSushi 18:41: see: https://blog.burntsushi.net/ripgrep/#mechanics (GNU grep does the same thing)
@rcoh 18:42: My current approach is: `let mut line = String::with_capacity(1024); while buf.read_line(&mut line).unwrap() > 0 { self.proc_str(&(line)); line.clear(); }` — is that the fastest I can do without resorting to a totally different strategy?
@BurntSushi 18:42: pretty much, yes. if you could use a Vec instead of a String, then you can use read_until and avoid the UTF-8 validation check. depends on what you're doing though!
@BurntSushi: if you're processing data from arbitrary files, then `String` is very likely the wrong thing to use
@rcoh 18:44: it's intended to process log files so it's almost always printable data except in some weird cases
@BurntSushi 18:44: it all depends on whether you're OK with it choking on non-UTF-8 encoded text
@rcoh 18:53: Got it. Thanks for your help! I literally couldn't think of a more qualified person to have answer my question :grinning:
@BurntSushi 18:53: :grinning: FWIW, I do intend on splitting the core ripgrep search code out into the grep crate, which will include the buffer handling. it's likely possible that the API will be flexible enough to do line-by-line parsing as fast as possible (although, avoiding line-by-line will still of course be faster for many workloads)
@rcoh 19:02: Cool. I don't actually need to be line-by-line until after running some filtering so that would be useful for my use case.
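For completeness, a sketch of the `read_until` variant BurntSushi mentions: reading raw bytes into a reused `Vec<u8>` skips the UTF-8 validation that `String` requires (`process` here is a hypothetical stand-in for the per-line work):

```rust
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut input = stdin.lock();

    // Reuse one byte buffer; no UTF-8 validation is performed.
    let mut line = Vec::with_capacity(1024);
    while input.read_until(b'\n', &mut line)? > 0 {
        process(&line); // the trailing b'\n' is included, if present
        line.clear();
    }
    Ok(())
}

// Hypothetical per-line processing; a real tool would do its filtering here.
fn process(_bytes: &[u8]) {}
```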