rust-lang / libs-team

The home of the library team
Apache License 2.0
116 stars 18 forks source link

ACP: String-splitting iterators with indices #222

Closed clarfonthey closed 1 year ago

clarfonthey commented 1 year ago

Proposal

Problem statement

When using the split and similar iterators for strings, only the string slices are returned, without any indication of where they can be found in the original string. This is useful for processing the lines individually, but prevents later processing of the entire string based upon these lines.

Motivation, use-cases

Sentinel lines

Some file formats use sentinel lines to delimit the specific sections, like pacman's "desc" format:

%NAME%
rustup

%VERSION%
1.25.2-1

%BASE%
rustup

%DESC%
The Rust toolchain installer

%URL%
https://github.com/rust-lang/rustup.rs

%ARCH%
x86_64

%BUILDDATE%
1675514879

%INSTALLDATE%
1676666348

%PACKAGER%
Orhun Parmaksız <orhun@archlinux.org>

%SIZE%
8688993

%LICENSE%
MIT
Apache

%VALIDATION%
pgp

%REPLACES%
cargo-tree

%DEPENDS%
curl
xz
zstd

%OPTDEPENDS%
lldb: rust-lldb script
gdb: rust-gdb script

%CONFLICTS%
rust
cargo
rustfmt

%PROVIDES%
rust
cargo
rust-nightly
cargo-nightly
rustfmt
rust-src
lib32-rust-libs
rust-musl
rust-wasm

In these cases, it seems most useful to be able to iterate via lines, search for the sentinel sections, then later pass these individual sections to different parts of processing. While this could be accomplished by using a state machine inside the iterator, this would likely be more complicated to read and potentially less performant.

Isolating lines/spans

Imagine an iterator that parses the lines of a string into some other type, where the error returns the line that failed. If the type is parsed from a string slice, then this isn't a problem, since the error can have the same lifetime as the original slice. However, if the type being parsed is an owned string, then the specific line can't be returned directly, and instead has to be allocated into a separate string to be returned. If indices to the line were provided, then these could simply be saved and the original string buffer could be truncated to just the line, saving on allocations.

Additionally, beyond this (relatively contrived) example, indices are also useful for expanding failed lines into a multi-line span, like Rust does with its error messages. If the position of the line is obtained, then the string can be scanned backward and forward some number of lines back, and those indices can be used instead to slice a larger span. Without this, the caller would have to keep a buffer of some number of lines back, then later concatenate them into a larger string on error, which seems like way more work.

Note on workaround

It is possible to create this today by accumulating the total length of lines passed to the iterator to ultimately create an equivalent position per line, although this only works with split_inclusive since split delimiters can have variable length. While lines uses split_inclusive, this means that the lines case is possible to replicate, but it feels like the cases are common enough to allow a dedicated method.

There is also the alternative match_indices which will provide the exact indices necessary for not replicating too much work on the user's end, but again, it feels like the benefit outweighs the cost here.

Solution sketches

While the main desire here is for a line_indices iterator, many splitting iterators all use an internal SplitInternal iterator, and it feels reasonable to create a with-indices version of this, then augment all of the related methods to have a with-indices version. This would look something like:

impl str {
    pub fn line_indices(&self) -> LineIndices<'a>;
    pub fn split_indices<'a, P: Pattern<'a>>(&'a self, pat: P) -> SplitIndices<'a, P>;
    pub fn split_inclusive_indices<'a, P: Pattern<'a>>(&'a self, pat: P) -> SplitInclusiveIndices<'a, P>;
    pub fn rsplit_indices<'a, P: Pattern<'a, Searcher: ReverseSearcher<'a>>>(&'a self, pat: P) -> RSplitIndices<'a, P>;
    pub fn split_terminator_indices<'a, P: Pattern<'a>>(&'a self, pat: P) -> SplitTerminatorIndices<'a, P>;
    pub fn rsplit_terminator_indices<'a, P: Pattern<'a, Searcher: ReverseSearcher<'a>>>(&'a self, pat: P) -> RSplitTerminatorIndices<'a, P>;
}

And all of the iterators would return (usize, &str) items instead of just &str, to mirror char_indices which returns (usize, char).

This could potentially be extended to the splitn variants, which have a separate SplitNInternal iterator:

impl str {
    pub fn splitn_indices<'a, P: Pattern<'a>>(&'a self, n: usize, pat: P) -> SplitNIndices<'a, P>;
    pub fn rsplitn_indices<'a, P: Pattern<'a, Searcher: ReverseSearcher<'a>>>(&'a self, n: usize, pat: P) -> RSplitNIndices<'a, P>;
}

Links and related work

N/A for now

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

pitaj commented 1 year ago

This runs into the same issue as filling out the str API with things like rsplit_inclusive: it adds a lot of methods that are pretty niche.

This is useful however and could easily be built on top of my proposal for a string splitting builder API: https://github.com/rust-lang/libs-team/issues/168

m-ou-se commented 1 year ago

Is this the best/simplest API for this? What about e.g. a .index() on the existing Split types? Are there other (simple) alternatives?

joshtriplett commented 1 year ago

This seems like a large number of additional APIs for something that could have a few different alternatives.

This seems like it could be done by checking the offset of the &str from the original &str. We don't currently seem to have a clean API for that, but we could add one, and that seems like it'd solve various problems including this one.

BurntSushi commented 1 year ago

We don't currently seem to have a clean API for that, but we could add one, and that seems like it'd solve various problems including this one.

I like this idea.

clarfonthey commented 1 year ago

I personally like @pitaj's idea of using a builder API for this. Honestly, despite the builder pattern being incredibly widespread across the Rust ecosystem, we don't see it too much in the standard library. I'm going to close this in favour of that proposal.