Closed rsc closed 1 month ago
Iterating over runes is served by a regular range loop
Will Bytes
and Runes
be added to the bytes
package as well? Runes
in particular is not entirely trivial today for byte-slices.
Should Lines
also yield the 1-based line number? It's not always needed but it'd be easy enough to ignore when you don't.
Iterating over bytes in a string is common and too difficult, since range ranges over runes.
Is i, c := range []byte(s)
too difficult? I would say the advantage is that strings.Bytes is more discoverable, but the []byte() version is preferred when you don't need an iterator for some reason.
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
Iterating over bytes in a string is common and too difficult, since range ranges over runes.
Is
i, c := range []byte(s)
too difficult? I would say the advantage is that strings.Bytes is more discoverable, but the []byte() version is preferred when you don't need an iterator for some reason.
That relies on specialized optimizations to avoid an allocation and a copy. I'm not sure if those optimizations exist right now. The current best way to do it, as far as I know, is
for i := 0; i < len(s); i++ {
c := s[i]
// ...
}
I don't think that's particularly difficult, personally, but I think that strings.Bytes()
could still be useful as a feed in to a transformative iterator.
Similar to Lines, there should be iterator forms of Split, Fields, and Runes
Unless I misunderstand I think you just mean Split and Fields. And there is another misplaced reference to Runes a few lines down.
I like this overall. Two things I wonder about:
1) It seems to me that Lines
should trim the line ending from the string. That's how the existing similar tool bufio.ScanLines
works, and it's almost always what you want, otherwise you have to call strings.TrimSpace
or strings.TrimRight
before you use the string. It seems to me for the few cases you really need the line ending (or the ability to rebuild the exact input text) you can use strings.Index
manually.
2) I wonder if SplitSeq
and FieldsSeq
will pay their way. I often need the length of the split result, for example with ad-hoc parsing of separator-separated values, you often want to know the number of columns/fields. (Though I admit that since strings.Cut
, that takes care of quite a few previous use cases.) Should we do some analysis of this in real-world code to determine whether it's worth adding these?
Is i, c := range []byte(s) too difficult?
Don‘t think so. And shouldn‘t there be only one way of doing things in go?
You can pass iterators to other functions instead of just immediately using it in a range so being able to make iterators for built in things is still useful.
ISTM that strings.Lines encourages people to write:
buf, _ := os.ReadFile(
for _, line := range strings.Lines(string(buf)) { ... }
I'd rather see an iterator version of bufio.Scanner.
@aarzilli I think it's pretty trivial to add an All()
method to bufio.Scanner
. strings.Lines
would still be useful. Note that s := bufio.NewScanner(strings.NewReader(str)); for _, line := s.All() { … }
requires to copy data around a bunch (because of how io.Reader
works), while for _, line := strings.Lines(str) { … }
doesn't copy anything.
Bytes(s string) iter.Seq2[int, byte]
I wonder if it's necessary to return the index for bytes, since it's trivial to count them yourself. I guess it's nice to mirror the for/range statement.
func SplitSeq(s, sep string) iter.Seq[string]
Should this return the start index of the string too?
Similar to Lines, there should be iterator forms of Split, Fields, and Runes
Should we have SplitAfterNSeq and SplitNSeq too? If not, why? These are the only ones left out.
I wonder if SplitSeq and FieldsSeq will pay their way. I often need the length of the split result, for example with ad-hoc parsing of separator-separated values, you often want to know the number of columns/fields. (Though I admit that since strings.Cut, that takes care of quite a few previous use cases.) Should we do some analysis of this in real-world code to determine whether it's worth adding these?
If you need to know the length, the existing functions can handle that case.
Join would be useful to have. For example, just now I wanted to make a String() method for a named slice. The easy way to do it would be strings.JoinAll(iterutil.Map(slices.Values(s), Element.String), ", ")
.
Finishing this proposal discussion is blocked on #61405.
Change https://go.dev/cl/558735 mentions this issue: bytes, strings: add SplitSeq, SplitAfterSeq, FieldsSeq, FieldsFuncSeq
Have all remaining concerns about this proposal been addressed?
The proposal is to add these to both bytes and strings:
// Lines returns an iterator over the newline-terminated lines in the string s.
// The lines yielded by the iterator include their terminating newlines.
// If s is empty, the iterator yields no lines at all.
// If s does not end in a newline, the final yielded line will not end in a newline.
func Lines(s string) iter.Seq[string]
// SplitSeq returns an iterator over all substrings of s separated by sep.
// The iterator yields the same strings that would be returned by Split(s, sep),
// but without constructing the slice.
func SplitSeq(s, sep string) iter.Seq[string]
// SplitAfterSeq returns an iterator over substrings of s split after each instance of sep.
// The iterator yields the same strings that would be returned by SplitAfter(s, sep),
// but without constructing the slice.
func SplitAfterSeq(s, sep string) iter.Seq[string]
// FieldsSeq returns an iterator over substrings of s split around runs of
// whitespace characters, as defined by unicode.IsSpace.
// The iterator yields the same strings that would be returned by Fields(s),
// but without constructing the slice.
func FieldsSeq(s string) iter.Seq[string]
// FieldsFuncSeq returns an iterator over substrings of s split around runs of
// Unicode code points satisfying f(c).
// The iterator yields the same strings that would be returned by FieldsFunc(s),
// but without constructing the slice.
func FieldsFuncSeq(s string, f func(rune) bool) iter.Seq[string]
// Lines returns an iterator over the newline-terminated lines in the string s. // The lines yielded by the iterator include their terminating newlines.
I would not expect them to include the terminating newlines if I didn't read the documentation.
I would not expect them to include the terminating newlines if I didn't read the documentation.
Good reason to read the documentation! 😄
If the newlines are not included, then the user of the iterator cannot distinguish Lines("abc") from Lines("abc\n"). It is often important to know whether a file has a non-terminated final line.
If the 'last' line does end in a newline, does Lines()
yield an empty line without a newline after it? In other words, would Lines("example\n")
yield just "example\n"
or would it yield "example\n"
and then ""
?
Edit: And if it does yield that last line, how is Lines(str)
any different from SplitAfterSeq(str, "\n")
besides having a cleaner, shorter name?
I know there are cases where it matters if the file ends with a newline or not but I don't think I have ever had to deal with that in any code I have ever written.
I think the reason to keep the \n is then you don't have to document what happens to \r\n.
If the 'last' line does end in a newline, does Lines() yield an empty line without a newline after it? In other words, would Lines("example\n") yield just "example\n" or would it yield "example\n" and then ""?
It does not yield a final "".
I would not expect them to include the terminating newlines if I didn't read the documentation.
Good reason to read the documentation! 😄
If the newlines are not included, then the user of the iterator cannot distinguish Lines("abc") from Lines("abc\n"). It is often important to know whether a file has a non-terminated final line.
@rsc In my experience, I almost never care if a file ends in a newline or not and I'd want all newlines to be stripped off, so I can easily imagine having to use FieldsFuncSeq("abc\n", func(c rune) bool { return c == '\n' || c == '\r' })
a lot instead, and I wonder if it should be documented in Lines
's docs. What do you think?
Other than this concern of mine, this proposal LGTM. :+1:
Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group
The proposal is to add these to both bytes and strings:
// Lines returns an iterator over the newline-terminated lines in the string s.
// The lines yielded by the iterator include their terminating newlines.
// If s is empty, the iterator yields no lines at all.
// If s does not end in a newline, the final yielded line will not end in a newline.
func Lines(s string) iter.Seq[string]
// SplitSeq returns an iterator over all substrings of s separated by sep.
// The iterator yields the same strings that would be returned by Split(s, sep),
// but without constructing the slice.
func SplitSeq(s, sep string) iter.Seq[string]
// SplitAfterSeq returns an iterator over substrings of s split after each instance of sep.
// The iterator yields the same strings that would be returned by SplitAfter(s, sep),
// but without constructing the slice.
func SplitAfterSeq(s, sep string) iter.Seq[string]
// FieldsSeq returns an iterator over substrings of s split around runs of
// whitespace characters, as defined by unicode.IsSpace.
// The iterator yields the same strings that would be returned by Fields(s),
// but without constructing the slice.
func FieldsSeq(s string) iter.Seq[string]
// FieldsFuncSeq returns an iterator over substrings of s split around runs of
// Unicode code points satisfying f(c).
// The iterator yields the same strings that would be returned by FieldsFunc(s),
// but without constructing the slice.
func FieldsFuncSeq(s string, f func(rune) bool) iter.Seq[string]
@rsc I'm not seeing any mention of FieldsFuncSeq
in Lines
' proposed doc as per my previous comment, so I was wondering if you had any thoughts? :)
How about func Lines(string) iter.Seq2[string, bool]
where the string never has the newline and the bool states whether it had one? If you do not care, you can ignore the bool; and, if you do care, you have a nice handy bool to check instead of having to inspect the string each time.
@jbduncan, many people do care if a file ends in a newline. People who don't can strip the \n easily enough; if Lines doesn't include the \n, people who care cannot get it back.
@rsc There is some percentage of usages of such a mechanism that require the newline. If it were majority newline or even 50/50 this would be a fine API. I don't know what the actual percentages are for needing \n but I would guess it's much less than 50%. I wouldn't mind if there were a pair of iterators for newline and sans-newline or if there's just the Seq2[string, bool]
version I proposed above. The rest of the proposal is solid.
As existing precedence, the bufio.Reader.ReadLine
method excludes the newline.
Personally, I've wanted newlines excluded a majority of the time, but have wanted it occasionally.
People who don't can strip the \n easily enough; if Lines doesn't include the \n, people who care cannot get it back.
It seems to me that if I wanted to include newlines, I should just use strings.SplitAfterSeq
. It does mean that I'll need to deal with the final ""
, but that seems pretty minor. It seems that there's ergonomic utility in just excluding the newline for Lines
.
I think the choice matters mostly because we don't have #61898 (or anything like it). It's easy to do a s = strings.TrimSpace(s)
in a loop, but if what you want is something to feed to another function, you'd have to manually implement something like xiter.Map
, which is less easy.
Which, to me, also seems to imply that @rsc is correct at least in the long-run. At some point, you can write xiter.Map(strings.TrimSpace, strings.Lines(s))
, which seems like an okay price to pay for the more general API.
TrimSpace
is not the correct thing to do here in the majority of uses. In general you'd want to apply a function that either trims all right whitespace or just trims line endings.
I said before that returning the \n means you don't have to document the \r\n case, but now I lean to the side that Lines should be like bufio.Scanner and strip \r\n and not return a final empty "". People who want the endings and the final blank can use strings.SplitAfterSeq
as @dsnet suggests.
if Lines doesn't include the \n, people who care cannot get it back.
Lines
is basically just a more convenient Split[After]Seq anyway, so I think it should aim for maximum convenience, not maximum flexibility.
If @earthboundkid's suggestion isn't palatable, then I still believe that a newline-stripping alternative should be documented in Lines
' docs, like my previous suggestion FieldsFuncSeq("abc\n", func(c rune) bool { return c == '\n' || c == '\r' })
.
No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group
The proposal is to add these to both bytes and strings:
// Lines returns an iterator over the newline-terminated lines in the string s.
// The lines yielded by the iterator include their terminating newlines.
// If s is empty, the iterator yields no lines at all.
// If s does not end in a newline, the final yielded line will not end in a newline.
func Lines(s string) iter.Seq[string]
// SplitSeq returns an iterator over all substrings of s separated by sep.
// The iterator yields the same strings that would be returned by Split(s, sep),
// but without constructing the slice.
func SplitSeq(s, sep string) iter.Seq[string]
// SplitAfterSeq returns an iterator over substrings of s split after each instance of sep.
// The iterator yields the same strings that would be returned by SplitAfter(s, sep),
// but without constructing the slice.
func SplitAfterSeq(s, sep string) iter.Seq[string]
// FieldsSeq returns an iterator over substrings of s split around runs of
// whitespace characters, as defined by unicode.IsSpace.
// The iterator yields the same strings that would be returned by Fields(s),
// but without constructing the slice.
func FieldsSeq(s string) iter.Seq[string]
// FieldsFuncSeq returns an iterator over substrings of s split around runs of
// Unicode code points satisfying f(c).
// The iterator yields the same strings that would be returned by FieldsFunc(s),
// but without constructing the slice.
func FieldsFuncSeq(s string, f func(rune) bool) iter.Seq[string]
What's the accepted version of Lines?
What's the accepted version of Lines?
And at the risk of sounding like a broken record, has my suggestion to document a newline-stripping variant of Lines
been considered? Even if the answer is "yes, we've considered it, but we've decided against it", I'd be happy to accept that. :)
What's the accepted version of Lines?
Whats accepted is explicitly documented in the acceptance comment: https://github.com/golang/go/issues/61901#issuecomment-1973619597
// The lines yielded by the iterator include their terminating newlines.
There was a lot of discussion in between the final comment period starting and the actual final decision, so I think it's good to be explicit about the ruling on that.
Just a minor thought on the docs for Lines
:
// Lines returns an iterator over the newline-terminated lines in the string s.
// The lines yielded by the iterator include their terminating newlines.
// If s is empty, the iterator yields no lines at all.
// If s does not end in a newline, the final yielded line will not end in a newline.
If s
does not contain any newline characters, a strict reading of the first sentence seems to imply I will not iterate over anything:
// Lines returns an iterator over the newline-terminated lines in the string s.
Because in that case there are no newline-terminated lines in the string s
.
Similarly the last sentence does not quite sit right for me:
// If s does not end in a newline, the final yielded line will not end in a newline.
Because if it doesn't end in a newline, it surely falls outside the definition of "an iterator over the newline-terminated lines".
All that said, I haven't managed to conjure up anything better! Because capturing the edge cases in pithy documentation here seems particularly hard.
The accepted version of Lines is what is in the accept message: the newlines are included in the lines, for the reasons given above. In essentially all cases I can think of, the newline chopping is easily done when parsing the line inside the loop over the lines. As others have noted, it is easy to do s = strings.TrimSuffix(s, "\n").
@dsnet mentioned bufio.Reader.ReadLine, but that doc comment says "ReadLine is a low-level line-reading primitive. Most callers should use Reader.ReadBytes('\n') or Reader.ReadString('\n') instead or use a Scanner." ReadBytes and ReadString do include the delimiter, while Scanner by default does not. The standard library is as split as this discussion, but the basic primitives ReadBytes and ReadString preserve the \n, while the higher-level, more configurable API defaults to removing them but is easily reconfigured (with a different split function).
My general rule for parsing functionality is to avoid throwing away information, which is the justification for keeping the \n.
Change https://go.dev/cl/587095 mentions this issue: bytes, strings: add Lines, SplitSeq, SplitAfterSeq, FieldsSeq, FieldsFuncSeq
I'm late to the party, but I noticed that the doc comments (at tip) of Lines
, SplitSeq
, and SplitAtfterSeq
(in both the bytes and strings packages) all contain the following statement:
It returns a single-use iterator.
By inspecting the implementation of those functions, I understand why the iterators they return are single-use. For example, the push iterator returned by strings.Lines
is stateful because it closes over that function's string
parameter and mutates it:
func Lines(s string) iter.Seq[string] {
return func(yield func(string) bool) {
for len(s) > 0 {
var line string
if i := IndexByte(s, '\n'); i >= 0 {
line, s = s[:i+1], s[i+1:] // <-------------
} else {
line, s = s, ""
}
if !yield(line) {
return
}
}
return
}
}
However, I'm not sure why the resulting iterators should be single-use. I may have missed something, but I don't think their single-use nature has been discussed in this issue. For convenience, shouldn't iterators be reusable whenever possible/inexpensive?
If desired, making those iterators reusable would be straightforward. For instance, for strings.Lines
:
func Lines(s string) iter.Seq[string] {
return func(yield func(string) bool) {
s := s // <--------- local copy
for len(s) > 0 {
var line string
if i := IndexByte(s, '\n'); i >= 0 {
line, s = s[:i+1], s[i+1:]
} else {
line, s = s, ""
}
if !yield(line) {
return
}
}
return
}
}
We propose to add the following functions to package bytes and package strings, to allow callers to iterate over these results without having to allocate the entire result slice. This text shows only the string package form.
This is one of a collection of proposals updating the standard library for the new 'range over function' feature (#61405). It would only be accepted if that proposal is accepted. See #61897 for a list of related proposals.
Iterating over lines is an incredibly common operation that we’ve resisted adding only because we didn’t want to encourage allocation of a potentially large slice. Iterators provide a way to finally add it.
Iterating over bytes in a string is common and too difficult, since range ranges over runes. This function will inline to the obvious for loop (because we will make sure it does):
Iterating over runes is served by a regular range loop, but like slices.All and maps.All, it could be useful as an input to other iterator adapters. The name is Runes, not Seq or All, so that its clear at call sites what is being iterated over (runes not bytes).
Similar to Lines, there should be iterator forms of Split, Fields, and Runes, to avoid requiring the allocation of a slice when the caller only wants to iterate over the individual results. If we were writing the library from scratch, we might use the names Split, Fields, and Runes for the iterator-returning versions, and code that wanted the full slice could use slices.Collect. But that's not an option here, so we add a distinguishing Seq suffix. We do not expect that new functions will use the Seq suffix. For example the function above is Lines, not LinesSeq.