proposal: fmt: add iterator for format tokens

dsnet commented 3 months ago

Proposal Details

There are use cases for intercepting, detecting, and/or manipulating the format string that eventually gets passed to fmt.Sprintf.

I propose addition of:

// FormatTokens iterates over the tokens in a format string.
// Each token is either string literal without any formatting verb
// or a string with a single formatting verb.
// The former never contains the '%' character,
// while the latter always starts with the '%' character.
func FormatTokens(f string) iter.Seq[string]

Example:

FormatTokens("")         // []
FormatTokens("foo")      // ["foo"]
FormatTokens("foo%")     // ["foo", "%!(NOVERB)"]
FormatTokens("foo%vbar") // ["foo", "%v", "bar"]
FormatTokens("%0.3f%%")  // ["%0.3f", "%%"]

For any valid format string, the concatenation of all the tokens is the format string.

Alternative considerations

The iterator returns values of type string, which can make further analysis of the formatting verb harder.

As an alternatively, we could do:

func FormatTokens(string) iter.Seq[FormatToken]

type FormatToken string

func (FormatToken) IsValid() bool
func (FormatToken) IsLiteral() bool
func (FormatToken) Verb() rune // e.g., 'v', 's', 'f', etc.

Thus, users can call methods on the FormatToken to further drill into information encoded in the token.

I don't think this API complexity justifies the utility as most of those methods can be trivially implemented by the user. For example:

discerning whether a token is a literal or a verb simply requires checking whether it starts with a '%' prefix.
checking for the verb requires just checking the last character
checking whether the token is valid is as simple as rejecting anything that starts with a "%!" prefix.

gabyhelp commented 3 months ago

Related Issues

proposal: fmt: improve ergonomics of fmt.Print-like functions #57816 (closed)

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

jimmyfrasche commented 3 months ago

I've not run into such use cases but I would imagine based on experience with similar things that an AST (or list, in this case) and a printer would either be preferable or not make much of a difference.

The format is fairly simple (I can only recall the syntax changing once and one new directive being added) so this does not seem burdensome to maintain out of std

dsnet commented 3 months ago

The format is fairly simple

Perhaps we have differing definitions of simple, but that wasn't the impression I got. There's ~300 lines of prose explaining the format syntax. I think (but I'm not entirely sure) that one could identify a verb because it always starts with a '%' and ends with a '%' or a letter.

jimmyfrasche commented 3 months ago

I mean that it's structurally simple. There's no recursion, few error cases, and most things of note are single runes.

jimmyfrasche commented 3 months ago

Had to run some tests to verify as the prose is not clear about how all the modifiers mix and match but the full verb format is exactly % flag width .prec [pos] verb where width, prec, and pos are strings of [0-9]+ and everything else is a single rune.

jimmyfrasche commented 3 months ago

Interesting bit of trivia I found poking around with this. You can set flags and such on an escaped % like %+.2[1]%. Go vet catches it as an unrecognized flag (though technically I suppose it should say it as an unrecognized verb) but fmt itself happily treats it as %%.

robpike commented 3 months ago

I argue that #67510 applies here. If no policy is in place, the library will have iterators implemented by a variety of authors with varying goals, with no clear overarching plan and much wasted or duplicated effort, with inconsistent and incomplete results.

Please let's not iteratorify the library before a policy is in place about how such language changes should be rolled out throughout, and what the actual goals should be.

apparentlymart commented 3 months ago

While I certainly agree that the question of when and how iterators should be retrofitted to standard library packages is important to resolve consistently, this proposal seems to be doing two things at once:

Introducing an entirely new capability to package fmt (exposing its tokenizer directly as a public API, rather than it being just an implementation detail of its main formatting functions)
Using an iterator as the representation of the result, thereby making a small contribution to the broad question of where and how the standard library ought to use iterators.

I would suggest that the discussion here ought to prioritize 1 first, since 1 being declined presumably rules out any design with this use-case in mind, making the iterator question irrelevant, whereas declining 2 while accepting 1 leaves the door open to variants such as FormatTokens returning []string instead. (Of course, both would need to be accepted before this proposal in particular could move forward.)

apparentlymart commented 3 months ago

Reacting then to my own previous comment :grinning: I wonder:

Are there any other languages whose standard library formatting functionality -- whether printf-like or otherwise -- exposes its tokenizer in a standalone fashion similar to this?

While I can certainly imagine some situations where I might want to do this, they all feel rather esoteric. I'm curious to see if other languages decided this was something worth exposing, and hopefully also then find some examples of real existing uses of it that could demonstrate the value of exposing this API.

Exposing the tokenizer directly in this way implies constraining future evolution of the fmt mini-language only in ways that wouldn't create ambiguity for a user of this API, but admittedly I'm not really sure how to evaluate that cost to compare it with the potential benefit: the fmt mini language is already constrained to evolve only in ways that wouldn't change the meaning of existing valid input, and so I could certainly see the argument that exposing the tokenizer doesn't constrain the language evolution any more than it is already constrained.

dsnet commented 3 months ago

Are there any other languages whose standard library formatting functionality

I'm not sure of the answer to this question, but I would argue that Go has somewhat set the precedence for strong static analysis within the language ecosystem itself that other language are starting to copy what Go does. I've needed this function in the construction of static analysis tools and also as a mechanism to audit security (e.g., verifying that printf functionality is used in a safe and secure way).

apparentlymart commented 3 months ago

That is a very fair point, @dsnet.

In particular your answer got me wondering if the proposed API could replace the inline parser used (indirectly) by vet.

I'm not very familiar with that code so I'm definitely not the best person to make that call, but from a superficial skim it seems to me like it would be able to do the same things it's already doing using the tokens from the proposed fmt.FormatTokens.

leaxoy commented 3 months ago

To be honest, I haven't seen the motivation and purpose of such a design at all. What should I do after parsing out the format token?

AndrewHarrisSPU commented 2 months ago

Are there any other languages whose standard library formatting functionality -- whether printf-like or otherwise -- exposes its tokenizer in a standalone fashion similar to this?

https://docs.python.org/3/library/string.html#string.Formatter.parse is an example.

It’s called in the common use case of token substitution. So, it could be monkey patched, which seems like the opposite of the utility this issue wants to realize.
It emits (literal text, substitution token, format spec, conversion) tuples. That the parser step is more than one token seems interesting.

I can think of a number of places where the standard library offers some runtime utility for strings that break the code/data barrier. Is there one where the interior parsing is exposed this way?

earthboundkid commented 2 months ago

JavaScript has String.raw, that lets you rebuild a tagged literal. Not quite the same as this, but it can be used for somewhat similar ends.

timothy-king commented 2 months ago

In particular your answer got me wondering if the proposed API could replace the inline parser used (indirectly) by vet.

My first reaction to the proposal was "Oh I would be able to replace the directive parser in analysis/printf. That would be nice". But it is also the case that anything that standardizes the token sequence would work.

There are use cases for intercepting, detecting, and/or manipulating the format string that eventually gets passed to fmt.Sprintf.

@dsnet Do we have motivation beyond static analysis and/or code rewriting? Both of those are already okay motivations. Just wondering if there are more.

FormatTokens("foo%") // ["foo", "%!(NOVERB)"]

Thinking about how to suggest code edits from these for a second. Suppose I want to suggest that "foo%" is replaced with "foo". How do I turn ["foo", "%!(NOVERB)"] into that edit? Feels like I need to understand "%!(NOVERB)" is exactly "%". (That case might be obvious, but I am less sure about other issues.)

So it feels like the error cases are a bit underspecified at the moment. The proposal should probably be clear on what the special error values look like so we know how to edit them. What about instead returning a sequence of pairs of [string, error] where string is the raw substring and error is non-nil on improperly formatted verbs? This gives a nice way to map back to the original in a guaranteed manner (concatenate and quote).

dsnet commented 2 months ago

I'm not sure if we can specify all the error cases without the API being too complicated.

That said, the iterated tokens exactly match the input if valid, so you could at least determine exactly where something went wrong with:

var i int
for tok := range fmt.FormatTokens(f) {
    if strings.HasPrefix(tok, "%!" {
        log.Printf("invalid format token after %s", f[:i])
        break
    }
    i += len(tok)
}

We could also have this be iter.Seq2[int, string] where it returns the starting index of token in the format string. This would be somewhat similar to how iterating over the runes of a string returns the rune and starting index of the rune.

jimmyfrasche commented 2 months ago

I wrote a third party parser for this over the weekend and learned more about it than I thought possible (still cleaning it up and having the fuzzer be mean to me so no link yet).

There are only 4 syntax errors—BADWIDTH, BADPREC, BADINDEX, and NOVERB—everything else is a runtime error. (Though, tbf, some of those errors can be triggered in multiple ways).

For example, "%!" is a syntactically legal verb even though there is not a "!" action to dispatch.

apparentlymart commented 2 months ago

Although earlier I'd argued that the question of whether to expose something like this at all was probably more important than the exact representation of the result, the idea of returning iter.Seq2[int, string] where the integer is a byte offset and the string is a sub-token does seem like it ought to be considered in the context of whether it should become a standard way to represent bytestream tokenization elsewhere in stdlib.

For example, I could imagine bufio.Scanner offering a new Tokens method that returns iter.Seq2[int, []byte] and a TokenStrings method that returns iter.Seq2[int, string]. If something as general as bufio.Scanner used it then it would be a no-brainer for more specialized situations like fmt.FormatTokens to follow it too for consistency, but it would be less obviously right if only fmt.FormatTokens adopted it. Maybe a separate general proposal about tokenization using iterators -- using bufio.Scanner as the primary motivation, but encouraging it elsewhere too -- would help justify this design detail in https://github.com/golang/go/issues/67510 terms.

I do quite like it as a general convention, though! Retaining the context about where the substring came from along with the substring itself means that it'd be trivial to filter the iterator without losing that context, such as a wrapping iterator that skips the literal tokens and only returns the "verb" tokens while retaining the position of each token for use in error messages.

(It's unfortunate that the %! prefix alone isn't sufficient to recognize a token as an error. Is the %!( prefix (with the opening parenthesis included) sufficient?)

jimmyfrasche commented 2 months ago

In %!(, %! is a verb and ( is a regular string.

Even if there were no errors involved, I really cannot imagine how just splitting strings up like this would be generally useful. It may be sufficient for some cases but as soon as you want more information from the verb you end up having to redo a lot of the work that had to be done to split it up in the first place and that's the hard part.

jimmyfrasche commented 2 months ago

Found another weird case:

%[][][]0 - is bad index with [ as the verb
%[][]0 - bad index with 0 as the verb
%[]0 - no verb (???)

robpike commented 2 months ago

The cost of adding this seems highly disproportionate to the value it would provide.

rsc commented 2 months ago

I don't believe an iterator is appropriate here. Iterators are for unbounded iterators. For something this simple, a slice seems fine. Better, in fact, since a slice has the benefit that index i of the slice corresponds to index i of the arguments. So let's narrow the conversation to one of these:

func FormatTokens(f string) []string
func FormatTokens(string) []FormatToken

With the iterator complexity removed, is either of these now worthwhile? This at least lets us focus on the functionality.

Personally I am not convinced.

The proposal does not attempt to justify the functionality at all. It merely says "There are use cases ...". What are the use cases? Are they widespread? Do they merit adding complexity to the standard library as opposed to just making a copy of the relevant code? The lack of answers to those questions is probably why I'm not convinced. :-)

aktau commented 2 months ago

Another use case that came up: splitting logs in literal and dynamic parts.

For example log APIs tend to allow formatting, e.g.:

log.Infof("the result of calc(%q) is %d", args, 5)

We could propose a contract with the user that all of the non-verb parts are literals, which are viewable by anyone. The dynamic parts (meaning any evaluated verb) would be viewable only by those with proper permissions, as it might contain sensitive information.

Of course it would be possible for users to add sensitive parts directly to the format string:

log.Infof("the result of calc(%q) = " + str, args)

Which is why it needs to be a contract (and it may make sense to make a custom analyzer warning about this with the given API).

rsc commented 2 months ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

dsnet commented 2 months ago

I've needed something like this over the years, but the most recent use case for this was something similar to what @aktau said.

In my particular application, I was intercepting a formatting string for constructing HTML. In the verbiage of @aktau, the literal components were preserved as-is (i.e., without any HTML escaping), while the dynamic components were escaped (e.g., converting < to be <). In order to do this, I needed to identify which components were "literals" or "dynamic".

The fact that this was executing upon each Printf-like call, having the API return a []string would be an unfortunate hit to performance. An iterator avoids any allocations and users that want to deal with a []string can use slices.Collect.

aktau commented 2 months ago

The fact that this was executing upon each Printf-like call, having the API return a []string would be an unfortunate hit to performance. An iterator avoids any allocations and users that want to deal with a []string can use slices.Collect.

This could also be mitigated by accepting a backing slice (like io.CopyBuffer), though at that point I think the iterator approach is cleaner too.

robpike commented 2 months ago

Not every problem one finds need to be solved in the core library, and in this case the problem is easily done outside, since it's just a matter of parsing the format string, which is easy enough.

rsc commented 2 months ago

Here is a cut at parsing the format string without package fmt:

// Cut finds and isolates the next formatting directive in the format string.
// It returns the text before the directive (which is literal text),
// the directive itself, and the text after the directive (which may contain more directives).
func Cut(format string) (before, directive, after string) {
    i := 0
    for i < len(format) && format[i] != '%' {
        i++
    }
    before, v := format[:i], format[i:]
    i = 1
    for i < len(v) && (v[i] == '#' || v[i] == '0' || v[i] == '+' || v[i] == '-' || v[i] == ' ') {
        i++
    }
    if i < len(v) && v[i] == '[' {
        j := i+1
        for j < len(v) && '0' <= v[j] && v[j] <= '9' {
            j++
        }
        if j > i+1 && j < len(v) && v[j] == ']' {
            i = j+1
        }
    }
    if i < len(v) && v[i] == '*' {
        i++
    } else {
        for i < len(v) && '0' <= v[i] && v[i] <= '9' {
            i++
        }
    }
    if i < len(v) && v[i] == '.' {
        i++
        if i < len(v) && v[i] == '*' {
            i++
        } else {
            for i < len(v) && '0' <= v[i] && v[i] <= '9' {
                i++
            }
        }
    }
    if i >= len(v) {
        return format, "", ""
    }
    _, size := utf8.DecodeRuneInString(verb[i:])
    i += size
    return before, v[:i], v[i:]
}

It's definitely a little tricky, and it doesn't impose the same limits on literal numbers that package fmt does, and I'm not sure I would call it "easy enough". But I also don't see how it's useful.

Suppose it works. Then what?

I still don't quite understand what @aktau or @dsnet would do with this code. You can't even assume that %d is safe if the argument is not an int, so you need to do some kind of substitution of arguments, probably by formatting each argument separately, putting the results into the args slice, replacing all the directives with %s, and then invoking a top-level formatting (or bypassing the top-level formatting and just collecting all the literals and escaped texts into a bytes.Buffer or strings.Builder). But formatting each argument value separately is not correct for formats like %[2]s. I don't see how having this information actually helps produce sanitized outputs. I can see that it seems like it would help, but I can't connect the dots for a perfect drop-in fix.

And if you're not aiming for perfection, it seems simpler to ignore the format entirely, iterate over the args slice, leave basic integers alone, and replace all other values v with html.EscapeString(fmt.Sprint(v)). That will get %[2]s right. It won't get %-10s right, nor will it get %10d right if the value v has a custom formatter that implements the d verb. But we're not aiming for perfection at this point.

I'm not strictly opposed to the API, but it would need to be useful. I still don't see how it's useful.

znkr commented 2 months ago

I agree, it could only be a part of a solution. What we really want for the use case that @aktau described is a way to understand at what position which arguments are rendered. Let's say we have this code

customlog.Infof("data for user %v not found %[2]T (%v[2])", userId, err),

We want to be able to redact this data to

data for user <redacted> not found <redacted> (<redacted>)

This needs to work across different languages and we would like to be able to show values that are explicitly marked as non-sensitive in the future. The idea is to log the message as a data structure like this (not exactly this, but just to give an idea):

type Sensitivity int

const (
  Unknown      Sensitivity = iota
  CodeLiteral
)

type Message struct {
   Sensitivity Sensitivity
   Value       string
}

func EmitLog(msg []Message)

By understanding the format string, we could separate verbs and literals and format each verb individually. That's not the best solution, but it would work. One idea for a better solution would be an API that allows us to understand how arguments are rendered in the output. Here is an API sketch:

type Info struct {
  Pos, End int  // Start and end position
  Source   int  // The source argument (0 for the format string, 1 for the first argument, ...)
}

func Xprintf(format string, a ..any) (string, []Info)

With that, we could split up the result and tread it however we like.

rsc commented 1 month ago

I still don't quite understand what @aktau or @dsnet would do with this code. You can't even assume that %d is safe if the argument is not an int, so you need to do some kind of substitution of arguments, probably by formatting each argument separately, putting the results into the args slice, replacing all the directives with %s, and then invoking a top-level formatting (or bypassing the top-level formatting and just collecting all the literals and escaped texts into a bytes.Buffer or strings.Builder). But formatting each argument value separately is not correct for formats like %[2]s. I don't see how having this information actually helps produce sanitized outputs. I can see that it seems like it would help, but I can't connect the dots for a perfect drop-in fix.

@adonovan points out that you could wrap every argument in a custom wrapper type that implements fmt.Formatter and then returns the escaped HTML for each of those. That would work, and it wouldn't require this new function.

It looks less and less like we should add this.

dsnet commented 1 month ago

My most recent use case would use this function with something like:

// UnsafeRawf constructs raw HTML text according to a format specifier.
// HTML in f are preserved verbatim, but text produced by % verbs are escaped.
// The only exception is %v or %s formatting a value of the [UnsafeRaw] type,
// which is preserved verbatim.
//
// Example usage:
//
//  firstName := "Malfoy"
//  lastName := "<script>...</script>"
//  UnsafeRawf("%s<br>%s", firstName, lastName) // "Malfoy<br>&lt;script&gt;...&lt;/script&gt;"
func UnsafeRawf(f string, a ...any) UnsafeRaw {
    var sb strings.Builder
    fs := format(f)
    var argIdx int
    var hadIndexedArgs bool
    for t := fs.nextToken(); t != ""; t = fs.nextToken() {
        verb := t[len(t)-1]
        switch {
        case t[0] != '%': // string literal
            sb.WriteString(t)
            continue
        case verb == '%': // escaped '%'
            sb.WriteByte('%')
            continue
        case !strings.ContainsAny(t, "[]"): // verb with positional arguments
            if argIdx < len(a) {
                if raw, isRaw := a[argIdx].(UnsafeRaw); isRaw && (verb == 'v' || verb == 's') {
                    sb.WriteString(string(raw))
                    break
                }
                sb.WriteString(EscapeString(fmt.Sprintf(t, a[argIdx])))
                break
            }
            sb.WriteString(EscapeString(fmt.Sprintf(t))) // deliberate missing args
        default: // verb indexed arguments
            hadIndexedArgs = true
            if strings.HasPrefix(t, "%[") && (strings.HasSuffix(t, "]v") || strings.HasSuffix(t, "]s")) {
                n, err := strconv.ParseUint(t[len("%["):len(t)-len("]v")], 10, 64)
                if err == nil && uint(n-1) < uint(len(a)) {
                    if raw, isRaw := a[n-1].(UnsafeRaw); isRaw {
                        sb.WriteString(string(raw))
                        break
                    }
                }
            }
            sb.WriteString(EscapeString(fmt.Sprintf(t, a...)))
        }
        argIdx++
    }
    return UnsafeRaw(sb.String())
}

Essentially all arguments that are not a "%v" or "%s" verb on the UnsafeRaw type gets HTML escaped.

I hadn't considered the fmt.Formatter wrapping approach, I'll look into it.

willfaught commented 1 month ago

In my particular application, I was intercepting a formatting string for constructing HTML. In the verbiage of @aktau, the literal components were preserved as-is (i.e., without any HTML escaping), while the dynamic components were escaped (e.g., converting < to be <). In order to do this, I needed to identify which components were "literals" or "dynamic".

@dsnet Why couldn't you intercept the print args as a []any before passing them to a formatter?

adonovan commented 1 month ago

Here's a sketch of what I had in mind:

(Choosing a separator that doesn't appear within the format string is left as an exercise.)

https://go.dev/play/p/xtSw9c96ha9

// Wrappers demonstrates how to intercept printf using fmt.Formatter.
package main

import (
    "fmt"
    "io"
    "strings"
)

func main() {
    analyzef("hello %s, %-2.3d\n", "world", 123)
    // Prints: "hello " ("%s" world) ", " ("%-2.3d" 123) "\n"
}

// analyze prints the components of the printf formatting operation.
func analyzef(format string, args ...any) {
    for _, part := range Parsef(format, args...) {
        switch part := part.(type) {
        case string:
            fmt.Printf("%q ", part)
        case Conversion:
            fmt.Printf("(%q %v) ", part, part.Arg)
        }
    }
    fmt.Println()
}

// A Conversion represents a single % operation and its operand.
type Conversion struct {
    Verb  rune
    Width int    // or -1
    Prec  int    // or -1
    Flag  string // some of "-+# 0"
    Arg   any
}

// String prints the conversion in its original form (e.g. "%-2.3d").
func (conv Conversion) String() string {
    var out strings.Builder
    fmt.Fprintf(&out, "%%%s", conv.Flag)
    if conv.Width >= 0 {
        fmt.Fprintf(&out, "%d", conv.Width)
    }
    if conv.Prec >= 0 {
        fmt.Fprintf(&out, ".%d", conv.Prec)
    }
    fmt.Fprintf(&out, "%c", conv.Verb)
    return out.String()
}

// Parsef breaks down a printf formatting operation into literal
// (string) and Conversion components.
func Parsef(format string, args ...any) []any {
    const sep = "!!" // FIXME: choose a separator that doesn't appear within 'format'

    var convs []Conversion
    wrappers := make([]any, len(args))
    for i, arg := range args {
        wrappers[i] = formatFunc(func(st fmt.State, verb rune) {
            io.WriteString(st, sep)

            wid, ok := st.Width()
            if !ok {
                wid = -1
            }
            prec, ok := st.Precision()
            if !ok {
                prec = -1
            }
            flag := ""
            for _, b := range "-+# 0" {
                if st.Flag(int(b)) {
                    flag += string(b)
                }
            }
            convs = append(convs, Conversion{
                Verb:  verb,
                Width: wid,
                Prec:  prec,
                Flag:  flag,
                Arg:   arg,
            })
        })
    }
    s := fmt.Sprintf(format, wrappers...)

    // Interleave the literals and the conversions.
    var result []any
    for i, word := range strings.Split(s, sep) {
        if word != ""
            result = append(result, word)
        }
        if i < len(convs) {
            result = append(result, convs[i])
        }
    }
    return result
}

type formatFunc func(fmt.State, rune)

var _ fmt.Formatter = formatFunc(nil)

func (f formatFunc) Format(st fmt.State, verb rune) { f(st, verb) }

adonovan commented 1 month ago

Hmm... I didn't test this on "%*s", which will try to pop an integer width argument and be disappointed by a formatFunc. There may be a way to parse the BADWIDTH/BADPREC output strings to reconstruct the input, but at that point it would be simpler to go with @rsc's approach.

jimmyfrasche commented 1 month ago

If this is for purely runtime wrapping maybe there needs to be something tailored to that like:

func Wrappingf(w io.Writer, format string, args []any, visit func(state State, verb rune, arg any))

that could handle cases like * and just call visit for "real" args.

It would still be somewhat awkward to use:

fmt.Wrappingf(w, format, args, func(state fmt.State, verb rune, arg any) {
    fmt.Fprintf(state, fmt.FormatString(state, verb), arg)
})

but it would be doable.

rsc commented 1 month ago

If you are trying to defend against injection attacks, Alan's solution works as long as you just don't wrap int arguments. There are no injection attacks coming in from '3' anyway.

rsc commented 1 month ago

Based on the discussion above, this proposal seems like a likely decline. — rsc for the proposal review group

rsc commented 1 month ago

No change in consensus, so declined. — rsc for the proposal review group

golang / go

proposal: fmt: add iterator for format tokens #68279

Proposal Details

Alternative considerations