proposal: crypto/rand: add func Text

FiloSottile commented 2 months ago

Update, Jun 27 2024: The current proposal is https://github.com/golang/go/issues/67057#issuecomment-2161221119

Random strings are useful as passwords, bearer tokens, and 2FA codes.

Generating them without bias from crypto/rand is not trivial at all, and applications are lured into using math/rand for it. See https://github.com/greenpau/caddy-security/issues/265 for example.

I propose we add a top-level function that takes a charset and returns a string of elements randomly selected from it.

The length of the string is selected automatically to provide 128 bits of security. If uppercase and lowercase letters and digits are used, a string will be ceil(log62(2^128)) = 22 characters long.

If more than 2^48 strings are generated, the chance of collision becomes higher than 2^32, but that's also true of UUID v4. Callers that reach those scales could call String twice and concatenate the results, but it doesn't feel worth documenting.

There is no error return value on the assumption that #66821 is accepted. That allows convenient slicing if necessary.

// String returns a random sequence of characters from the supplied alphabet.
//
// The length of the returned string is selected to provide 128 bits of entropy,
// enough to make a brute-force attack infeasible. If a shorter string is
// needed, for example as a one-time token, the caller may truncate the result.
//
// The alphabet is interpreted as a sequence of runes, and must contain at least
// two Unicode characters, or String will panic.
func String(alphabet string) string

This can't really be implemented in constant time, but since it always runs randomly, attackers can get only a single timing sample, which limits the maximum theoretical leak to a few bits.

Do we already have constants for the most common charsets defined somewhere?

/cc @golang/security @golang/proposal-review

robpike commented 2 months ago

Is there a reason to set the entropy to a hidden value rather than provide it as a parameter (other than Fillipo knows best)? What if I need more randomness tomorrow?

The idea is cool but it seems too rigid to me, but I am not a security maven.

FiloSottile commented 2 months ago

I started with taking an int parameter for the entropy, then started going back and forth on whether it should be the length of the returned string (what if the caller does the math wrong, or does it right but later changes the alphabet to be smaller thinking the parameter is the entropy and doesn't need to change?) or the bits of entropy (what if the caller thinks it's the length of the string and passes a way too small number?).

This is one of those cases where we can do the work for the user and do the secure thing directly. So yeah, we can know what the best answer is, so we can save callers the work and risk.

If a caller knows they are fine with less entropy (e.g. for a PAKE or 2FA token), they can slice the string. The performance overhead is unlikely to be a bottleneck, and the slicing can alert a security reviewer to pay attention.

There's no need for more entropy than 128 bits against brute force attacks. If making hundreds of thousands of billions of strings users might need more entropy against collisions, but they will probably have someone on the team that knows that by the time they design such a system, and they can just call String() + String(). UUIDv4 made the same assumption and seems fine.

ulikunitz commented 2 months ago

I find the String() function with a parameter strange because the String() method doesn't usually have parameters.

The approach I usually apply is to use crypto/rand.Reader to get 16 random bytes and converting it with base64.RawURLEncoding.EncodeToString(). The resulting string reflects all the bits in the random value. Using this approach a RandomStringer with a String() method is quite easy to implement. crypto/rand could have a String() function that uses an internal default RandomStringer value. IMHO this approach would be more efficient than a String function that has to validate the alphabet parameter at every call.

jfrech commented 2 months ago

Would String("aab") exhibit a bias?

FiloSottile commented 2 months ago

@ulikunitz Efficiency is not a primary concern here, I don't think applications generate passwords or tokens in a hot loop, at least not hot enough that checking that a string is valid UTF-8 will matter.

Would String("aab") exhibit a bias?

Yes, a would be selected twice as often as b. We should document that.

ulikunitz commented 2 months ago

Is String("aaa") a valid use? What about String("")? Both strings are valid UTF-8. Josua Bloch stated an API should be hard to misuse.

This API can be misused to generate always the same string or biased strings.

The following API doesn't have these weaknesses:

func NewAlphabet(s string) (ab *Alphabet, err error)

func (ab *Alphabet) RandomString() string

const Base32DouglasCrockford = "0123456789ABCDEFGHJKMNPQRSTVWXZ"

// String generates a random string using Douglas Crockford's
// [base32] alphabet.
// [base32]: https://www.crockford.com/base32.html
func String() string

The alphabet will produce slightly larger strings, but according to Crockford is a good compromise between compactness and error resistance. I named the method RandomString() to be explicit about what the method does. The String() function is ok, because it will be used as rand.String() in almost all cases.

I assume that the String function will not support Unicode combining characters.

Some people might want to use the function to create keys for databases. They might be concerned about performance.

ericlagergren commented 2 months ago

I like the proposal, but I'm also concerned about the alphabet just being a plain string. Inevitably somebody will read the alphabet in from a config file or basically anything other than const MyAlphabet = "..." (like attacker controlled data!).

jfrech commented 2 months ago

@ulikunitz

Is String("aaa") a valid use? What about String("")? Both strings are valid UTF-8.

Please refer to the proposed function's documentation's last paragraph (where "two" is sensibly read as "two different"):

// (...)
// The alphabet is interpreted as a sequence of runes, and must contain at least
// two Unicode characters, or String will panic.
func String(alphabet string) string

I urge you not to dilute the expressive power of the standard library by inventing novel concepts (such as *Alphabet) where plain strings suffice.

ulikunitz commented 2 months ago

I apologize for overseeing the panic condition. I suggest however to check for the runes in the string to be unique or as my math teacher used to say pairwise different.

I did an experimental implementation and testing for uniqueness is not dramatically slower than testing for at least two different runes for short alphabets. And with 2 ms per call an Alphabet type is not required.

goos: linux
goarch: amd64
pkg: github.com/ulikunitz/randstr
cpu: AMD Ryzen 7 3700X 8-Core Processor             
BenchmarkString1/base10-16            896253          2265 ns/op
BenchmarkString1/base26-16            782935          1845 ns/op
BenchmarkString1/base36-16            490089          2165 ns/op
BenchmarkString1/base62-16            466987          2162 ns/op
BenchmarkString2/base10-16            591403          2227 ns/op
BenchmarkString2/base26-16            631990          1747 ns/op
BenchmarkString2/base36-16            944583          2174 ns/op
BenchmarkString2/base62-16            488211          2238 ns/op

String1 tests for at least two different runes. String2 checks for a uniqueness.

The implementation can be found here: https://github.com/ulikunitz/randstr/blob/main/string.go

(Updated the comment, had a problem with the computation of the number of runes required.)

ulikunitz commented 2 months ago

After review I observed that I needed more than 128 random bits to ensure that the last character is selected from the full alphabet. The benchmark results are now:

goos: linux
goarch: amd64
pkg: github.com/ulikunitz/randstr
cpu: AMD Ryzen 7 3700X 8-Core Processor             
BenchmarkString1/base10-16                668701              2355 ns/op
BenchmarkString1/base26-16                554954              2759 ns/op
BenchmarkString1/base36-16                526098              2544 ns/op
BenchmarkString1/base62-16                634251              2497 ns/op
BenchmarkString2/base10-16                369050              2782 ns/op
BenchmarkString2/base26-16                552849              2251 ns/op
BenchmarkString2/base36-16                426348              2476 ns/op
BenchmarkString2/base62-16                423656              2515 ns/op

AlexanderYastrebov commented 2 months ago

Previous discussion https://github.com/golang/go/issues/53447

rsc commented 2 months ago

I go back and forth on the name String. I wonder if 'Text' is better.

rand.Text("abcdef")

somehow seems clearer than

rand.String("abcdef")

I agree that 128 bits of entropy is the right amount and unlikely to need to change.

Is the idea that the implementation will always just copy the alphabet into a []rune (or as an optimization maybe a []byte) and then index the right number of times, and then discard that copy?

FiloSottile commented 2 months ago

I go back and forth on the name String. I wonder if 'Text' is better.

I like rand.Text but I wonder if I would think it's more like Lorem Ipsum reading the docs the first time.

Is the idea that the implementation will always just copy the alphabet into a []rune (or as an optimization maybe a []byte) and then index the right number of times, and then discard that copy?

Correct.

As for alphabet validation, if we wanted to be strict, we could disallow repeated characters, as well as Unicode joiner characters, to avoid someone putting in a multi-rune character and being surprised when they are selected separately.

I am a little worried about applications letting attacker-controlled alphabets in and causing panics. A solution would be taking a page out of safeweb, and defining a private string type, so that only string constants and literals are allowed.

type constString string

func String(alphabet constString) string

Not sure why applications would take alphabets as a parameter, and we could just add a line to the docs recommending against it.

rsc commented 2 months ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

mateusz834 commented 2 months ago

Same as in https://github.com/golang/go/issues/66821#issuecomment-2101849949, String() should not be documented that it uses rand.Reader and should not use rand.Reader directly, because it can be replaced, it is a var and errors might get ignored/it might throw.

ulikunitz commented 2 months ago

Will the function ensure that every character of the returned string will be selected from the whole alphabet? If yes than in some cases more than 128 bits of entropy will be required. For example an alphabet of 32 characters will require 26*5 = 130 bits.

It should be a requirement to ensure that all same-length substrings of the generated string represent the same amount of entropy.

jfrech commented 2 months ago

I am probably bikeshedding, but I would welcome a name akin to Token128. Then the paranoid could define their own Token256 and one doesn't burn a name committing to one entropy amount. Possibly I am just squeamish in light of SHAttered.

AlexanderYastrebov commented 1 month ago

If there would be a digit alphabet-based number formatter (say strconv.Format) this could be expressed like:

strconv.Format(rand.Int64(), alphabet) + strconv.Format(rand.Int64(), alphabet)

nightlyone commented 1 month ago

The real requirements of such strings are usually a bit more evolved in that certain parts of the alphabet must actually be present in the output.

The classic "it must contain at least one from 0-9, a-z and a symbol from a predefined symbol set like _-+:" has a different API:

StringN(min, max int, alphabets ...string) (string, error)

which can cover more use cases as well as reduce the amount of abuse by requiring explicit error handling for any of the more interesting cases instead of requiring clever choices of defaults.

Also the real world constraints of valid character ranges and constantly changing min/max constraints on what is a secure password/token means without requiring code changes as this can now be configured.

Last but not least one could measure the amount of bits used by providing a companion function that sets min/max to the same value.

If min/max is too much config, then

StringN(n int, alphabets... string) (string, error)

might be more suitable. Here the bits used could even be returned additionally to facilitate tuning N given those alphabets to match a security target for bits of randomness used.

bjorndm commented 1 month ago

Like @nightlyone states for corporate policies not only an alphabet but also a minimum and maximum length are required. And offer there is a policy of "it should include this character but only once". While those policies are debatable, they are also often non negotiable.

Something like rand.GenerateString(alphabet, once string, min, max int) (string, error) seems like the best suited for this. The characters in once are guaranteed to be in the generated string, but only once.

FiloSottile commented 1 month ago

I don't think compatibility with arbitrary password policies is a goal here. As you state, they can be complex and arbitrary. Good password managers can write their own code to handle this (and usually are not written in Go because they run either in the browser or as native applications).

This is for Go applications that need a random string as an access token, a key, a session ID, and similar settings. Maybe also a password, if they control the policy, but that's secondary.

bjorndm commented 1 month ago

Actually the requirement is often that the Go backend must generate a random password that matches such a policy. There are now maybe a dozen of open source Go libraries of various quality of that allow this. This reminds me of the situation with structured logging. There is a clear need for such a random password generator and that is why I think it passes the bar of x_in_std.

Rican7 commented 1 month ago

This is an interesting addition. Of note, PHP 8.3 recently added a similar function to it's standard library, though it looks like this:

public Random\Randomizer::getBytesFromString(string $string, int $length): string

(The RFC explaining the addition is here.)

I do think the length parameter is nice, and at first I echoed @robpike's sentiment, but I also understand your points, @FiloSottile, for the "we can do the work for the user and do the secure thing directly" bit.

I do think the alphabet bit is important, and documentation could be provided to strongly suggest that devs not take the alphabet as a user-input parameter, but yea without the alphabet bit I likely wouldn't use it myself.

An argument could be made for a more general function (akin to the new PHP one) that is then "aliased" of sorts to a strong default... there's prior art for this kind of thing in the standard library:

As far as this question goes:

Do we already have constants for the most common charsets defined somewhere?

Sort of, but not really...

bjorndm commented 1 month ago

The unicode package maybe?

rsc commented 3 weeks ago

@FiloSottile and I talked a bit about this and we suggest simplifying further:

// Text returns a random string over from the standard RFC 4648 base32 alphabet // for use when a secret string, token, password, or other text is needed. // The result contains 128 bits of randomness, enough to prevent brute force // guessing attacks and to make the likelihood of collisions vanishingly small. // A future version may return longer texts as needed to maintain those properties. func Text() string

In practice neither of us believe that any future extension will ever be necessary: 2^128 is a very large number.

For a brute force attack running at 1 guess per nanosecond, 2^128 ns is 10^24 years. Even at a billion guesses per nanosecond that's still 10^15 years.

A fixed collision can only happen with probability 1/2^128 = 1/10^38. A collision in a set of N Texts happens with probability roughly N^2/2^129, so if you generate fewer than 2^32 such texts, the probability of any pair colliding remains vanishingly small: < 1/2^65.

The math here does not change significantly as computers get faster, which is why we believe 2^128 will remain the answer approximately forever. But if something did change, Text reserves the right to start returning longer results.

Using a hard-coded alphabet avoids all the complexity of defining bad alphabets, Unicode, and so on. The base32 alphabet is avoids mixed-case, avoids punctuation, and is reasonably compact: 128 bits encode in 26 characters, compared to 32 for hex and 22 for base64. (The specific alphabet is base32.StdEncoding = ABCDEFGHIJKLMNOPQRSTUVWXYZ234567.)

Arguably we could improve upon the exact 32 characters, as Crockford attempts to do, but using a standard seems a more important benefit than what we'd get by designing our own. The alphabet is upper-case, so if the texts do have to be typed manually, there is no ambiguity between I and L (note that 0 and 1 are not in the alphabet at all).

ulikunitz commented 3 weeks ago

I support the proposal and it removes the whole problem of selecting the alphabet. There is one question: 26 base32 characters can represent 130 bits. Shouldn't the function generate 130 random bits to avoid that one or two character are not selected from the whole base32 alphabet?

Rican7 commented 3 weeks ago

@rsc @FiloSottile I really do think the standardized length is a good idea. 128-bits is definitely secure enough for almost every case imaginable, and if less is needed then Go makes it incredibly easy to take a substring length of the result with a slice expression (rand.Text()[:64]), and if more is needed than the calls can just be appended, which is made even easier with the single value return.

I am, myself, less convinced about the hard-coded alphabet, though. Given those constraints, I'm likely less apt to use it in my own projects, and would rather re-implement it for my own projects just to have a different alphabet... which I think defeats much of the purpose.

I still think it might make sense to have a func TextFromString(alphabet string) string that allows a custom alphabet, and then a func Text() string that simply calls TextFromString with a hard-coded alphabet (return TextFromString("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567")), or something similar.

I think doing so would provide an "idiot proof" default, yet also provide a helpful custom path to deter custom, insecure implementations of the same randomization.

robpike commented 3 weeks ago

The mixed alphabet greatly reduces its utility. Perhaps we need a pair of functions, one with a fixed alphabet and one with a user-provided one.

Rican7 commented 3 weeks ago

Ha, I believe we're saying the same thing, @robpike. 😃

bjorndm commented 3 weeks ago

I agree that we need to be able to select the alphabet.

Furthermore the standard alphabet would better be the Crockford one as the alphabet proposed here is likely to generate the occasional English profanity.

ofpiyush commented 3 weeks ago

If we're:

allowing other alphabets
splitting the API into "works for almost everyone" and "gives you control"

Allowing for flexibility on both length and alphabet with documentation on why most people should use the default one could be useful.

Something like

alphabet, err := rand.ParseAlphabet("abcdef")

to parse the string and make alphabet from it.

length := alphabet.LengthFor128()
alphabet.Text(length)

Allows us or user to add LengthFor256, also lets user select length for themselves if they want a shorter/longer fixed length and don't really care about 128 bits.

rand.Text()

Returns a text that's safe from brute-force attacks. Length of returned string might vary over the years/decades.

ericlagergren commented 3 weeks ago

I support the proposal and it removes the whole problem of selecting the alphabet. There is one question: 26 base32 characters can represent 130 bits. Shouldn't the function generate 130 random bits to avoid that one or two character are not selected from the whole base32 alphabet?

@ulikunitz Each character in an alphabet of length N is chosen with a probability of 1/N.

rsc commented 3 weeks ago

There is a lot of potential complexity we could add. It still seems like Text is worthwhile by itself, and we could add the more complicated API if it turns out that there are a lot of use cases that Text does not serve.

A few comments have said that a fixed alphabet makes it less useful or that we need to be able to change the alphabet, but there are no use cases given. Concrete use cases would help. When @FiloSottile and I were discussing this, we couldn't come up with any compelling reasons why that alphanumeric alphabet wouldn't be okay. And there is a lot of complexity that goes away if you don't have to deal with arbitrary alphabets. (For example, what happens with repeated runes, non-1-byte runes, and so on.)

If purely numeric is the main objection / use case for other alphabets, then maybe rand.NumberText (or a better name) could be added too.

cespare commented 3 weeks ago

Overall this proposal seems quite nice to me. I've implemented token generation code quite a few times and this would probably work for many of those instances.

At my company we typically use base62 for these tokens. This avoids the punctuation characters in base64 but otherwise keeps the alphabet as large as possible so that the tokens are as compact as possible. For 128 bits, a base62 representation fits in 22 characters (same as base64).

Overall, the base62 tokens just seem nicer and less shouty to me:

ZmDXJCodUJBDmMq0yZr2GG
E7TIQn5TJX9xkZrxCphPZ5
qY7Mm0IrJQEiM2gD2WJ5N1
xwm0ezZrr048VHJ3Kw7KOF
BSXR5EsuLZ8HRaZlccIluJ
VS8p0QbSs8Dbx9KE12H3aC

vs

HQ7NYW4KWSLM432JBKDLNFBJXA
7VLPW5M7QOLKTZQQLQ6XBQMQDI
QCKVH4FX7BQ432JQPS2NRA4MYY
7NBK33HDBUYAN2757P7EQCQXKA
HH37REM6JLBFY6LIQPP5HC7X7I
ZCNMMZBPRNEZHGAFCOH4WSTTDQ

The reason we found to use base62 over base64 (which we used a lot more in years past) is that the punctuation in various base64 alphabets really does cause issues in places like filenames and URLs (even though - is "safe", for example, it's common to have filename conventions where - is used as a separator) as well as little annoyances such as "what gets selected when I double-click this string".

By contrast, the reasons to prefer base32 over base62 mostly focus on the potential confusion between upper/lowercase and lookalike letters/digits, and for our use cases these are just not real issues. Mostly people don't need to transcribe long random tokens over the telephone or otherwise transmit them in situations where confusion is a problem. If you are generating a password, then any version of rand.Text seems too naive; more complex code will typically be needed anyway to handle specific password requirements.

Anyway, I doubt I'll get buy-in on this, but I wanted to relate our real-world experience with picking an alphabet for token generation.

ericlagergren commented 3 weeks ago

@rsc I think your most recent proposal is ideal.

As a thought experiment, though:

Concrete use cases would help

Longer alphabets allow shorter outputs at the same security level. A few years back I worked on a project where token length was a primary concern. One of our proposed fixes was to just increase the size of the alphabet.

Rican7 commented 3 weeks ago

A few comments have said that a fixed alphabet makes it less useful or that we need to be able to change the alphabet, but there are no use cases given. Concrete use cases would help.

Sure, I think I can provide some help there.

First and foremost, the proposal initially states that the utility (and motivation) for a feature like this is "Random strings are useful as passwords, bearer tokens, and 2FA codes". In many cases, these kinds of values won't be exposed to users, as they'll be encoded in cookie values or other abstracted mechanisms, in which case the chosen alphabet rightly doesn't matter much. In some cases, however, these will be exposed to users, either as visible URL paths or arguments, such as in password reset links or in public "share" links (think Google Photos or even the Go Playground), in which case the chosen alphabet and length is often a specific application design choice with a trade-off of collision potential, "guessability", and aesthetics (again, as we can see in the Go Playground source). Finally, in other cases, these values will not only be exposed to users, but actually expected to be typed by application end-users, such as in the common modern case of 2FA/MFA codes as the original proposal even suggests, in which case the chosen alphabet is very important to reduce user error and improve user experience and accessibility.

In any case, the choice of a base-32 alphabet seems well intentioned (and I personally like it), but it's a relatively arbitrary one, is it not? Again, while I'm a fan of the tradeoffs of compactness and readability of base-32 alphabet encodings, there's been a pretty large distributed "debate" recently, on which encoding is best used for these kinds of things, which has led to many different approaches from different projects:

... so you can see that there are quite a few different opinions on encoding alphabets, with varying popularity, but overall a massive interest and/or user-base among these different projects. So, again, I think limiting to a specific alphabet would limit it's use to the point where developers would likely implement their own versions of this function, and likely do so insecurely (mistakenly), which seems to conflict with a big motivation for this proposal's existence.

bjorndm commented 3 weeks ago

@rsc A concrete use case is the generation of user facing passwords. Searching https://pkg.go.dev/search?q=generate+password+ yields 20+ packages that do this, so the need for this use case seems clear to me.

ericlagergren commented 3 weeks ago

This proposal is for generating random strings from an alphabet. Generating passwords from a schema is significantly more complex for anything other than the most simple cases.

earthboundkid commented 3 weeks ago

I've written two things that intersect with this proposal:

A password generator. For that, not allowing user defined strings would mean I couldn't use rand.Text.
crockford.Random which returns 2^64 random bits in Crockford32 format.

I don't use the password generator anymore (I switched to using browser JavaScript), but I am still using crockford.Random in production. It would be nice if this were compatible with that.

thepudds commented 3 weeks ago

And there is a lot of complexity that goes away if you don't have to deal with arbitrary alphabets.

Overall, I like this approach. Two small thoughts:

Perhaps the implementation could be written such that it is trivial for a user to copy the code and substitute another ASCII alphabet if needed. There could be an internal comment suggesting this for people who need a different alphabet (or perhaps even in the public doc text, though I don't know if there is precedent for that).
Alternatively, perhaps ~5 integer constants could be defined for the ~5 most common standard alphabets, along with a corresponding rand.Text parameter. (However, doing this might lead to a slow drip of people asking for that list of alphabets to be extended).

(Also, based on my experience, I'm sympathetic to many of the points @cespare makes in https://github.com/golang/go/issues/67057#issuecomment-2163617980, especially around things going wrong due to punctuation because of encoding or via copy/paste in emails and ticket systems and whatnot).

rsc commented 2 weeks ago

I hear the comments about some cases needing custom alphabets and the like. Those cases already have to write their own code, so they won't be helped by Text but also not hurt. It seems worthwhile to add Text first, which will help many cases, and then see what's left after Text is in use.

Do I have that right?

earthboundkid commented 2 weeks ago

It seems like holding the more general version of this won't hurt anything. If we revive the generic set proposal, this could come back then as a function that takes a set.Set[rune] and returns a string.

rsc commented 1 week ago

crockford.Random which returns 2^64 random bits in Crockford32 format. ... I am still using crockford.Random in production. It would be nice if this were compatible with that.

Obviously it would be nice to be compatible with existing code, but it seems clear to me that Text should not return only 64 bits. The discussion here was about whether 128 is enough. I think we all agree 64 is too close to "not enough" for comfort. And unless you are decoding the texts, it doesn't matter exactly how they are decoded. You could probably do rand.Text()[:N] for a suitable N if you did want to replace your existing code.

rsc commented 1 week ago

Have all remaining concerns about this proposal been addressed?

The proposal is to add to crypto/rand a single function:

// Text returns a random string over from the standard RFC 4648 base32 alphabet
// for use when a secret string, token, password, or other text is needed.
// The result contains at least 128 bits of randomness, enough to prevent brute force
// guessing attacks and to make the likelihood of collisions vanishingly small.
// A future version may return longer texts as needed to maintain those properties.
func Text() string

Other features like custom alphabets and custom lengths are deferred to future proposals once we see how well Text works.

ericlagergren commented 1 week ago

@rsc maybe this is better left as a comment on the CL, but the docs for Text should probably include the words "cryptographically secure."

jfrech commented 4 days ago

@rsc s/over from/over/ ; Why was "bits of entropy" replaced by "bits of randomness"?

golang / go

proposal: crypto/rand: add func Text #67057