jirka-h / haveged

Entropy daemon ![Continuous Integration](https://github.com/jirka-h/haveged/workflows/Continuous%20Integration/badge.svg)
GNU General Public License v3.0
270 stars 33 forks source link

Entropy increase suggestion - math part only #79

Open B1u3l1ght opened 1 year ago

B1u3l1ght commented 1 year ago

Hello, I've heard about this amazing project and its goal to increase entropy for a stronger security. I've developed recently after a long road (3-4 years) of personal research a formula that might be used to obtain superior entropy, either we are talking about password generation or bits of entropy. This formula was submitted to OEIS and was approved twice, but without an explanation as we speak it remained in proposal status, from what I've understand the chosen name for the formula was not on editors liking.

To prove my point making this story short I'm gonna briefly show mixed arrangements (how I called them) 2^40, 40 bits based on element repeating (knowing processors work only with 1 and 0). The Idea is simple each string belongs to a pattern these patters can have more or less variants, the goal is to use that pattern that holds the bigger number of variants to obtain superior entropy. From the example bellow we can see the least numerous pattern is the first one that holds only 2 variants (string formed from only 1, and second the string is formed from 0). Based on this criteria patterns hold bigger or smaller number of possible arrangements. The exact calculation of these patterns can be made only thru my recently discovered formula. Now my proposal is that instead of randomly choose from all those possible patterns to use only the 'slice' with the bigger amount of arrangements and because this pattern is still lower than the whole to just increase the length of the string to match the previous entropy but using only that bigger slice.

So instead of 40 bits to have 43 but use from it only that slice with the biggest number of arrangements. For 44 bit string the beefiest slice would be when we have like 21 of 1 and 22 of 0 or vice versa 21 of 0 and 22 of 1. That pattern/slice being approx. 2x bigger (~2*10^12) than the whole 40 bits (~10^12) The reason to do this would be to rule out those extremely guessable patterns and their strings. So instead of having that good old random were gonna have a so called targeted random minimizing almost completely the eventuality of anyone guessing kernel memory allocations.

If I've managed to lit up some light of interest I can go further into details about formula and so on. Last thing that I wanna add is that I'm not a programmer (just some basic stuff some python here and there etc.) the only part I can touch is the math part of this topic.

nh43

M symbolizes mixed arrangements for different patterns.

jirka-h commented 1 year ago

Hi,

thanks for reaching out! It's definitively interesting!

It sounds similar to a whitening transformation: https://en.wikipedia.org/wiki/Whitening_transformation

If you want to share more details, we might try to develop a code to perform the transformation/filter - reading in a random data and performing the operation you describe.

I have also checked your other project: https://github.com/B1u3l1ght/Old-School-Pass#readme

Could you please point me to a script that performs the password generation as described in the README file?

Thanks a lot Jirka

B1u3l1ght commented 1 year ago

Ok great here's the formula and I'll explain step by step all its tricks. screen_2023-05-15_18-26-11 where

M - mixed arrangements (number of arrangements that form based on elements repeating or not for a certain pattern). T - total number of elements/chars we choose from e.g. a-Z, symbols, 0-9 => T=94 (without space symbol). S - chosen string length (can be bigger or smaller than T doesn't matter but with a small mention see b). b - this is the backbone pattern, (bbp). Patterns are ways to sum to S. The only small mention but important is that when T<S patterns are ways to sum to S but from ≤T number terms. Z - this is the number of unique elements in the PATTERN, e.g. for pattern 3 3 2 2 2 1 => Z=6
('zipped' string length or the number of digits of the b). m - the number of repetition of n element/char. e.g. 3 3 2 2 2 1 => ma=3, mb=2 (unique elements do not form an m, m symbolizes repetitions only. If bbp. has no repeating element => m=0 e.g pattern like 111111..1 has no repeats but fundamental/uniques only. n - the number of times an m with the same value shows up in the pattern. e.g. 3 3 2 2 2 1 => na=2, nb=3. k - m×n, e.g. 3 3 2 2 2 1 => ka=ma
na=32=6, kb=mbnb=2*3=6

Bellow two examples,

1st when T≥S (typically for a pwd calculation) T=8, S=8 one of the ways to sum to 8 (i'll provide a script that gives the exact number of ways to sum to a certain S)

2+2+1+1+1+1=8 (took the beefiest bbp.) 22111100 M(8,8)=A(8,6)×(8!/(2!^2)×2!(8-4)!)=8!/2! × 210 = 4.233.600 while the whole has 8^8=16.777.216 there are 22 ways to sum to 8 Screenshot-2023-06-05-19-38-24

Note: there's no way even if the string is huge inevitable some patterns are weak (have very low number of arr.) that can be brute forced, this is why there's need to choose the best b that would simply reduce the probability of an accident to the lowest possible.

Always the sum of arrangements from all slices have to add up to T^S => Σ=T^S, this is good way to verify if the script or hand calculations are correct.

2nd example when T<S (string longer than available unique elements, e.g useful for binary/decimal/hexadecimal or PIN/codes etc) Screenshot_2023-06-05_20-15-00

The reason best pattern in the case of 40 bits is asymmetric is because once an 'm' shows up twice, the second factor of formula is inverse proportional to that m raised to the power of times that m shows up (in this case twice) (20!^2)*2!.

As a general rule an 'm' should be as small possible (m=2) but not 0 (in that case there's no amplification factor at all but still not that bad), and fewer possible. This applies for 'n' also, n can be 1 unlike m that needs to be either 0 (no reps.) or from 2 and up. A 2 in a pattern signifies that element/char shows up twice e.g. 211111... sample string AebcAjh

Another general rule when T is significantly bigger than S the most beefy patterns are those that tend to have the least repeating elements or no reps at all e.g. bitcoin seed, T=2048 dictionary words, S=12 or 24 words that form bitcoin mnemonic phrase.

Bellow the script sum to S (see some comments inside script) and an example of calculation of a large string. I'll add a script too a script that calculates some of the patterns. Off the record: My python sill not that good (I'm at the while loops and didn't had much time lately). https://i.postimg.cc/Wz73n3z0/Screenshot-2023-04-25-11-21-12.png

`def NumberOfways(N, K):

# Initialize a list
dp = [0] * (N + 1)

# Update dp[0] to 1
dp[0] = 1

# Iterate over the range [1, K + 1]
for row in range(1, K + 1):

    # Iterate over the range [1, N + 1]
    for col in range(1, N + 1):

        # If col is greater
        # than or equal to row
        if (col >= row):

            # Update current
            # dp[col] state
            dp[col] = dp[col] + dp[col - row]

# Return the total number of ways
return(dp[N])

Driver Code

N = 40 ## S string length (when T≥S both parmeters N an K should be the string length; when T<S, N=S and K=T) K = 2 ## T total elements (default, script shows ways to sum to 40 from when T=2)

print(NumberOfways(N, K))`

jirka-h commented 1 year ago

Hi,

I'm sorry for the delay in my communication. I was busy with other stuff.

I have reviewed your formula and in fact, this is a frequency check. This is already covered by the internal continuous/life tests - please see https://github.com/jirka-h/haveged/blob/master/README.md, section "TESTING haveged":

Run-time tests are broken into 2 groups, Procedure A containing 5 tests, and Procedure B containing 3 tests. Procedure A consists of performing a 48-bit disjointedness test on 64K sequences, followed by 257 repetitions of the four FIPS-140-1 tests and an auto-correlation test on successive 2000 bit sequences. Procedure B performs distribution tests for 10,000 occurrences of 1, 2, 3, 4 bit runs in successive samples, followed by a entropy estimate based upon on another 256000+2560 bit sample. A sample must pass all individual tests to pass the procedure.

To test it yourself, you might want to examine these random strings consisting of all printable characters, not including space ([:graph:]). Here is the sample command and it's output - feel free to modify the length of random strings as needed:

$(haveged -n 1000 -f - 2>/dev/null | tr -cd '[:graph:]' | fold -w 16 && echo ) | head
W}8|=g[]&pW3Iy{m
-p!Xv,."6_{qugK?
\a`gN+^&p%KvVY5f
^%dC~uCu~eFs`%?N
Q0!wB8wdq+,b6G8d
Yu;R?Ma;}ZuMK<Bf
hNAnIU}.]Jh4GGSj
G`IS:O<l!b~mu-A<
J8+Yy[vc-e>{wUJt
&;+^QWq]&)mi({]h

Thanks Jirka

B1u3l1ght commented 1 year ago

Hello, no problem

Screenshot_2023-06-27_11-55-21

Now my proposal is to do it like this:

Calculate the required/desired entropy. For example when we need T=94 and S=16 => T^S=94^16 = 3,71×10³¹ (maximum possible).

To avoid weak patterns that give weak strings (0.83% etc) were gonna increase the length of the string and 'target' the beefiest pattern.

So we have T=94 but increase S=16+1=17

In this case best pattern (chassis or array however we wanna call it) is that with 21111111111111110 (similar but huge numerical different)

So now were gonna have our 94^16=3,71×10³¹ desired entropy inside only one pattern 1,30×10³³ (just by increasing the length but targeting exclusively only one of the pattern). You can see that this pattern alone is 35x stronger than the whole 94^16 initial cake.

This way ALL (not only SOME) strings gonna have desired entropy 1,30×10³³.

So we have to instruct our pwd generator to give ONE element at random that repeats, in random way, and 15 elements at random that are uniques (show up only once) => 21111111111111110 (the TWO in the pattern symbolizes that ONLY one element repeat)

This concept of TARGETING RANDOMNESS ensures no accidents happens or better said eliminates completely accidents of a bad string. These patterns have to be predefined (pre-calculated) so the kernel doesn't waste time calculate things at that low level.

For this I've developed a python pwd calculator script. I'll give an example to check it out how to use it.

so for a pattern let's say 443222110 T=94, S=19 ( S =4+4+3+2+2+2+1+1=19)

'ems' are repetitions 4 3 2 'ens' are how many time those 'ems' occur in the pattern 2 1 3

If pattern has no 'ems' and hence no 'ens' e.g patterns like 1111111111.. consisting in unique elements only, just hit enter.

Screenshot_2023-06-27_11-37-51

Here's the python script to calc. best patterns. I'll provide you (if you're interested further) with a so call benchmark bash script that I made to check string patterns. But here's the pwd calc I've made.

`print("enter your T and S") T, S = map(int, input().split())

J = T**S print('Total variants T^S:', (format(J, "e"))) print(J)

print("your ems") print("insert your ems, e.g. 5 4 3 2") array1 = list(map(int, input().split()))

print("insert your ens e.g. 4 3 2 1") print("your ens") array2 = list(map(int, input().split()))

Resize the arrays by filling with zeros

array1 = array1 + [0, 0, 0, 0] array2 = array2 + [0, 0, 0, 0]

m7, m6, m5, m4 = array1[:4] n7, n6, n5, n4 = array2[:4]

ens = int(n7) + int(n6) + int(n5) + int(n4) kmn = int(m7) int(n7) + int(m6) int(n6) + int(m5) int(n5) + int(m4) int(n4) Z = int(ens) + (S - kmn) print('Your Z is:', int(Z))

import math

a = math.factorial(T) b = math.factorial(T - Z) if T - Z >= 0 else 1 # Add check for non-negative argument c = math.factorial(S) d = ( math.factorial(m7) ** n7

Thanx for your interest. As a conclusion from my point of view, strings generated by your method may be good, but as I see it has no 'mechanism' in getting rid of 'accident string' nor other similar generators that consider whole cakes of possibilities. I'm 100% sure a targeted randomness and combined with your approach would give best results. Thanks again for your time and effort putted into this bold project of yours. Cheers!

jirka-h commented 1 year ago

Hi,

thanks a lot for the detailed walk through! It helped me a lot to fully understand your thought process and proposed method.

I'm thinking about writing a script which would read random bytes from the stdin (for example from /dev/random) and create passwords of desired length from given alphabet using the concept of TARGETING RANDOMNESS. In essence, it would work similarly as naive approach with tr -cd '[:graph:]' command, in a tradition of Unix tools chained via the pipe, but using TARGETING RANDOMNESS approach.

Now to make the script user friendly, you will be required to provide only password length and have methods to alter the alphabet. The script would then

I will share an inital version once it's ready. It might take a while because of summer holidays. Stay tuned.

Thanks Jirka

B1u3l1ght commented 1 year ago

Great news! I'm confident this approach of aiming at the core of these arrangements will considerably increase security/strength of pwds and not only. Here's that benchmark script to obtain its pattern from a string. You just need to insert the string in the 'fin' file then execute x.sh. (not the best coding skills but it works).

https://github.com/B1u3l1ght/Old-School-Pass/blob/main/pass_bench.zip

Because an adversary doesn't know the chosen pwd length it can not know which pattern is used even if in the future will hear about this method. For the case where string length is known can be used t random top 3 or 4 strongest patterns to leave even the most determined attackers into a smoke curtain.

Thanks and happy holidays 🌴🌴🌴 😉

B1u3l1ght commented 1 year ago

Returned briefly for better solution obtaining bbp. from strings. These takes randomly generated strings from an 'a' file (not just one string at a time, column format) and gives their bbp in 'b' file and later can be assessed their strength via pwd calculator.

perl perl -F'' -le '$,=" "; $c{$_}++ for @F; print sort{$b<=>$a} values %c; %c=();' a > b

same thing but in python https://github.com/B1u3l1ght/Old-School-Pass/blob/main/countz.py

And another important script this one that gives all possible bbp. upon provided S. When T happens to be smaller than S some of these bbp. do not count. bbp's have to be ways to sum to S starting from at least 1 factor till at the most T number of factors (and not S number of factors like in the T≥S case)

https://github.com/B1u3l1ght/Old-School-Pass/blob/main/bbp.py

Tnx and have a nice weekend.

jirka-h commented 1 year ago

Thanks a lot for sharing your scripts!

I have used the perl script to verify that my initial passwd generator works correctly, generating passwords according to the input backbone pattern.

The next step that I need to implement is to

To generate all bbps, I will reuse the code from your script bbp.py.

Could you share the code to compute the probability for given bbp, given the value of T and S?

Thanks! Jirka

B1u3l1ght commented 1 year ago

I only have this mix.py script that expects user input for T and S, 'ems' and its 'ens'.

Coding not my strongest point currently..

I guess it can be modified so it takes patterns generated by the bbp.py and feed the modified mix.py script.

I've decided to express probability as percentage. The number of possible arrangements of the bbp divided by all possible arrangements T^S. So it's ((bbp arr. number)/T^S)*100

This part of the code (at the bottom) from mix.py calculates probability: g represents the number of variants calculated for that bbp, h is the division between g and J (which is T^S) and all multiplied 100x the bigger the percentage the better bbp.

However the more we increase string length the more bbp form (cake can be broken into more slices) and so the % goes even at 3-4% for the strongest but those weaker goes into the 0.04% or even 10^(-9)%.

Screenshot_2023-07-02_16-02-51

I think there's need to modify bbp.py too so it can output those bbp's that match the case when T<S. Currently it gives good results only for T≥S.

For long string e.g 63 it can give huge number of patterns about 1.5 mil so it can crash the system.

Tnx.

jirka-h commented 1 year ago

Thanks for the pointers - I can take it from here.

For long string e.g 63 it can give huge number of patterns about 1.5 mil so it can crash the system. I was thinking about this as well. If we limit ourselves to T>S case for the start and recall that we search for the best bbps, we can safely omit the cases where number of repetitions is high. When we limit a single letter to repeat max - let's say - 4 times, the number of possibilities reduces significantly.

I will try to come up with a initial version for T>S case in a week or so. Stay tuned.

B1u3l1ght commented 1 year ago

Ok, in the meantime I saw that mix.py aka pwd calculator script gives erroneous results. I'll get back to tell why it gives bad results. The math behind it works but was not properly implemented in the script.

edit: The guy that helped me with the script (mix.py) seems to have messed up the formula.

edit2: Looks like this part does not reflect the formula. Screenshot_2023-07-09_16-34-42

edit3: Think I got it. When crunching large number, python precision drops. That's why mix.py gives some 'error', but it's not something about the script but with the accuracy of how python deals with large numbers.

Using gnome calculator A(94,12)=94!/82!=228743740236102111820800 Python script lose accuracy after 17 digits 94!/82!=228743740236102111330304

228743740236102111820800 - gnome calc high accuracy 228743740236102111330304 - mix.py (margin of error is in the hundreds of thousands) It loses 490496 on the way, there's need for high accuracy as when using large numbers slices have to fit exactly T^S

B1u3l1ght commented 1 year ago

In the meantime fixed the problem. To boost python precision had to install a module called python-mpmath. Everything checks perfectly while mapping a whole 94^16 (16 chars long pwd with a search space depth 94, without space char).

You can check it out here

I'm gonna 'correct' python script mix.py and add mpmath syntax. There's need to kick precision as high as 1000.

edit: Added the fixed version of the mix.py now called remix.py

Used 'abs' (absolute syntax) as I've seen it gives higher accuracy this way.

Cheers and tnx.

jirka-h commented 1 year ago

Thanks for the updates!

I'm leaving for summer vacation and I will have no computer with me. I will look into this in August when I'm back.

Enjoy the summer (if you are based on The Northern Hemisphere) Jirka

B1u3l1ght commented 1 year ago

Ok great, I'll take couple of days off too going to some mountain region for cooler air cos here somewhere in the South-Eastern EU is boiling hot 🫠.

Have a nice relaxing time! 🌴 🎾 🍹

B1u3l1ght commented 1 year ago

Hello. How was your vacation or you still on it 😉. Made a larger simulation with your generator as is currently. Tried to simulate 1500 users generating their strings randomly. The result highlights are as follow:

  1. One string went as low as 17k weaker than the best (can be cracked in ~48 hours) while a strongest one in 127 years.
  2. 88 strings out of 1500 are at least 10x weaker (out of the 'comfort zone').
  3. 9 strings are 53x weaker.
  4. 7 strings are at least 367x weaker.

Another observation I've made is that the moment we extract in bulk many random strings generator seems to perform better (strings have high entropy), when we extract fewer at once, generator seems to lose some entropy stamina.

You can see results in the image bellow. The red line is the 'comfort zone', everything bellow that is 10x weaker.

As a solution so we don't have to calculate every time specially for very long strings we can instruct random generator to sample 10-20 (or even 50, the higher the the better certainty) strings extract their patterns and the one that shows up more often than the others should be used as being the best.

EDIT 1 : however for the large strings but not only for those still can happen some bbp. to show up as often as other even if are not that strong so to solve which one is better should still go thru formula that can finally establish the winning bbp. The larger the string length the bigger the need for a larger sampling to establish the winner that has to pass the formula test also. EDIT2: for a XXXL string e.g. 94 long the method of sampling randomly becomes unreliable due to the fact all patterns becomes vastly numerous and thinner (1st has 0.88%) however the best will still pop up but not as the most frequent pattern necessarily.

F125fTYWIAAxwy4

Cheers! ☮️

jirka-h commented 1 year ago

Hi,

thanks a lot for running the larger simulation! I have checked the frequencies for the first three categories and they match the expected values

595/1500: 39.6% (expected 39.25%) 379/1500: 25.3% (expected 25.84%) 358/1500: 23.9% (expected 22.32%)

This verifies that haveged RNG runs correctly and also that the formulas and their implementation are accurate.

Creating a Python password generator is still on my TODO list. While I'm back from the main vacation, I'm still offline most of the time, either traveling or being busy with other assignments. Realistically speaking, I will be able to look into it by mid-September when the school vacations are over and the kids are back at school. I'm sorry, but I need to balance my family life. 

Thanks for the understanding and your patience! Jirka

B1u3l1ght commented 1 year ago

That's ok there's no rush.

Regarding RNG and the formula, yeah both of them work correctly and theory meets practice and vice versa. Once the new pwd generator will be done my bet is that will be the new reference point on pwd realm and not only. Cryptography in general can be significantly improved also.

Have a nice weekend! Bluelight

B1u3l1ght commented 9 months ago

Hello Jirka, Happy New Year! I'm having some novelties. For even a deeper dive into this complex task we started here. Managed to obtain a broader formula. This one covers not only S patterns but also T patterns (groupings). This is useful when we want to determine what's best to choose when we consider total elements grouped. For example we need to know which S pattern is best for a T pattern.

Example of a T pattern: 32 symbols 26 Cap letter 26 low cap 10 (0-9) numbers

32 26 26 10 = 94 = T

Surprisingly OEIS guys were significantly more pleased about this one but still there's a lot of slow advance as their foundation works too on donations business model.

Since it got broader and we introduced a T pattern (grouping) named this kind of arr. Split Arrangements (S4) S index 4 means there are 4 groups in this case (32 26 26 10).

S (pattern) 6 5 4 1
S4(94,16) T (grouping) 32 26 26 10

The above it's established as being the strongest S pattern 6 5 4 1 when we consider 32 26 26 10 grouping. So based on the grouping were gonna have different strongest S array.

T grouping can not contain any 0 (zeroes) so it can not be 32 0 26 26 10 while the S array can have zeroes like

10 6 0 0 10+16 = 16 = S array 32 26 26 10 ... = 94 = T

Now the previous formula (mixed arr.) always took one way of grouping, per element only. And so T pattern was always like: 1 1 1 1 1..1 longer or shorter upon T number of elements but if we need to group elements based on their type the 111..1 T pattern not gonna be much helpful.

Bellow I've attached an img. that shows the complete 'dissection' when T=6 and S=5, 6^5 = 7776 including the calc model. There are calculated all possible T and S arrays that exists. The first part of formula deals with T grouping so if we have 4 groups like mentioned above were gonna have A(4,4) = 24, if 3 groups A(3,3) if 7 A(7,7) number of factors of the 1st part of formula. The second part deals with S array and it's being calculated as previously no changes. The final result will be the × between 1st and 2nd part of formula. When we have the whole puzzle image like bellow we can determine what I'm trying to propose a term called Hyperstring (*see 2nd img).

rthbrth

For the record one of the OEIS editors proposed this form for the 1st part of formula (dealing with T grouping)

bnhgvbf43

And here the Hyperstring thing. Each of those 6^5 = 7776 arr. belongs to 10 S pattern relative to a T pattern ( a bit ambiguous here). Why 10? In this case there are 11 ways to sum to T=6 but because one of the way means putting all of them in a large pile we don't need to count that as all of them have it. Those that are more numerous no matter what and can not be singled out in any way shape or form I proposed calling them Hyperstrings. Those highlighted in red rectangle are the weakest. Because I've used low numbers 240 doesn't look like much but if we scale things up in 94^16 or 94^63 Hyperstrings must be the holy grail of pwds. Currently I have no formula to calculate them. Here they are:

nhgtr

Worked on a script with the help of Artix community that filled the black in my python skills .. This script you can edit T groups depending on liking, it counts and categorizes a text file full of strings. Can be used to establish the best S array for a certain T grouping counting which one shows more often. This is true till we hit big numbers aspect that I've previously talked for a bit.

https://github.com/B1u3l1ght/Old-School-Pass/commit/4c3bb233e932222dd2861ae526c71e27bf05753b

Hope you'll gonna find this interesting 🤞🏻trying to make this as a Christmas/break time gift for the open source community.

Peace, Blu3l1ght ✌🏻

B1u3l1ght commented 9 months ago

Forgot to add a larger example when T=94, S=16 Here's the mathematical proof that when we need a 16 long pwd it's better to choose: 6 symbols, 5 letters, 4 low cap, 1 number.

We can choose the other way around as long as we keep 6541 grouping however we can see the bigger numbers are behind the above (6 symbols, 5 letters, 4 low cap, 1 number).

Upper right corner (green rectangle) we have confirmation about that grouping as being the strongest.

Additional letters were used so to manage more easily those powers. You can see there are 4 groups so A(4,4)=24. So this 'base' 32 26 26 10 (the grouping) will show up 24 times but each time different till exhaust all those 24 possibilities that arise.

6b4x0

Blu3l1ght✌️

jirka-h commented 9 months ago

Hello B1u3l1ght,

Happy New Year to you, too!

I really like this new approach. This addresses the common need to have a certain character classes in the password.

I have created a script similar to your z.py to count the frequencies of characters in each category for randomly generated passwords. Results are matching yours! Find the script bellow.

Now, here is the outline of the password generator:

  1. Create random passwords and find out how many letters in each category are required to get the highest possible entropy/randomness.
  2. Take at random one of top categories from step #1. When picking up the category, take into account the relative frequencies. (First combination will be more probable than the 10th combination). Example:
    (('symbols', 6), ('upper', 5), ('lower', 4), ('digits', 1)): 152 lines (1.52%)
    (('symbols', 6), ('lower', 5), ('upper', 4), ('digits', 1)): 145 lines (1.45%)
    (('lower', 6), ('symbols', 5), ('upper', 4), ('digits', 1)): 136 lines (1.36%)
    (('symbols', 6), ('lower', 5), ('upper', 3), ('digits', 2)): 130 lines (1.30%)
    (('symbols', 7), ('upper', 5), ('lower', 3), ('digits', 1)): 121 lines (1.21%)
    (('symbols', 6), ('upper', 5), ('lower', 3), ('digits', 2)): 120 lines (1.20%)
    (('upper', 6), ('symbols', 5), ('lower', 4), ('digits', 1)): 118 lines (1.18%)
    (('lower', 6), ('symbols', 5), ('upper', 3), ('digits', 2)): 113 lines (1.13%)
    (('symbols', 7), ('upper', 4), ('lower', 3), ('digits', 2)): 104 lines (1.04%)
    (('symbols', 7), ('lower', 5), ('upper', 3), ('digits', 1)): 103 lines (1.03%)
  3. Let's say we picked combination (('symbols', 6), ('upper', 5), ('lower', 4), ('digits', 1). Now generate random strings for each category. WE KNOW how to do this - just use your formula to get the best possible frequencies of individual characters:-) Luckily, we have not to deal with huge numbers anymore. In the example above S=6 and T=31 for symbols, S=5 and T=26 for capital letters, etc.
  4. We have now parts of the password ready. Combine them into one string and do one random permutation over its chars to get a final password.

This way, we can generate strong passwords and make sure that we have all categories as needed at the same time.

I have a very good feeling about this. I have chunks of the code ready, now I have to integrate it.

Let me know what you think. I will give it now a higher priority and will come up with first version of the code in a week or so.

Jirka

Script to count the frequencies of number of chars in each category.

#!/usr/bin/env python3

import sys
from collections import Counter

#https://github.com/tytso/pwgen/blob/master/pw_rand.c#L16C1-L21C42
pw_digits = "0123456789"
pw_uppers = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
pw_lowers = "abcdefghijklmnopqrstuvwxyz"
pw_symbols = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
#pw_ambiguous = "B8G6I1l0OQDS5Z2";
#pw_vowels = "01aeiouyAEIOUY";

categories = {"lower": pw_lowers, "upper": pw_uppers, "digits": pw_digits, "symbols": pw_symbols}
char_map = {}
for i in range(128):
    if i < 33 or i > 126:  # definition of [[:graph:]]
        continue
    c = chr(i)
    for k in categories:
        if c in categories[k]:
            char_map[c] = k

# Initialize an overall counter for frequencies
overall_category_count = Counter()

total_lines = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        break
    category_count = Counter(char_map[char] for char in line)

    # Sort the items of the Counter by counts in descending order
    sorted_items = sorted(category_count.items(), key=lambda x: x[1], reverse=True)

    # Convert the sorted items to a tuple for use as a key
    key = tuple(sorted_items)
    overall_category_count[key] += 1
    total_lines += 1

# Sort the results by frequency in descending order
sorted_results = sorted(overall_category_count.items(), key=lambda x: x[1], reverse=True)

# Print the top 10 and bottom 10 results with percentage
print("Top and Bottom 10 Overall Category Frequencies (Sorted by Counts with Percentage):")
for category_count, frequency in sorted_results[:10]:
    percentage = (frequency / total_lines) * 100
    print(f"{category_count}: {frequency} lines ({percentage:.2f}%)")

print("\n---\n")

for category_count, frequency in sorted_results[-10:]:
    percentage = (frequency / total_lines) * 100
    print(f"{category_count}: {frequency} lines ({percentage:.2f}%)")

Usage and sample output for 10,000 random passwords:

$(cat /dev/random | tr -cd '[:graph:]' | fold -w 16 && echo ) | head -10000 | ./calculate_freq.py 
Top and Bottom 10 Overall Category Frequencies (Sorted by Counts with Percentage):
(('symbols', 6), ('upper', 5), ('lower', 4), ('digits', 1)): 175 lines (1.75%)
(('symbols', 6), ('lower', 5), ('upper', 4), ('digits', 1)): 172 lines (1.72%)
(('lower', 6), ('symbols', 5), ('upper', 4), ('digits', 1)): 148 lines (1.48%)
(('symbols', 6), ('upper', 5), ('lower', 3), ('digits', 2)): 141 lines (1.41%)
(('symbols', 6), ('lower', 5), ('upper', 3), ('digits', 2)): 131 lines (1.31%)
(('symbols', 7), ('upper', 5), ('lower', 3), ('digits', 1)): 122 lines (1.22%)
(('lower', 6), ('upper', 5), ('symbols', 4), ('digits', 1)): 118 lines (1.18%)
(('symbols', 7), ('upper', 4), ('lower', 3), ('digits', 2)): 116 lines (1.16%)
(('upper', 6), ('symbols', 5), ('lower', 4), ('digits', 1)): 116 lines (1.16%)
(('upper', 6), ('lower', 5), ('symbols', 4), ('digits', 1)): 107 lines (1.07%)

---

(('digits', 4), ('upper', 4), ('lower', 4), ('symbols', 4)): 1 lines (0.01%)
(('upper', 4), ('digits', 4), ('symbols', 4), ('lower', 4)): 1 lines (0.01%)
(('upper', 9), ('symbols', 6), ('digits', 1)): 1 lines (0.01%)
(('digits', 5), ('lower', 5), ('upper', 3), ('symbols', 3)): 1 lines (0.01%)
(('lower', 9), ('upper', 5), ('digits', 2)): 1 lines (0.01%)
(('upper', 4), ('symbols', 4), ('digits', 4), ('lower', 4)): 1 lines (0.01%)
(('digits', 5), ('lower', 5), ('upper', 4), ('symbols', 2)): 1 lines (0.01%)
(('symbols', 9), ('upper', 4), ('digits', 3)): 1 lines (0.01%)
(('upper', 7), ('lower', 5), ('digits', 4)): 1 lines (0.01%)
(('upper', 6), ('digits', 5), ('symbols', 5)): 1 lines (0.01%)
B1u3l1ght commented 9 months ago

Looks great👍🏻. But it will take a bit more till I can analyze it properly again math layer only as I'm absolutely sure you're doing a great job not only code related but mathematically wise too.👍🏻 Had some more analogical work to do lately but glad I could sneak this newer approach and check your great work too. I'll keep an eye on it and ask you when I'm having any ambiguities but from the birds eye view it looks great. Hope we can create a solid standard here or at least a good start for it as there were lots of suppositions on this illusive topic and saw even high profile personalities that said many cloudy and unrealistic things on these matters.

Have a great weekend!

Blu3l1ght ✌🏻

B1u3l1ght commented 9 months ago

Yep looked more closely and it looks OK. Now we should not get scared much about big numbers we can make script print results in scientific (decimal) format like let's say 3.23×10³⁷ vs 3.781×10³⁷, anyone can figure the 2nd number is bigger. Script can be left to run in the background a predefined time (10/30/60 second) chosen by the user then script can retain bbp that scores best.

When we reach higher T and longer S relative to T we start getting slightly into quantum math. Patterns do not hold as much vs others but gets thinner so it's harder to put finger on which one is the best.

This phenomena let's call it this way starts from about 94^40 and gets worse and worse when reach 94^94. But I'll try to figure out exactly when this 'quantum bang' happens. Around this issue I have an explanation. When big T&S are used patterns are so many and they start 'grouping' for example patterns with two of threes 33222111.... are stronger than say 3222211... etc. So for big numbers we have to group patterns themselves. We will no longer have strongest patterns but strongest groups of patterns but that doesn't mean there isn't in there a stronger a bbp that can be deduces via my introduction of this reply.

I'm gonna try using low cap t and s from now so math guys not get annoyed 😇. I'll try do the calculator with the new broader formula and maybe we can integrate it into your script or simply keep them separated.

Have you thought about ASLR if we can integrate some things we talked here? I mean I don't think anyone would be happy to have one of its root, core processes allocated on addresses that can be easily guessed like those highlighted in red rectangle. nh43

Cheers ☮! B1u3l1ght✌🏻