shane-mason / essential-generators


Feature Request: set a random seed, get deterministic output (kinda works already) #1

Closed: MatrixManAtYrService closed this issue 3 years ago

MatrixManAtYrService commented 3 years ago

Hi, I'd like to use your (excellent) library to generate test datasets in a deterministic way.

I'm puzzled by my findings: the output appears to be deterministic for short leading substrings, but past a certain length the random seed seems to get lost. I wrote a small script to show you what I mean:

import random
from hashlib import md5
from essential_generators import DocumentGenerator

g = DocumentGenerator()

def compare(depth, seed=None):

    print(f"{depth} characters deep:")
    left = []
    right = []

    def go():
        payload = []

        # reset the random seed if specified
        if seed is not None:
            print(f"using seed {seed}:")
            random.seed(a=seed)

        # generate five paragraphs
        # but only analyze their leading substrings
        for _ in range(5):
            m = md5()
            m.update(g.paragraph()[0:depth].encode())
            payload.append(m.hexdigest())
        return payload

    left = go()
    right = go()

    for l, r in zip(left, right):
        print(l, r)

# no expectation of equivalence
print('no seed init:')
compare(15)

# expect equivalence and find it always
print('\nwith seed init:')
compare(15, seed=1)

# expect equivalence and find it sometimes
print('\nwith seed init:')
compare(35, seed=1)

# expect equivalence and find it never
print('\nwith seed init:')
compare(100, seed=1)

Here is the output:

no seed init:
15 characters deep:
99cd370a256a08c0935df07588a9d149 5712ca694a784b7de849030127b5f8bf
51f9589d4c3eeaac7beaffab1bd4aabe 302ef08d6af8864527c166c50b41e9b6
c8f6a33ab25cc4d36c1929325d10bd1e 1b8736fc0574498b0485242c2037c433
36af5f84a2564a557bfab9bfb43d0aee c53d1f3bb0684b2fc648a28daf954e56
ef2386e0decfcf4275fa11497d72a934 3744e106265e39d25d9b4b9b1860674f

with seed init:
15 characters deep:
using seed 1:
using seed 1:
c91ab6b5cf81131dabe56fb1128d2819 c91ab6b5cf81131dabe56fb1128d2819
52503d52c197f3e09698885a295f6944 52503d52c197f3e09698885a295f6944
57befb45c97732d0847ad512d75a4310 57befb45c97732d0847ad512d75a4310
adeae47e78fdba591212906ca5b13eb2 adeae47e78fdba591212906ca5b13eb2
19df57795b6f5020885bf04f8a677d9d 19df57795b6f5020885bf04f8a677d9d

with seed init:
35 characters deep:
using seed 1:
using seed 1:
b4e90b4567af47ca36b26093d7ee0d45 b4e90b4567af47ca36b26093d7ee0d45
1dc7d702222970c395ee7124326938b9 1dc7d702222970c395ee7124326938b9
ecc3955fc825be75aaa4100dca837f46 03ff6e24ecf360ef080cee5ee83439e6 <--- huh?
f98d0f54fc27985d055bc959d395955c f98d0f54fc27985d055bc959d395955c
452d9069d8e02fcf1b85fac3afa730ee bbaf39bccf8db61ca515b7821a5ea34e <--- huh?

with seed init:
100 characters deep:
using seed 1:
using seed 1:
5b642f729efb10cefa35dcb3241d8f9e b50950a765dfd6fbc73a07627924129e
e0e63f320af5c37d018e048406a8c653 76580ed2a943c5e92ee4ed9bcbf738c0
14e2456f05ff42248d5a39239771bc3b 8d35636e43182866dce4ce9e761b7f64
186e426a961868df3b7942f3d1fca404 5ffa91b250be0e69a35ac44995a02f5f
f0088047f353cd394a0667fa5dfac253 0a1b79ed08951dad813aa7ce1b399a49

So I guess this is a feature request: Can you add a way to supply a randomness seed so that it can be made to produce the same pseudorandom output every time?

If you don't feel like it, can you give me a hint about where the seed is being forgotten? Then I'll take a crack at it in a fork.

Thank you.

shane-mason commented 3 years ago

This is a great feature suggestion. I can work on this and provide an update.

shane-mason commented 3 years ago

When this library was first released, Python 3.5 did not include random.choices, which accepts weights and returns a list of length k, so I added a preview implementation from the then-unreleased 3.6 to emulate it. Now that random.choices is generally available, I have switched to the built-in method. That fixed the lost-seed problem you described above: the script you provided now prints the same hash in the left and right columns. I just committed this to git and will update the Python package on PyPI soon.
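
For anyone reading along, here is a minimal sketch of why the built-in method respects the seed. It is not the library's actual code: the word list, the weights, and the `sample` helper are made up for illustration. The module-level `random.choices` draws from the same generator that `random.seed` resets, so reseeding reproduces the same weighted picks. A picker backed by a separate generator, shown here with `random.SystemRandom` purely as a contrast, is unaffected by the seed. I am not claiming that is what the old emulation did.

```python
import random

# Hypothetical vocabulary and weights, for illustration only.
words = ["alpha", "beta", "gamma", "delta"]
weights = [5, 3, 1, 1]

def sample(seed, k=10):
    # random.seed resets the shared module-level generator,
    # and random.choices draws from that same generator.
    random.seed(seed)
    return random.choices(words, weights=weights, k=k)

first = sample(1)
second = sample(1)
print(first == second)  # True: same seed, same weighted picks

# Contrast: SystemRandom reads from the OS entropy source and
# ignores random.seed, so its output is not reproducible.
independent = random.SystemRandom()
random.seed(1)
unseeded = independent.choices(words, weights=weights, k=10)
print(unseeded)
```

The same reasoning applies to any helper that keeps its own source of randomness: calling `random.seed` only makes the output repeatable if every draw goes through the module-level generator (or through a `random.Random` instance you seed yourself).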

MatrixManAtYrService commented 3 years ago

That's fantastic, thank you!