marimo-team / marimo

A reactive notebook for Python — run reproducible experiments, execute as a script, deploy as an app, and version with git.
https://marimo.io
Apache License 2.0
6.81k stars 241 forks source link

Likely a 750 cell limit due to cell id hash collision #1962

Open dmadisetti opened 2 months ago

dmadisetti commented 2 months ago

Description

Was just going through my notes and I came a comment about this line:

https://github.com/marimo-team/marimo/blob/45787483f8ce88a11067230f7edd37cee000df1d/marimo/_ast/app.py#L416

Which at the time struck me struck me as a pretty small ID space to be collision free for long. You can easily check when the first collision is:

import string
import random

random_seed = random.Random(42)

# 4 random letters
hashed = set()
while (n := "".join(random_seed.choices(string.ascii_letters, k=4))) not in hashed:
    hashed.add(n)

print(len(hashed))
print(n)

Suggested solution

42 is just a little unlucky in this context

seed = 4 gets you x10 at 7908 before collision and seed = 93074 gets you 12735 but still very shy of the 6497400 possible cells (i think as a consequence of the birthday paradox)

Alternative

I don't really think this is needed. 750 cells is massive- just a curiosity but a potential issue in the future. A future solution could be to skip duplicates for backwards compat, or change the letter set at this point.

Additional context

I haven't tried, but just changing the seed will likely break a ton of tests / create unneeded churn, and might impact existing island exports. Just sharing.

akshayka commented 2 months ago

I think this is good to flag. Some of our users have extremely long notebooks.

A future solution could be to skip duplicates for backwards compat, or change the letter set at this point.

I'm fine with either of these.