Gensysms aren't good enough

gilch commented 1 year ago

Gensysms currently use a simple counter prefix to ensure uniqueness. It's at least locked to prevent thread races from accidentally reusing a number, but this doesn't help if someone restarts the interpreter! If all compilation were guaranteed to happen in the same session, this could work, but I can't make that guarantee and keep the standalone property, a major design goal of Hissp.

To illustrate: a standalone library could be precompiled to Python, distributed without source code, and (in an application) combined with another such library that would certainly re-use the same counters (being compiled by a separate interpreter starting the count at 1 again). Gensyms quite often have single-letter names, making a collision well within possible when counters are reused, leading to some uncommon, but very annoying bugs.

The obvious solution would be to use a UUID instead of a counter. This would work but seems awfully heavy-handed. Compiled code needs to remain human readable for debugging purposes. Gensyms are hard enough to read as they are, but you get used to them, and you could type them into a debugger if you had to.

A UUID would make them quite a bit longer, and the addition would be pure noise. The length issue could be mitigated somewhat by rendering in base32 instead of hex. Base64 would be better, but +/ aren't valid in identifiers. Those could probably be replaced with 9_ without losing too much entropy. I could probably just chop off bytes and use a much smaller space without issue, but I'd have to think carefully about how much. It might also be possible to use higher Unicode, but that seems complicated. That also risks terminals and debuggers not being able to display things properly and would make typing in identifiers more difficult and error-prone, although a pure ASCII alphanumeric identifier would also have that issue to some degree if it's mostly noise. Not ideal.

gilch commented 1 year ago

The next solution I came up with would be for the gensym prefixes to include their module name and a template counter. There's no guarantee that modules will be compiled in the same session (and in some perfectly reasonable use cases, they likely won't be) but we can expect that all forms within a file would be. Module names are also not likely to collide when used together, or you'd have a hard time importing them.

However, it's still possible that a module providing macros could go through several revisions, which wouldn't change the name. Code compiled with an old revision could interact with code compiled with a new one. This seems easy enough to avoid in your own code (Just do a clean build. You'd do that anyway, right?) but libraries distributed without source could have transitively depended on different revisions of the same macro provider that they didn't include. Ugh.

Seeding the counter with a timestamp or something could mitigate the issue, although it's not clear how much precision would be required, and high enough precision would significantly add to the length. We'd also be giving up reproducible builds, which isn't a stated design goal of Hissp, but maybe ought to be. A UUID would also have this problem, come to think of it.

And that still leaves __main__, a module name that everyone uses. Hissp is just data structures. The compiler doesn't require you to include a file, even though the reader would like it. You don't usually precompile main as main though. Maybe this isn't an issue.

gilch commented 1 year ago

How does Clojure deal with this? I think it doesn't? AOT compilation is the exception there, not the norm. I don't think it does any better than my module name + counter idea.

Older Lisps use symbol interning to ensure that gensyms are unique. I think Clojure can't do this because it's hosted.

Hissp is hosted too. Symbols are kind of a fiction in Lissp, and don't exist in Hissp, where they're just strings. Python does intern strings, and CPython (at least) pretty aggressively interns identifiers, but this is just an optimization and there are no guarantees. I wonder how easy that is to check. Perhaps Lissp could avoid making a gensym if it's already an interned string. It's also possible to generate symbol tables when you have source. I'm not sure how much any of this helps.

gilch commented 1 year ago

PEP 552 – Deterministic pycs changed the .pyc format to use a hash instead of a timestamp to enable reproducible builds. That seems like a viable solution.

Template count, module name, and module hash should be sufficient. And actually, the module name is redundant given the hash, though it's not particularly human readable. The Lissp Lexer gets the entire code string as an argument, so the Lissp reader should be able to call hash() on it, or something from the hashlib, although that's probably not necessary; CPython's hash() would use siphash24 for strings, which has pretty good characteristics.

Gensyms probably don't need to use all 64 bits either. Truncating to 8 hex digits is usually good enough for Git, although a few extremely large projects need 12 now. I think I could also XOR that with the counter to avoid increasing the gensym length any further than that. It would take fewer than 8 identifier characters to encode 8 hex digits.

gilch commented 1 year ago

OK, experimenting a little, it looks like CPython randomizes the hashes. You can set PYTHONHASHSEED if you need the reproducible builds. Do I want to make the users figure that out? Maybe not. Extra randomization might make accidental collisions even more unlikely though. I could try something from hashlib instead. I thought siphash24 was a pretty good choice. I wonder if there's some other way to access it. Maybe one of the others has suitable characteristics.

Also, the Lexer doesn't have the whole module. It has the current top-level form. Is this a problem? It means different revisions of the module might have top-level forms that haven't changed, even though code in the same module using them might. That means the module name is technically not redundant.

I could put the counter and module name in a tuple with the form code and hash that instead. That way the module name at least goes into the hash. I could keep the count and name human-readable and separate, although that would add considerably to the length, it might not hurt readability much. Or I could compute the code hash in the Lissp.compile() function. That still doesn't work in the REPL though. Maybe that's fine.

gilch / hissp

Gensysms aren't good enough #210