hylo-lang / hylo

The Hylo programming language
https://www.hylo-lang.org
Apache License 2.0
1.15k stars 53 forks source link

Consider using ordered data structures only #1480

Open kyouko-taiga opened 1 month ago

kyouko-taiga commented 1 month ago

It is typically desirable for a compiler to produce the same output for a given input across multiple compilation runs. Some even advocate that two runs with the same input should always produce bitwise identical outputs.

We get close in hc because mangling produces predictable names and because the order in which we insert definitions in LLVM modules is based on the order in which the declarations were parsed. So while it's difficult to "predict" which function will appear first in the generated output, at the very least it is stable.

But if the LLVM output is ordered, none of the previous phases' artifacts are. One reason is that we use many sets and dictionaries. I think we should revisit this choice. In particular, I think it would be nice if a compilation run was guaranteed to take the same execution path every time. That's a good property for debugging.

I guess the only way to achieve that is to use ordered data structures everywhere. That will probably add some space and time costs, though, so we could also consider using simple arrays in some places. That's because many of the sets we use are so small that maintaining a hash table is probably more expensive than doing linear searches. Of course we'd need to profile this wild assumption.

dabrahams commented 1 month ago

only way to achieve that is to use ordered data structures everywhere.

We can still use hashed data structures but we might need to write them ourselves (or just duplicate Swift's implementations) to bypass hash randomization.

Adding concurrency can also cause nondeterministic orderings if you're not careful. It certainly can cause different execution paths as soon as you have anything like a thread pool.

Metalymph commented 3 weeks ago

Hi, you made me be curios about and I have just played around with a Clock based basic test; in results.me there's a possible generated result (compiling and executing can be generated again). It seems like Arrays and Hashed structures (Dicts or Set etc) behave as expected with comparable performance, so probably using Arrays is not the key. Could be the use of custom hashing (especially using perfect hashing functions when possible) a solution? About concurrency, correct me if I'm wrong but, won't Swift 6 provide an isolated function for guaranteeing a deterministic execution approach? From what I remember it's a feature part Actors area.

kyouko-taiga commented 3 weeks ago

What are your results showing for relatively small data structures? Because most of our collections are pretty small; under 50 elements. Last time I checked with micro-benchmarks, hashed data structure were really not shining in such use cases.

We're not using Swift's actors in the compiler. Perhaps we should in the future but I'm under the impression that it would require rewriting pretty much everything with an async keyword. I'm hoping to keep using the less invasive approach that we've successfully used so far (modulo the UB caused by too deep stacks).