ollef / sixten

Functional programming with fewer indirections
BSD 3-Clause "New" or "Revised" License
758 stars 26 forks source link

Add string literals as primitives to improve string compilation speed #112

Open ollef opened 6 years ago

ollef commented 6 years ago

On my machine

x = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

takes almost 8 seconds to compile. This obviously won't do!

ollef commented 6 years ago

The problem here is that the vectors that strings desugar to contain implicit type arguments of size on the order of the length of the string for each element in the vector, meaning that the term's size is quadratic in the length of the string.

The commit https://github.com/ollef/sixten/commit/2be98a06fc71ea415cdcc1f2ea2a74e5216e91ac improves the example down to about half a second, because it allows the big type arguments to be constant folded after type-checking, which in turn means that LLVM is no longer fed thousands of lines of pointless type calculations that it's slow (but good) at optimising.

We still have quadratic terms during typechecking though, so maybe we should just add strings as a primitive literal instead of desugaring them to arrays/vectors.

ollef commented 6 years ago

We might be able to use sharing to fix the quadratic size of the terms by maximally sharing the length parameters, which would take them down to linear size again.

AnthonyJacob commented 5 years ago

You might be interested in Quantitative type theory.

McBride [ 25 ] has recently proposed a resolution to this conflict by combining the work on erasability and quantitative types. His insight is to use the 0 of the semiring to represent information that is erased at runtime, but is still available for use in types (i.e., extensionally). ... In this paper, we fix and extend McBride’s system, and present semantic interpretations that fully exploit the usage information.

ollef commented 5 years ago

Thanks, that's indeed generally relevant research to Sixten.

To clarify, the runtime representation isn't the problem in this issue, and I don't think erasure would help. Strings are compiled to flat arrays of characters, as you'd want.

IIRC what's currently taking the most time is that we compute the type of every subterm during desugaring. This is done so that the code generator has easy access to the runtime representation (size in memory) of any subterm. This is quadratic for size-indexed structures like vectors, even though it's simplified away to constant integers in the end. These types are not generally erasable since Sixten uses them at runtime, though I think we could be more clever about trying to not evaluate them as often as we sometimes do. Perhaps we could e.g. find a scheme that only computes the size of the whole vector and not every subterm as we do now.