Inch toward the finish line: As Seen From & Scala std lib processing

gkossakowski / kentuckymule

Limits of Scala typechecking speed

https://medium.com/@gkossakowski/can-scala-have-a-highly-parallel-typechecker-95cd7c146d20

Other

156 stars 10 forks source link

Inch toward the finish line: As Seen From & Scala std lib processing #6

Open gkossakowski opened 6 years ago

gkossakowski commented 6 years ago

Background read: Can Scala have a highly parallel typechecker?

This is a meta issue with a quick overview of all tasks I came up when I thought of adding support for the As Seen From (ASF) algorithm and for processing the Scala std lib to Kentucky Mule.

I'm following the format from my work on zinc that turned out to be successful in managing the complexity of a difficult project I worked on in the past: https://github.com/sbt/sbt/issues/1104 It shipped as part of the sbt 1.0 release recently.

This ticket describes the second stage in Kentucky Mule's development. The first stage was about validating whether the ideas for rethinking compiler's design with focus on performance are in touch with how fast computers actually compute. The second stage is about confronting Kentucky Mule's ideas with what I consider the most tricky aspects of Scala's design to implement in a performant way.

I'm planning to update this issue with interesting findings and roadblocks I hit working towards finishing the tasks listed below. My intent for this issue is twofold:

keep me organized in marching towards adding support for processing Scala's standard library in Kentucky Mule
give busy but curious passerby a place to track the progress of Kentucky Mule's evolution

Context

In my most recent blog post on Kentucky Mule, I wrote:

Adding support for processing Scala’s standard library in Kentucky Mule would be a good trial for the As Seen From treatment. The standard library makes a heavy use of mix-ins, abstract types that are refined multiple times in the inheritance chain and type constructors. These are the features that remain the riskiest for the architecture to support.

I think adding support for Scala standard library to Kentucky Mule should take another 15 days of deeply focused effort. And I think it would be a really captivating project to work on. Once that is done, I think Kentucky Mule would become a complete proof of concept for a highly parallel Scala compiler architecture.

I'm breaking up the work required for adding the As Seen From and some other missing Scala language features to Kentucky Mule into a list of smaller tasks. The list is not meant to be set in stone and will get refined as more issues and tasks come to light.

Once I surfaced the list, I realized the scope is a bit larger than I originally speculated. I'm revising previous prediction of 15 days of deeply focused effort to complete this project and bump it to 20 days.

Tasks

Missing language features

[x] monomorphic type aliases: type T = String
[x] polymorphic type aliases: type Foo[T] = List[T]
[x] package objects
[x] constructor multiple param list
[ ] def multiple param list
[ ] method dependent types
[x] resolution of this in paths
[ ] type bounds for type members
[ ] type bounds for type parameters
[x] implicit import of Predef to every compilation unit
[x] implicit import of scala and java packages to every compilation unit
[x] support for super in paths
[x] singleton types
[ ] type projections
[x] tuple types (contributed by @trane in #7)
[ ] function types
[x] import renames
[x] self-types
[ ] method-dependent types

Implementation features

Features that are not strictly language features but need to be implemented for other reasons.

[x] empty package handling (surprisingly hard to get right)
[x] cycle detection between completers

As Seen From

Surprisingly, Kentucky Mule implements some aspects of the ASF already. E.g. Rule 3 for the type parameters is partially supported. I don't have a good intuition what's missing even when I have ASF's precise definition in front of my eyes. For performance reasons, Kentucky Mule implements ASF's features differently from how they're specified so simple diffing between the implementation and the spec doesn't cut it. I'll come back to this section as the implementation of the other language features mentioned above progresses.

Status

I haven't touched Kentucky Mule for almost a year and I'm picking it up now to work on tasks in this ticket continously until I check all the boxes. Kentucky Mule remains my side project so I aim for a steady but slow progress done over the course of some of my evenings.

I implemented the original, first stage of Kentucky Mule in a short span of time when I was working on it full time. I'm curious if my 20 days prediction of completing this project will hold when the days are broken into a larger number of evenings.

smarter commented 6 years ago

From past experience, I can tell you that one of the most difficult part of compiling the collections in the standard library is proper support for F-bounded polymorphism, there are potential cycles everywhere that need to be avoided like minefields, some examples here: https://github.com/lampepfl/dotty/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Amerged%20cyclic%20reference%20

gkossakowski commented 6 years ago

Interesting! Thanks a lot for the reference. I suspect I might get away with not stepping into the F-bounded polymorphism minefield. The F-bounded polymorphism is where parametric polymorphism and subtyping meet. However, outline types deliberately do not support subtyping checks so in Kentucky Mule F-bounded polymorphism collapses into the regular parametric polymorphism. I'll keep an eye on this while working on the implementation, though.

gkossakowski commented 6 years ago

I just pushed the implementation of package object in Kentucky Mule! Implementing it gave me an interesting insight into the cost of this language feature. I'm copying my notes from notes.md with all details.

Package objects

Initially, Kentucky Mule didn't support package objects. I felt that they're not an essential Scala feature and implementing package objects would be tedious process but not affecting the design of Kentucky Mule in substantial way. Now, I'm looking into implementing support of package objects and I see they're tricky to implement. The complexity boils down to combining two features

openness of packages
inheritance from parents in objects

This makes it pretty tricky to compute all members of a package. Without package objects support, list of members in a package is available right after entering of all symbols is done: members of a package can be determined purely from lexical scoping extended across all source files. Previously, I implemented lookup in package by simply looking at its scope but once package objects get involved, member lookup has to go through a type. Packages need to receive support for type completers as a result.

My current plan is to have two different paths for completing package type:

no-op type completer for packages that doesn't have a package object associated with them
a type completer that merges members declared lexically in package with members computed by a type completer for package object

The end result will be that lookups through package type will be as performant as the current implementation that accesses package scope directly.

Nov 5th, 2017 update

I implemented the package object support via adding package types as described above. I experimented with two different designs for looking up members in packages:

In package type have two phase lookup: a) lookup in package object's members (if there's a package object declared for that package) b) if member not found in package object, check package declarations
Have package type completer collapse (copy) members from package object and declarations of package into one Scope. Lookups are probing just that one scope.

The tradeoff between 1. and 2. is the performance cost of PackageCompleter's vs lookups in the package.

The 1. optimizes for faster package completer: if there's no package object declared, package completer is noop. However, package member lookups are slower because two scopes need to be queried.

The 2. optimizes for faster package lookups: all members are collapsed into one Scope instance and package lookups is simply one operation on a Scope. The price for it is slower completer: it has to copy all members from package Scope and from package object Scope into a new scope.

Initially, I thought 2. is too expensive. Copying of all members sounded really wasteful. However, benchmarking showed that the overall performance of completing all types in scalap is better with the second strategy. In retrospect it's actually not surprising: copying things is pretty cheap with modern CPUs but handling branches and two Hashmap lookups continues to be expensive. Also, one performs many more lookups in package compared to a single completion of the package.

Package object performance cost

Even with the very careful analysis of all options I could think of, I didn't manage to find a design that would make package object implementation not expensive. Let's look at the numbers:

Before package object implementation (d382f2382ddefda3bebc950185083d6fa4446b27)
[info] # Run complete. Total time: 00:09:17
[info] Benchmark                            Mode  Cnt      Score     Error  Units
[info] BenchmarkScalap.completeMemberSigs  thrpt  120   2169.029 ±  11.857  ops/s
[info] BenchmarkScalap.enter               thrpt  120  12564.843 ± 205.093  ops/s

Package object implemented
[info] Benchmark                            Mode  Cnt      Score    Error  Units
[info] BenchmarkScalap.completeMemberSigs  thrpt   60  2046.740 ± 21.525  ops/s
[info] BenchmarkScalap.enter               thrpt  120  12454.014 ± 96.721  ops/s

Enter performance is about the same. However, completeMemberSigs loses ~130 ops/s which is around 6% performance drop. I think for such a fundamental feature as package objects, the cost is somewhat accepatable.

I was curious what's the source of slow-down, though. One way to look at is the size of the completer's queue. Adding package completers increases the number of tasks from 930 to 974 (this is 4.5% increase). I also experimented with running package completers but still relying on the old package lookup directly in symbol's scope instead of going through it's type. I found that additional indirection of package member lookups (going via type) in lookupOrCreatePackage is responsible for about 1.5% performance drop. The rest of it is attributed to more complex logic in enterTree (that handles package objects) and to running package completers that need to copy members.

gkossakowski commented 6 years ago

master now contains support for the empty package. Surprisingly, the empty package handling in Scala is a very good example how to not design software. The empty package is described in the Scala specification 9.2:

Top-level definitions outside a packaging are assumed to be injected into a special empty package. That package cannot be named and therefore cannot be imported. However, members of the empty package are visible to each other without qualification.

Reading this description, empty package (packaging) seem to be a straightforward and a small technical detail of Scala. I thought I'd implement it quickly and move onto more interesting problems. I ended up spending a fair amount of time debugging what turned out to be an odd behavior of both scalac and dottyc. I added a 70 lines long comment explaining my original wrong assumptions and my attempts at fixing my implementation and the eventual surrender to an ugly hack.

Let's look at an example:

// A.scala
class A
package foo {
  class B
}

reading the spec, one would expect this to be interpreted as:

// A.scala
package <empty> {
  class A
}
package foo {
  class B
}

however, both scalac and dottyc parse the original example as:

package <empty> {
  class A
  package foo {
    class B
  }
}

The rule the parser follows is: if you have at least one top-level definition outside of a package, everything is wrapped into an empty package declaration. This AST structure is slightly wrong: no package is allowed to be declared inside of the empty package. Both scalac and dottyc have a special handling in later phases for this strange ast structure. Eventually, I adopted the same special handling in Kentucky Mule. The code is one line but I felt it deserved a really long comment.

I wondered why would one wrap all declarations in the empty package declaration. I believe it's an example of a misguided sense of an economic behavior: cutting down on types of the ast nodes, unnecessarily. The parser is expected to return a single ast node, so multiple top-level declarations need to be wrapped into something. One solution is to introduce an ast node called CompilationUnit that is simply a container for all top-level declarations. However, I'm guessing both scalac and dottyc authors thought it would be ok to use the empty package declaration as a container. I lost a good couple of hours learning the consequences of optimizing the wrong metric.

gkossakowski commented 6 years ago

I just pushed implementation of import renames. Here're my notes:

Let's consider an example:

object A {
  object B
}
import A.{B => B1, _}

In addition to fixing lookup for B1 and B: B1 is now visible and B is not, my commit also fixed handling of the wildcard import. The subtle thing about the wildcard import is that it excludes both B and B1 from the list of imported members of A. That is, even if A has a member B, it won't be imported under its original name: it's been renamed to B1. If A has a member B1 it won't be imported either: it's shadowed by B renamed to B1. This is specced in SLS 4.7:

If a final wildcard is present, all importable members z z of p other than x1,…,xn,y1,…,yn x1, …, xn, y1,…,yn are also made available under their own unqualified names.

To implement wildcard as specified above, I have to remember if the name I'm looking for in a given import clause has appeared in any of its selectors. This is done while scanning selectors for a match. If I don't find a match, and if the name didn't occur in any of the selectors and import clause has a wildcard selector, I perform a member lookup from the stable identifier for the import clause.

Interestingly, I found a bug in scalac that allows rename of a wildcard to an arbitrary name:

scala> object Foo { class A }
defined object Foo

scala> import Foo.{_ => Bla}
import Foo._

// A is available because the import is a wildcard one despite the rename!
scala> new A
res1: Foo.A = Foo$A@687a0e40

// Bla is not available
scala> new Bla
<console>:16: error: not found: type Bla
       new Bla

Now, sadly, this commit introduces 5% performance regression for very unclear reasons. We went from around 2050 ops/s to 1950 ops/s. I tried to narrow it down. This simple patch restores most of the lost performance (brings us back to 2040 ops/s):

diff --git a/kentuckymule/src/main/scala/kentuckymule/core/Enter.scala b/kentuckymule/src/main/scala/kentuckymule/core/Enter.scala
index ed9e1fb5e1..ee75d3dc45 100644
--- a/kentuckymule/src/main/scala/kentuckymule/core/Enter.scala
+++ b/kentuckymule/src/main/scala/kentuckymule/core/Enter.scala
@@ -596,14 +596,14 @@ object Enter {
           // all comparisons below are pointer equality comparisons; the || operator
           // has short-circuit optimization so this check, while verbose, is actually
           // really efficient
-          if ((typeSym != null && (typeSym.name eq name)) || (termSym.name eq name)) {
+          if ((typeSym != null && (typeSym.name eq name))/* || (termSym.name eq name)*/) {
             seenNameInSelectors = true
           }
           if (name.isTermName && termSym != null && (selector.termNameRenamed == name)) {
-            seenNameInSelectors = true
+//            seenNameInSelectors = true
             termSym
         } else if (typeSym != null && (selector.typeNameRenamed == name)) {
-            seenNameInSelectors = true
+//            seenNameInSelectors = true
             typeSym
           } else NoSymbol
         }
@@ -616,7 +616,7 @@ object Enter {
       //   If a final wildcard is present, all importable members z
       //   z of p other than x1,…,xn,y1,…,yn
       //   x1, …, xn, y1,…,yn are also made available under their own unqualified names.
-      if (!seenNameInSelectors && hasFinalWildcard) {
+      if (/*!seenNameInSelectors &&*/ hasFinalWildcard) {
         exprSym0.info.lookup(name)
       } else NoSymbol
     }

If you look at it closely, you'll see that the change is disabling a couple of pointer equality checks and assignments to a local, boolean variable. This shouldn't drop the performance of the entire signature completion by 4%! I haven't been able to understand what's behind this change. I tried using JITWatch tool to understand the possible difference of JIT's behavior and maybe look at the assembly of the method: it's pretty simple, after all. Unfortunately, JITWatch was crashing for me when parsing JIT's logs and I couldn't get it to work. I'm just taking a note and moving on feeling slightly defeated.

gkossakowski commented 6 years ago

This note is about how I think of staying true to my original Development principles, in particular being absurdly paranoid about the performance, and getting anything done.

While working on the package object support and import rename support, I encountered two different performance regressions that led me to developing some simple tactics and tools (right metrics!) that help me deal with performance losses efficiently.

I'll tell the backstory how I came about developing them.

Package objects

When I originally implemented package objects support, I saw almost 50% drop in performance of completing type signatures. I expected some performance hit. I explained in earlier comment the source of the slowdown. However, I was completely lost why this feature would cost 50% of my performance. I opened Yourkit profiler and tried comparing profiles before and after my change. They looked completely different which didn't make any sense. After all, I was benchmarking on scalap source as an input and that doesn't even have a package object declaration so the execution paths should be roughly the same!

"Was I hitting some pathological case in JIT?", "Was I lucky before because I was particularly JIT friendly and suddenly, the code became to complex to optimize it well?", "Is Kentucky Mule a house of cards that's falling in front of me?" were the thoughts I was having at the time.

Then I thought that I might be doing something expensive without realizing it. Out of curiosity, I checked the stats on how many completer tasks are executed during processing scalap sources. Before package object implementation, 930 tasks were executed and with package object implementation, around 1800 tasks were executed. Aha, so Kentucky Mule was doing twice as much work so 50% performance drop totally make sense. But why there are twice as many completion tasks? I added another stat so the output would look like this:

Number of jobs processed: 930 out of which 44 finished with dependency miss

And then for package object impl commit it look like this:

Number of jobs processed: 1800 out of which 890 finished with dependency miss

Aha, so we're suddenly missing a lot of dependencies. Packages now always have completers: package completers check for the existence of package objects to determine the final list of members of a package. So even though, scalap doesn't have package object declarations, all regular package declarations contribute additional completion tasks. "That should be at most a couple of completers more" I thought to myself. "And what about these missing dependencies?

Every other completer, directly or indirectly, depends on looking up something in a package. Here's what was happening. Let's pick a simplified example:

package foo

class A extends B
class B extends C
class C

The queue at the beginning is: [complete A], [complete B], [complete C], [complete foo]

The first completer tries to lookup A's parent, asks foo but foo is not completed yet so it returns a missing dependency result. The [complete foo] task is appended to the queue and followed by [complete A]. In that order, so we know the next time we try to complete A, foo had a shot at being completed. The queue now looks like this: [complete B], [complete C], [complete foo], [complete foo], [complete A]

The same goes for the second completer, and so on. We can see what's happening: package foo that is scheduled to be completed last is a dependency for everybody else. It forces the entire queue to run once without completing anything. We do twice as much work as a result! The actual details were more complicated but the gist is right there: packages should be manually scheduled to be completed first. I made a simple change to enforce that and both reported number of completers jobs executed and the actual performance numbers were back on track.

It's also clear why yourkit was useless in this case: the profile was indeed completely different, for half of the execution of the run, the results returned by completers were different. This caused the entire tree of method calls to have different timings and simply look incomparable. Profilers like yourkit are good at spotting local changes to the runtime profile but are really bad at helping with the global changes like the one I had.

For some time I've been thinking of Kentucky Mule to be an exploration of applying systems programming thinking to compiler design. This is the best story so far supporting that approach. Systems programmers are very good at coming up with units of computational abstraction that have predictable performance characteristics and give rise to thinking of performance in those units instead of raw nanoseconds. If it's a row in a file for MapReduce, or a row in a database, both are abstractions that can be reasoned about their performance in terms of their quantity.

For Kentucky Mule, that unit of abstraction is a completer together with the completer queue. These days, the first check I perform is looking at the size of my completers queue and dependency misses to spot check whether I didn't make any mistake. Having these stats is a delightful superpower in reasoning about runtime characteristics of my little system.

Import renames

Import renames introduced a mysterious performance regression by 5% and I couldn't figure out why. It was frustrating. I set out to be paranoid about every last bit of performance and I didn't want to give up on investigating this problem. It looked like some JIT problem that might be interesting to investigate and learn from. But JIT analysis tools are immature (they crashed for me). And what about 12 other tasks still waiting to be finished? When it's time to calls it quits because the ROI is not there?

I did a simple math. Let's assume that I have not 12 other tasks still waiting to be finished but 30 because I haven't discovered all remaining work. There's always more work than you think. Let's assume that I lose 5% of performance every time I implement one of the remaining tasks. The total performance would drop down to 0.95^30 ~ 0.2 of the current performance. That is: the cumulative slowdown would be 5x so not even an order of magnitude. Kentucky Mule processes scalap sources at ~4 million lines of code so even an order of magnitude performance drop would be still an amazing result if I had all Scala language features implemented. If I can't figure out the source of that 5% drop, I don't lose sleep over it.

This points to some larger truth when it comes to performance in software engineering: if you have a great start, you have a lot of surplus performance you can spend on non-essential aspects of the project.

This is the opposite of "premature optimization is the root of all evil". And it's ok. Kentucky Mule is all about pushing the limits of the performance and also getting something done.

gkossakowski commented 6 years ago

On January 17 of 2018, I finished a redesign of the completers queue. Now, I'm writing notes about the redesign and why is it really exciting.

The concern of completers queue is scheduling and an execution of jobs according to a dependency graph that is not known upfront. The dependency graph is computed piecemeal by executing completers and collecting missing dependencies. Only completers that haven't run yet are considered as missing dependencies.

In my early notes, in the "Completers design" section, I described the origins of the cooperative multitasking design of completers and I speculated what capabilities that design could unlock.

The redesigned completers queue takes these earlier speculations and turns them into an implementation.

The queue:

tracks completers' dependencies that are returned upon their execution
schedules completers to execute
detects cycles in completer dependencies at almost zero cost(!)

To my astonishment, I came up with a design of completers queue that can safely run in parallel with a minimal coordination and therefore is almost lock-free. This is a very surprising side effect of me thinking how to detect cycles with a minimal overhead.

To explain the new design, let me describe the previous way of scheduling completers. During the tree walk, when I enter all symbols to symbol table, I also schedule the corresponding completers. Completers for entered symbols are appended to the queue in the order of symbol creation. The order in the queue doesn't matter at this stage, though. Once the queue execution starts, the queue picks up a completer from the head of the queue, runs it and can receive either of the two results:

completed type for a symbol
dependency on a symbol that doesn't have a completed type (its completer hasn't ran yet)

The first case is straightforward: the completer did its job and is removed from the queue. In the second case, we append to the queue two jobs:

the blocking completer that hasn't ran yet and blocks the completer we just picked up from the queue
the blocked completer

If we picked up from queue completer A and it returned the dependency on incomplete B, we first append B and then A to the queue. This guarantees that the next time we try to run A, it will have B completed already.

This very simple scheduling strategy worked ok to get Kentucky Mule off the ground. However, it can't deal with cyclic dependencies. E.g.:

class A extends B
class B extends C
class C extends A

Would result in the job queue growing indefinitely:

step 1: A B C
step 2: B C B A # appened B A
step 3: C B A C B # appended C B
step 3: B A C B A C # appended A C
...

Even if we didn't put the same job in the queue multiple times, we would get into an infinite loop anyways.

In Kentucky Mule cycles appear only when the input Scala program has an illegal cycle. The result of running the queue should be an error message about the detected cycle.

The new design of the queue keeps track of pending jobs off the main queue. A job named A is marked as pending when it returns a result that says A is blocked on another job B. A is taken off the main queue. Only once B is completed, the A job is added back to the main queue. This scheme has the property that if jobs are in a cycle, they will all be marked as pending and taken off from the main queue. The cycle detection becomes trivial. When the main job queue becomes empty, it either:

finished executing all jobs (all completers are done) in case there're no jobs marked as pending
it found a cycle in case there're jobs marked as pending

In the example above with the A, B, C classes we would have:

step 1: A B C # A marked as pending
step 2: B C   # B marked as pending
step 3: C     # C marked as pending
step 4:       # empty queue, pending jobs exist => cycle detected

If some blocking job does complete, its pending jobs need to be released back to the main queue. I implemented this by maintaing an auxiliary queue associated with each job. I also have a global count of all pending jobs that is trivial to implement and makes checking for a cycle simple. This is not merely a performance optimization: I don't have a registry of all auxiliary queues so it would be hard to find out if there're any pending jobs left.

I drew a picture that illustrates execution of jobs A, ... D with some dependencies. It shows how auxiliary queues work. The main queue is marked with double underline and an auxiliary queue is marked with a single underline. Here's the picture:

kentucky mule completers queue

The execution is in 8 steps. In each step, we take a completer from the head of the queue and run it. If it returns a result signaling it's blocked on another completer, we place it in that completer's auxiliary queue. E.g. in the first step the completer A is placed in completer B's auxiliary queue. In the forth step, we first time succeed in completing a completer. The completer D is marked with a squiggly line.

The exciting thing about this design is that we check for cycles only once during the whole execution of the job queue: when the main job queue becomes empty. Moreover, there's no dependency in the execution order of tasks in the main queue. This design permits the jobs to run in parallel at ~zero cost precisely because the cycle detection criterion is so simple. In Kentucky Mule all heavy works is done through completers queue that now can be executed with a high degree of parallelism.

Now here's the kick. I came up with Kentucky Mule with an idea that I can have a very efficient sequential phase in the compiler that unlocks parallelism in other phases. I draw a picture of Kentucky Mule as a one box in my earlier blog post. The new design of completers queue breaks down Kentucky Mule into many parallel boxes and pushes parallelism of the overall design to a place I didn't expect to reach. Once I realized that, I had to go for a run to channel the thrill of the discovery. It was that good.

My redesign abstracted away completers from the queue by introduction of QueueJob interface (a bit simplified compared to the implementation in the code):

trait QueueJob {
  def complete(): JobResult
  def isCompleted: Boolean
  val queueStore: QueueJobStore
}

The job returns a result that is an ADT:

sealed abstract class JobResult
case class CompleteResult(spawnedJobs: Seq[QueueJob]) extends JobResult
case class IncompleteResult(blockingJob: QueueJob) extends JobResult

The spawnedJobs is a bit of a technical detail not essential for understanding how the queue is implemented. Check the code comments for why spawnedJobs are there.

Each job has the QueueJobStore associated with it and that's where the queue stores pending completers blocked on a given job.

Let's now look at the performance cost of the additional bookkeeping necessary to implement this queue algorithm.

The performance before job queue work:

# 6de307222a49c65fa149e4261018e2a167a485d5
[info] Benchmark                            Mode  Cnt     Score    Error  Units
[info] BenchmarkScalap.completeMemberSigs  thrpt  120  1927.406 ± 13.310  ops/s

with all job queue changes:

# ab6dd8280ac9cade0f87893924f884e46d8a28dc
[info] Benchmark                            Mode  Cnt     Score   Error  Units
[info] BenchmarkScalap.completeMemberSigs  thrpt  120  1821.394 ± 9.491  ops/s

I lost ~110ops which is 5.5% of performance. This is actually really good result for such a substantial piece of work like major queue refactor and cycle detection algorithm. And let's not forget that 1820ops/s corresponds to over 3.6 million lines of scalap code per second.

Thanks to Filip Wolski for discussions and inspiration on how to approach the problem of cycle detection in a graph computed online.

gkossakowski commented 6 years ago

Today I'm writing a note on why package objects with inheritance are ill-defined in Scala and why this is a headache. I describe a workaround that I implemented in late January 2018. Over two months later, I still have a feeling of surrender to a feature Scala shouldn't have.

To understand the problem with package objects with inheritance, let's consider this example:

package foo {
  package bar {
    class A {
      class B
    }
  }
}
package foo {
  package object bar extends A {
    class C
  }
}

The problem here is that package object bar both contributes members to and consumes members from the package foo at the same time. The class A is consumed (as a parent) by the package object and then the same class contributes members to the package in which it is defined. This is ill-defined.

My original implementation of package object support ruled out this scenario with a cyclic error: package bar was blocked on the completion of the type of package object bar and the package object's type was blocked on being able to lookup in the package bar. However, the pattern of inheriting from a class defined in the same package is widely used in Scala, including Scala's standard library. I bet most Scala users defining a package object like in the example above do not realize they're stepping on an ill-defined language feature. Nonetheless, the code like this is written and I had to come up with some workaround.

I resorted to implementing a "stripped down" version of member lookup in a package when the request comes from a package object. The stripped down version only looks at declarations in the package and bypasses the need for completing the package's type. Specifically, only members declared syntactically within package declaration are considered in the lookup and package object is skipped. This unbreaks the tie I described above and supports the common Scala's coding pattern involving package objects.

I wasn't sure how exactly to implement this special case without sacrificing code structure or performance. I ended up passing an instance of a special package lookup class specifically to a completer of a package object. This achieves two goals at the same time:

only lookup from package objects is special cased in the implementation
it's easy to see that other lookups are unaffected at runtime and the overall performance remains intact

The details of the implementation are in my commit ad1999dd00a5992181bdfb4ab9d0b7c6dee9d437. I'm happy I managed to retreat from the minefield of package objects with parent types. I'm convinced that this trap shouldn't exist in Scala in the first place, though.

soc commented 6 years ago

Your change looks pretty nice and localized in how it deals with a tricky problem. :-)

I bet most Scala users defining a package object like in the example above do not realize they're stepping on a minefield of broken language semantics.

Agree on that. I think the weird syntax for defining package object is largely to blame. I still can't remember how Scala even ended up with this weird syntax, given that instead of

package foo
package object bar { ... }

it simply could have been

package foo.bar
object package { ... }

This would have

been way more consistent in terms of language design
been more intuitive for users to figure out where the source file needs to be placed
had a much smaller language footprint
made the issue way more obvious

Do you remember any details how the current syntax came to be?

adriaanm commented 6 years ago

I'm convinced that this trap shouldn't exist in Scala in the first place, though.

Me too: https://github.com/scala/scala-dev/issues/441

andyscott commented 6 years ago

object package { ... }

is valid and works as desired if you wrap package in back ticks:

object `package` { ... }

@gkossakowski's example can be rewritten as:

package foo {
  package bar {
    class A {
      class B
    }
    object `package` extends A {
      class C
    }
  }
}

gkossakowski commented 6 years ago

@andyscott yes, syntactically, you can bring more clarity. scoping remains icky: object package both queries the outer scope (package bar) and contributes to that scope; combined with inheritance, you get a very strange structure

gkossakowski commented 6 years ago

(i've rewritten my note to be a little bit more scoped) originally I said:

I bet most Scala users defining a package object like in the example above do not realize they're stepping on a minefield of broken language semantics.

but i changed the wording to be:

I bet most Scala users defining a package object like in the example above do not realize they're stepping on an ill-defined language feature.

@soc i don't remember the history of package object introduction, i think they were introduced in Scala 2.8 together with the redesigned scala collections and I think it was clear already at the time that their semantics need more work to be nailed

I'm glad to hear the consensus is to deprecate inheritance in package objects

gkossakowski commented 6 years ago

On April 8, 2018 I pushed changes that make the handling of imports more lazy and I'm writing a note on why. I originally thought that imports can be "bundled" together and resolved in one tight loop with a great performance.

Let's consider this example:

import A.X1
import A.X2
import A.X3

class B extends X2

object A {
  class X1
  class X2
  class X3
}

The previous strategy of handling imports would resolve them upon first import lookup, e.g. for the type X2. All imports would be resolved in one go, top-to-bottom. The reason why all imports were resolved is that the next import could depend on what the previous one imported and my argument was that it's cheaper to simply resolve them all in a tight loop than do anything more complicated. Trying to carefully track dependencies and introduce laziness seemed like unnecessary and potentially harmful to performance due to the cost of the additional bookkeeping. My argument was that most imports will need to be resolved at some point so laziness doesn't add any value.

However, I forgot about the scenario like this that's allowed in Scala:

package a {
  package b {
    // this wildcard import is ok: it forces completion of `C` that is not dependent on `A` below
    import C._
    // import A._ causes a cyclic error also in scalac; only importing a specific name
    // avoids the cycle
    import A.X
    object A extends B {
      class X
    }
    object C
  }
  class B
}

that is: we have forward imports in Scala. Before changing the handling of imports, the import A.X would fail due to following cycle:

import A.X needs to query members of the object A so that object has to be completed
during the completion of the object A, we need to resolve the type B and for that we need to resolve all imports in scope

The patch I pushed today introduces an ImportSymbol on which I can hang a type completer. This makes each import node to be lazily completed when necessary.

The new logic won't consider all imports but only imports possibly matching. The possibly matching imports are the ones with a selector matching the looked up identifier's name.

With this change, the lookup for the type B doesn't trigger the completion of the import A.X because the name B is not matched by the selector X.

This concludes my initial musings on whether imports need to be handled in a lazy manner. They do. The surprising part is that an additional laziness and the bookkeeping associated with it introduced a performance boost.

Before the change:

[info] # Run complete. Total time: 00:04:08
[info] Benchmark                            Mode  Cnt     Score   Error  Units
[info] BenchmarkScalap.completeMemberSigs  thrpt  120  1736.439 ± 8.539  ops/s

and after:

[info] # Run complete. Total time: 00:04:08
[info] Benchmark                            Mode  Cnt     Score   Error  Units
[info] BenchmarkScalap.completeMemberSigs  thrpt  120  1864.463 ± 8.079  ops/s

I don't have a good intuition why. Maybe some imports were not necessary to be resolved at all (dead code) in such a simple code base like scalap?

gkossakowski commented 6 years ago

Today I'm sharing my older note from March 20th, 2018 on collapsing inherited members (or not). I keep this note also in my notes.md file.

Kentucky Mule copies members from parent classes. It does it for two reasons:

To experiment with a faster lookup
To implement a crude version of As Seen From (discussed in notes.md)

I'm going to write a short note on the 1. The idea was to collapse member lookup of a class into a single hashtable lookup. The class can have multiple parents that all contribute declarations. I collapse member lookup by copying members from all parents into a single hashtable.

I want to highlight that this strategy leads to an O(n^2) growth of members in a chain of inheritance. Let's consider an example:

class A1 { def a1: Int }
class A2 extends A1 { def a2: Int }
class A3 extends A2 { def a3: Int }
...
class A10 extends A9 { def a10: Int }

The class A10 has ten members. It transitively collected declarations a9 ... a1. In this simple example the problem doesn't look too bad but O(n^2) should be always a worrisome behavior. While testing Kentucky Mule against the Scala's Standard Library, I noticed that dependency extraction and cycle collapse both take a long time: several seconds. The dependency extraction (the simple symbol table traversal) was taking longer than completing all the types. The problem was with the number of symbols traversed: over 3.5 million.

The strategy of collapsing all members into a single hashtable was motivated by the observation that the write to a hash table is done just once but lookups are performed many more times. My logic was that traversing all parents and asking them for members might be too expensive for every lookup.

Switching between strategies was easy. The benchmarks show that there's almost no drop in completer's performance.

Collapsed members:

[info] Result "kentuckymule.bench.BenchmarkScalaLib.completeMemberSigs":
[info]   6.052 ±(99.9%) 0.109 ops/s [Average]
[info]   (min, avg, max) = (5.560, 6.052, 6.546), stdev = 0.243
[info]   CI (99.9%): [5.943, 6.161] (assumes normal distribution)
[info] # Run complete. Total time: 00:02:21
[info] Benchmark                              Mode  Cnt  Score   Error  Units
[info] BenchmarkScalaLib.completeMemberSigs  thrpt   60  6.052 ± 0.109  ops/s

Traversing the parents for a lookup:

[info] Result "kentuckymule.bench.BenchmarkScalaLib.completeMemberSigs":
[info]   5.935 ±(99.9%) 0.034 ops/s [Average]
[info]   (min, avg, max) = (5.763, 5.935, 6.113), stdev = 0.075
[info]   CI (99.9%): [5.902, 5.969] (assumes normal distribution)
[info] # Run complete. Total time: 00:02:15
[info] Benchmark                              Mode  Cnt  Score   Error  Units
[info] BenchmarkScalaLib.completeMemberSigs  thrpt   60  5.935 ± 0.034  ops/s

With no noticeable difference in performance, I'm going to change member lookup to traversing the parents.