Slow memory leak in Dotty's parser

Blaisorblade commented 8 years ago

Splitting out of #1527 to give a clearer summary.

Dotty interns identifiers and never releases them, even if you drop the compiler instance, because the hashtable used for interning is stored in a static field. While this only leaks memory slowly, I'm not sure that's OK (though @odersky is not convinced this is a real problem). So if one wants to avoid this leak in Ensime/ScalaIDE, it seems one would have to load Dotty in a separate classloader.

I'm trying to anticipate whether this is indeed a problem, so pinging @fommil and @sschaef:

Is that indeed a false problem for Ensime/ScalaIDE?
If the problem is there, is loading Dotty in a separate classloader (and discarding it on different projects) acceptable for Ensime/ScalaIDE? Is that something they do today with Scalac?

Blaisorblade commented 8 years ago

For completeness, let me quote @odersky's response here too:

As long as you compile the same files, the nametable will not grow, since everything is hashed. In my experience, even if you compile different files, the name table grows very slowly. There are simply not so many different names to deal with. So, it could become a problem at some point, but right now I am not seeing it.

fommil commented 8 years ago

@Blaisorblade the reason I bring up https://github.com/fommil/class-monkey is because it is not possible to "discard" a classloader that uses scalac at present because of several locations where files are not closed correctly (fundamentally JarFile's finalizer is never called). Hopefully this is remedied in dotty. I have several workarounds for scalac, including an additional monkey patch in ensime itself https://github.com/ensime/ensime-server/blob/2.0/monkeys/src/main/scala-2.11/scala/reflect/io/ZipArchive.scala

smarter commented 8 years ago

@fommil Right now dotty relies on scala.reflect.io though we'd love to get rid of that dependency.

Blaisorblade commented 8 years ago

@fommil OK, I see that's relevant. Is that the only obstacle or is there more?

fommil commented 8 years ago

that's probably the biggest memory leak that I'm aware of with scalac. Another memory leak we've observed is when the PC just doesn't return and we have to start a new one, but the old one never dies (eating CPU and memory). Having the ability to really demand "look, here, just stop what you're doing old chap" would be fantastic.

Blaisorblade commented 8 years ago

Sounds like you'd like to "isolate" Scalac more. I feared it would be hard.* If we do that, there are fewer reasons to fix the slow leak discussed in this issue.

*EDIT: I mean: I feared that encapsulating the compiler from outside would be an annoyance for clients, in which case on Dotty would have to fix all memory leaks instead of relying on "compiler encapsulation". Relying on encapsulation would reduce non-functional requirements on Dotty (a bit à la Erlang).

kiritsuku commented 8 years ago

I would like to avoid leaking memory, even if it is only a little bit. It simply makes up-scaling of the compilation process more difficult. We already have problems with scalac because some resources are not correctly handled/freed. Especially code generation in our test suite is a pain because we have to ensure uniqueness of types/symbols manually. Resetting an existing scalac instance is simply not possible. Btw, we generate thousands of compilation units in tests, therefore even a small leak may have a larger impact.

Because of complexity overhead, we often don't reuse existing instances of scalac but recreate them. This adds some startup overhead though. However, if dotc is able to minimize the startup overhead, it can leak as much memory as it wants because then we can simply trash it whenever we want. But I'm not sure to which extend it is possible to minimize the startup overhead. The largest problem is that for large projects a lot of dependencies may need to be recompiled.

Blaisorblade commented 8 years ago

@sschaef Thanks—I don't claim to have a full overview of Dotty, but I'd guess this is better fixed earlier than later. But @odersky argued fixing this is not easy, so a few questions:

Btw, we generate thousands of compilation units in tests, therefore even a small leak may have a larger impact.

This leak is from interning names—Dotty stores forever (in an object field) all the unique names it ever saw, so reusing names won't cause leaks.

it can leak as much memory as it wants because then we can simply trash it whenever we want

That won't help with the leak under discussion, since the leaked memory is stored in an object field. That's why I talk about discarding classloaders—unloading a class is the only way to discard static members. (Storing this table in Context is also apparently a problem).

retronym commented 8 years ago

Discarding state by throwing away classloaders is problematic for performance, as you also throw away all the JITted code.

retronym commented 8 years ago

A pragmatic solution here would be to add a clear method to the name table. Obviously that would be thread-unsafe, and could only be used when no compilers are active.

@Blaisorblade What did you mean by:

Storing this table in Context is also apparently a problem

For large multi-project builds, one problem is that you end up with lots of duplication between the name tables of the compilers for each sub-project. Name table sharing would be worthwhile in those cases. This could be still be achieved without a global name table by passing in an existing name table when creating the root context for a new compiler.

Blaisorblade commented 8 years ago

A pragmatic solution here would be to add a clear method to the name table. Obviously that would be thread-unsafe, and could only be used when no compilers are active.

👍 💯 Important nitpick: "No active compiler" doesn't make it safe, but that'd be fixed by memory barriers to prevent reorderings by CPUs/caches/compilers (say, creating a happens-before edge via writes/reads to volatiles—clear would need to write to one, while creating a new compiler would have to read the same volatile). Is this worth mentioning or it is obvious here?

What did you mean by:

Storing this table in Context is also apparently a problem

I just referred (badly?) to Odersky's rebuttal of the proposal in the parent thread: https://github.com/lampepfl/dotty/issues/1527#issuecomment-253164549

Can't we just move the relevant hashtable to whatever the equivalent of Global is?

No, names have conceptually larger scope than a single global or, in the case of dotty, root context). I [tried] that early on in the dotty design and it became a mess quickly.

Blaisorblade commented 8 years ago

A pragmatic solution here would be to add a clear method to the name table.

Forgot: Do we probably want a more generic API for clients, in case it later needs to clear other object fields?

fommil commented 8 years ago

Interning of symbol tables sounds very reasonable to me. We would never parse an entire project in ensime, just the sources the user is interested in.

Well pointed out on JIT. If we can avoid having to use a classloader to use dottypc that would be good. If it used classloaders internally to just load resources, that's fine but please remember to close them after.

Somewhat related is the functionality provided by scalap. With scalap we really do scan the entire classpath, every classfile, and if that were to introduce a memory leak we'd be in huge trouble. Will scalap continue our will there be a replacement for it? We only use a small part of scalap and it might be easier if you just wrote something from scratch.

smarter commented 8 years ago

scalap can't be reused since pickling is completely different in dotty, you can get similar information (and much more) from unpickling tasty, though there's no public API to do this currently.

fommil commented 8 years ago

Ok, good we caught this early, me definitely need that functionality to build or index. Should I create a ticket or will you? Happy to do a walkthrough of our usecase.

odersky commented 8 years ago

I think it's possible to move the name table to the root context of a project (i.e. move it to class ContextState). We'd lose the ability to talk about a name independently of a compiler instance, but we'd prevent possible memory leaks for good. Note that once we start clearing a static NameTable we lose the ability to use names outside of compiler instances anyway.

On Thu, Oct 13, 2016 at 7:12 AM, Sam Halliday notifications@github.com wrote:

Ok, good we caught this early, me definitely need that functionality to build or index. Should I create a ticket or will you? Happy to do a walkthrough of our usecase.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lampepfl/dotty/issues/1584#issuecomment-253416070, or mute the thread https://github.com/notifications/unsubscribe-auth/AAwlVrjPyjjTBJEflvofBNGjhJInJ1urks5qzb2pgaJpZM4KUkE7 . {"api_version":"1.0","publisher":{"api_key":" 05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity": {"external_key":"github/lampepfl/dotty","title":" lampepfl/dotty","subtitle":"GitHub repository","main_image_url":" https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88- 11e6-95fc-7290892c7bb5.png","avatar_image_url":"https:// cloud.githubusercontent.com/assets/143418/15842166/ 7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in GitHub","url":"https://github.com/lampepfl/dotty"}}," updates":{"snippets":[{"icon":"PERSON","message":"@fommil in #1584: Ok, good we caught this early, me definitely need that functionality to build or index. Should I create a ticket or will you? Happy to do a walkthrough of our usecase."}],"action":{"name":"View Issue","url":"https://github. com/lampepfl/dotty/issues/1584#issuecomment-253416070"}}}

Prof. Martin Odersky LAMP/IC, EPFL

smarter commented 8 years ago

Ok, good we caught this early, me definitely need that functionality to build or index. Should I create a ticket or will you? Happy to do a walkthrough of our usecase.

A walkthrough would be interesting. Note that when I say "public API" I really mean "external API that we plan to support", but there's no external API for anything in dotty yet (except the very limited https://github.com/lampepfl/dotty/tree/master/interfaces), for now using the compiler internal APIs is fine. There's no good usage examples now but I'm currently experimenting with using tasty to provide some IDE features such as jump to definition and (maybe) renaming, I'll keep you up to date.

fommil commented 8 years ago

@odersky for ensime it might be useful to have a shared name table. Consider the case where we are spawning new presentationcompiler instances fairly regularly (e.g. one per sub-project, and for the project's tests, and maybe fresh PCs for refactoring operations).

fommil commented 8 years ago

@smarter what do you mean by "jump to definition"? you'd need to understand where source jars are stored to do this, and also java. It seems like this is an area where what core dotty can do should be limited to avoid duplicated efforts in ensime.

smarter commented 8 years ago

@fommil I'm ignoring java and outside projects, we store the source file path in Tasty so we should have the information we need, I'm not trying to duplicate ensime but rather to evaluate how various parts of dotty (including tasty, which was not explicitly designed for interactive use) can be used in IDEs. I'll have code and a write-up in a few days.

fommil commented 8 years ago

@smarter I'm not sure you have everything you need in there, you won't know the location of the jar or the source directory root. If what you are providing is (relative path from source root, source file name, position in file) then that's perfect. The current classfiles only give (source file name, line number) for java FQNs, so this would be a good upgrade. It's important that you store only the relative path from the root, not the full path. (this will clean up some of the nasty workarounds we need in ENSIME! https://github.com/ensime/ensime-server/blob/2.0/core/src/main/scala/org/ensime/indexer/SourceResolver.scala)

fommil commented 8 years ago

also, this could be awesome for debugging if this information is available at runtime by converting source/line to specific blocks of code! e.g. imagine being able to "step into lambda"

smarter commented 8 years ago

I agree that it should be relative, not sure if it is right now, which is exactly why I'm testing these features :). In any case this is very much off-topic, we can chat more on the dotty gitter if you want.

Blaisorblade commented 6 years ago

For anybody caring: the static table is still there under dotty.tools.dotc.core.Names.table. https://github.com/lampepfl/dotty/blob/e5ff11c111e0051061e3a4edcd9334a9ab76b4ff/compiler/src/dotty/tools/dotc/core/Names.scala#L542

scala / scala3

Slow memory leak in Dotty's parser #1584