puniverse / quasar

Fibers, Channels and Actors for the JVM
http://docs.paralleluniverse.co/quasar/
Other
4.56k stars 574 forks source link

Allow ThreadLocal manipulation/serialisation to be disabled #158

Closed mikehearn closed 8 years ago

mikehearn commented 8 years ago

So far this feature is a massive net negative, at least for me. Things put in ThreadLocal's are invariably not designed to be serialised, it involves poking around in JVM internals and the JVM treats ThreadLocal's specially - back when I was hitting serialisation issues the JVM would segfault due to the use of Unsafe.

Overall, I'd rather have a flag to disable this and have Quasar leave TLS slots alone.

pron commented 8 years ago

In general, I'm not too fond of providing flags for things that modify core behavior, but you may be right that we should simply not try to serialize TLs. My question is this, though: what's wrong with a best-effort attempt to serialize TLs, and not serializing those that fail serialization? From what I can tell (it's been a while since I touched this code) the current implementation tries to decide in advance whether the TL can be serialized or not. This, indeed, seems wrong, and maybe we should just try to serialize and give up if any exceptions are thrown (similarly in deserialization). I'm not too fond of swallowing exceptions, either, but this may be the right approach in this case.

So far, you've had the most experience with this feature. What do you think of a best-effort approach, and where do you see it failing?

mikehearn commented 8 years ago

Well, serialisation has been a pain it's true, but I'm also a bit concerned at the poking of the insides via Unsafe. TLS slots are performance sensitive and likely to change as the JVM is optimised. As none of my fibers care about TLS I'd rather trade it off for increased robustness and compatibility.

The current serialisation issue with TLS is actually ThreadLocal subclasses, so the existing code that tries to decide whether to serialise the contents or not doesn't work, as it's the ThreadLocal object itself that causes the problems.

pron commented 8 years ago

Serialization aside, fibers must support TLs or too many things just won't work (worse -- they'll act very strangely b/c the TLs value would change every time a fiber executes on a different thread). Supporting TLs means using Unsafe (at least for now...)

mikehearn commented 8 years ago

What breaks if TLS swapping is disabled?

pron commented 8 years ago

Plenty. People run JAX-RS and servlets on fibers. Those often rely on TLs. Also, people use Clojure's dynamic bindings. However, the experimental Continuation class (not in the master branch), which we intend to merge along with the support for automatic instrumentation planned for Java 9, already has this option (the detached argument). The reason for that is that not all continuations are (abstract) threads, but all fibers are, and so are expected to behave like threads.

mikehearn commented 8 years ago

Ah ha! That sounds like the right abstraction for me (at the moment at least). Great, well, things are working for now (sorta .... I had to just disable the errant feature when run inside gradle), so perhaps I'll wait until Continuation is available and then switch to that.

pron commented 8 years ago

If you're curious as to why we're waiting with continuations, the reason is that 99% of the cases where people want a continuation, what they really want is a fiber, and when the more primitive continuation is really necessary, it is usually for some advanced stuff that would be annoying to do with manual annotations. For a complete explanation of the theory, read/watch this.

mikehearn commented 8 years ago

Sure, I do understand. The way I'm using fibers/continuations is a bit unusual, as you know (it's really about the programming model rather than performance), so for now I'm trying to simplify as much as possible.

stepancheg commented 8 years ago

+1 for the issue.

I'm currently trying to deal with huge memory leak due to thread-local handling in quasar.

I have an instance of non-thread-safe structure in thread-locals (per-thread cache which is not thread-safe). Quasar moves that data structure to fiber. I have hundreds of thousands of fibers. So I have hundreds of thousands of such structures (~50K each) instead of just 16.

I think, quasar could have one of or both:

BTW, is there any quick workaround for the issue?

pron commented 8 years ago

So you want to keep the TL per-thread? Take a look at TrueThreadLocal. It's specifically designed for striped data structures.

stepancheg commented 8 years ago

Thanks for the TrueThreadLocal. Although, it is partial solution of the problem, because code that does caching is in the library which does not depend on quasar.

pron commented 8 years ago

The best solution for striped data structures is CoreLocal/ProcessorLocal which has been discussed by Doug Lea, I think, for some time, but I don't know if and when it is coming.

The problem is that a ThreadLocal and a core-local object used for striping are supposed to have very different semantics, and it's impossible to tell which is the one you want. The best solution, I think, is to let Quasar know about certain ThreadLocal subclasses which it should replace with TrueThreadLocal.

mikehearn commented 8 years ago

Another appeal for the ability to disable TLS serialisation. I'm going to fork the library to add this now, because today we hit yet another problem caused by this feature and it's just way past the cost/benefit ratio for our use case. I simply do not want my fiber to capture things I am not aware of - ever - and whilst reasoning about things on the stack and things reachable from the stack is not that hard, reasoning about what random libraries or things inside the JDK might have stashed in TLS slots is extremely hard.

This time apparently it was possible to serialise a sun.nio.ch.Util$1 class but not deserialise it, for some reason. The actual error calls the class "2222sun.nio.ch.Util$1" .... so perhaps some kind of stream corruption, but it happens when serialising fiberLocals.

pron commented 8 years ago

If you add a flag that is only used in the serialization-related methods, I have no problem with merging it (as long as the default is to serialize TLs; as I said, we'll do more work to make this more robust, and other languages -- like Clojure -- rely on TLs quite a lot).

mikehearn commented 8 years ago

OK, cool, thanks.

mikehearn commented 8 years ago

This has been done and merged

stepancheg commented 8 years ago

The issue is only partially fixed: it is not possible to tell quasar to preserve thread-locals when switching fibers.

mikehearn commented 8 years ago

This issue was specifically about serialisation. I guess a separate issue would need to be filed for tighter control over thread local handling.

pron commented 8 years ago

@stepancheg Could you explain exactly what problem you are facing?

mikehearn commented 8 years ago

I think his issue is the same as he described above, no? He wants "true" thread locals to avoid needless loss of cache efficiency.

pron commented 8 years ago

@stepancheg Are you familiar with TrueThreadLocal?

stepancheg commented 8 years ago

@pron TrueTreadLocal is not always a suitable solution, because it requires for all libraries (which use ThreadLocal) to be dependent on quasar.

And even if thread-locals are meant to be per-fiber, for me it is safer to get incorrect program (if some library depend on thread-locals) than deal with hard to find memory leaks.

pron commented 8 years ago

I understand. The problem is that I am hesitant to add an option requesting what is essentially incorrect behavior. More fine-grained control over TLs and scheduling would be possible when we offer continuations for Java 9. The workaround is to create a Fiber subclass that overrides onPark and onResume, and then use reflection to call install/restoreFiberLocals to undo the fiber's default behavior.

stepancheg commented 8 years ago

The problem is that I am hesitant to add an option requesting what is essentially incorrect behavior.

I think it is not big deal, when the option if off by default, but I got your point.

The workaround is to create a Fiber subclass that overrides onPark and onResume, and then use reflection to call install/restoreFiberLocals to undo the fiber's default behavior.

Dependence on private API is a shot in the foot. It causes headaches for future developers.

More fine-grained control over TLs and scheduling would be possible when we offer continuations for Java 9.

How is it related to Java 9? AFAIU, only new Java 9 feature related to quasar is stack-walking API.

pron commented 8 years ago

How is it related to Java 9?

From our experiments, directly using continuations would require lots of suspendable annotations, and so we've decided to properly add them only when annotations would no longer be necessary (thanks to the stack-walking API). Unlike fibers, continuations aren't threads, and so their API allows for choosing whether or not to manage thread-locals.

stepancheg commented 8 years ago

The workaround is to create a Fiber subclass that overrides onPark and onResume, and then use reflection to call install/restoreFiberLocals to undo the fiber's default behavior.

Could you at least please make these functions protected?

Current thread-local handling in fibers is memory-leak minefields.

pron commented 8 years ago

What's the problem with calling them reflectively? I don't want to add methods to the API until there's a more permanent solution.

stepancheg commented 8 years ago

What's the problem with calling them reflectively?

Because reflective code will break on some newer version of quasar.

pron commented 8 years ago

I understand, but I don't want to expose those methods precisely because we may want them to go away when there's a better solution, so making them protected would be worse...

stepancheg commented 8 years ago

Just found another leak. This time because of async-http-client: https://github.com/AsyncHttpClient/async-http-client/blob/master/client/src/main/java/org/asynchttpclient/util/StringUtils.java#L21

I have no cases where I do need per-fiber locals, but I have a dozen of cases where hacked thread-locals cause excessive memory usage.

Anyway, I end up redeclaring Fiber subclass in co.paralleluniverse.fibers package (java 1.8 still allows it), and overriding installFiberLocals and restoreThreadLocals to do nothing, and it seems to work as I expect. May I suggest to declare these two functions protected (probably with modified signatures to accept thread-local-map objects) instead of package-private until you find a better solution?

mikehearn commented 8 years ago

How about this - mark them as protected but deprecated. People who use them get a warning. The users are happy. The API is not committed to. They can be made private again when a better solution exists in Java 9. Win/win.

stepancheg commented 8 years ago

How about this - mark them as protected but deprecated ... They can be made private again when a better solution exists in Java 9.

I think that's OK if quasar developers are committed to preserving this feature in future versions of quasar with (maybe) different API.

pron commented 8 years ago

Here's the thing. When you use fibers (as opposed to continuations), there is never reason not to want TLs, except when a particular TL is actually used not for thread-specific data but as a striping mechanism (a ProcessorLocal, really). The workaround I described would only work if you know for certain that all of the TLs are like that. It is certain to break the semantics of arbitrray code.

Any solution -- even an inconvenient one -- must differentiate between the two different desired semantics of TLs. I see no point in marking as depracated methods that are certain to break code. I am, however, happy to consider a hook that lets you examine TLs one by one and decide if you want to to keep it on the thread or in the fiber.

stepancheg commented 8 years ago

When you use fibers (as opposed to continuations), there is never reason not to want TLs, except when a particular TL is actually used not for thread-specific data

In practice there is a huge reason not to use thread-locals in a way it works in quasar. Library developers tend to store large amount of data in thread-locals assuming that there are only few threads and total memory overhead would be small.

I see no point in marking as depracated methods that are certain to break code.

The point is that current implementation of fibers does not allow to use certain libraries without loosing a lot of memory (with inherent GC overhead) due to data stored in thread-locals. When you have millions of fibers that becomes a problem.

I am, however, happy to consider a hook that lets you examine TLs one by one and decide if you want to to keep it on the thread or in the fiber.

I think that would be perfect.

BTW, at the moment I'd suggest to add StrandLocal class which is an opposite of TrueThreadLocal, which should work even if fiber-locals based on ThreadLocal are not saved-restored.

pron commented 8 years ago

Library developers tend to store large amount of data in thread-locals assuming that there are only few threads and total memory overhead would be small.

I understand. But code that relies on TLs is 100% sure to break if they are not supported correctly. The only case when it won't break is when a TL is used as a "processor-local" for striping, and this is the minority of cases.

The point is that current implementation of fibers does not allow to use certain libraries without loosing a lot of memory (with inherent GC overhead) due to data stored in thread-locals.

Again, code that uses TLs cannot work at all if they're not properly supported.

BTW, at the moment I'd suggest to add StrandLocal class which is an opposite of TrueThreadLocal, which should work even if fiber-locals based on ThreadLocal are not saved-restored.

The only sensible implementation that allows code to work at all is copying thread locals by default. There will not be an option to not save TLs by default in a fiber (though maybe in a continuation), as there is no possible way this could ever work. Use of TLs for striping is the exception, not the rule.