playframework / playframework

The Community Maintained High Velocity Web Framework For Java and Scala.
http://www.playframework.com
Apache License 2.0
12.56k stars 4.1k forks source link

[2.6.3] NumberFormatExceptions on every incoming query #7812

Closed fommil closed 6 years ago

fommil commented 7 years ago

Upgrading to sbt-play 2.6.3

I'm seeing exceptions with every request when using the netty backend (not confirmed with akka-httop), I isolated the problem to these two config parameters: https://www.playframework.com/documentation/2.6.x/ScalaBodyParsers#Max-content-length

the config parser is trying to load the defaults (128KB and 10MB) as Long, then Double, then finally as the custom format. This causes two guaranteed exceptions, every query.

The real fix is to fix the parser, the hack is to change the defaults to

play.http.parser.maxMemoryBuffer = 131072
play.http.parser.maxDiskBuffer = 10526720
cchantep commented 7 years ago

Please paste the stacktrace

fommil commented 7 years ago

I don't have it anymore. I fixed the bug and moved on.

gmethvin commented 7 years ago

Do you mean an uncaught exception or simply an exception that is thrown and caught by the config parser? What did you modify to fix it?

Not knowing your app it's hard to know if you're running into an unusual edge case, but the HttpConfiguration should be created once when your app starts. Afterward there is no need to read the configuration again. I don't see how it could happen on every request.

fommil commented 7 years ago

Exceptions kill performance, that's what I care about. Fire up yourkit against any play app and watch the exceptions counts rise and rise.

gmethvin commented 7 years ago

I agree, exceptions kill performance and we should certainly avoid exceptions happening on every request.

I just tested with a very simple Play app running locally with yourkit. I'm not seeing NumberFormatExceptions on every request. Exception counts do not go up just from making an HTTP request, at least not with the Netty server.

With Akka HTTP, I actually do see two exceptions:

So perhaps there is something that can/should be done about those. But I think that would need to be fixed in Akka HTTP, not Play.

There are NumberFormatExceptions coming from typesafe config, but the configuration is only read once on every reload of the app, not on every request. Maybe you are seeing exceptions on subsequent requests because you're running in dev mode and have modified the code between the requests. In that case the application is being reloaded, so it's not an accurate representation of what you'd see in production.

fommil commented 7 years ago

we're not running dev mode, this is prod.

I've already moved on from the code that produced this, so I'm not able to reproduce the stacktrace for you, but from internal notes we were seeing the exceptions raised from HttpConfiguration specifically these lines

   maxMemoryBuffer = config.getDeprecated[ConfigMemorySize]("play.http.parser.maxMemoryBuffer", "parsers.text.maxLength")
          .toBytes.toInt,

https://github.com/playframework/playframework/blob/2.6.3/framework/src/play/src/main/scala/play/api/http/HttpConfiguration.scala#L191-L195

And, from memory, this was being triggered by a netty connection handler callback.

I was also surprised that this was happening on every connection. Perhaps it is related to pooling?

I didn't see the problem with akka-http

Probably the core problem is with very inefficient logic in typesafe-config, but the workaround I'm presenting here is a reasonable move for Play to make and the creation on every request is perhaps an indicator of a deeper problem.

fommil commented 7 years ago

ah, no, you're right about akka-http... I'm also seeing loads of NotEnoughDataException or RuleNotFoundException. Can't be good for performance... I'm seeing it for every request.

fommil commented 7 years ago

and also a lot of parboiled2 exceptions. These I can stacktrace, but I'll create a separate ticket for them.

emilypi commented 6 years ago

Just to bolster this thread, I'm seeing the same thing in Yourkit, and here are some screenshots. It looks like there's a typesafe config read on cache reload. This happens often enough to make an impact.

This is purely functional code, and the only exceptions being thrown are those thrown by either our tracing + metrics frameworks, or Play - Play being the vast majority.

jroper commented 6 years ago

Just to comment on the Akka HTTP exceptions - exceptions are only expensive to create and throw if they fill in the stack trace, it's filling the stack trace that is the expensive operation, otherwise, creating an exception is really cheap, and throwing one is also cheap. The Akka HTTP NotEnoughDataException and RuleNotFoundException both do not fill in the stack trace (and they are singletons anyway, so even if they did it would only be done once for the lifetime of the server), and so present no problem.

jroper commented 6 years ago

I think the most likely scenario here, assuming no one is running dev mode in production accidentally, is that the Play.currentApplication is never actually being set for some reason, but there's something in the Netty server backend that is depending on it, and so each time it goes to read the config from the latest application, it finds no application, and so falls back to loading the config from scratch, which would be incredibly expensive.

fommil commented 6 years ago

@jroper I think it's pretty obvious from the stack traces and performance regressions that we're talking about the slow kind here.

richdougherty commented 6 years ago

Thanks @fommil and @emilypi for reporting this. I've reproduced the issue with caching on the Netty backend.

I think this is an issue with the fact that the caching code uses reference equality to check if the application (wrapped in a Try) has changed. The reference equality check is good for efficiency, but we've made a mistake in the Netty backend and we're accidentally rewrapping the application in a new Success object partway through the request flow. This breaks the caching and results in the configuration being reloaded on every request (or even more than once per request). The Akka HTTP backend is not affected because it always accesses the cache with the same object.

I think the fix will be pretty easy. We can refactor the Netty flow so it preserves the same Try[Application] through the request handling. This can be done by changing the handleAction method so it accepts a Try[Application] parameter instead of an Option[Application]. This allows us to avoid converting from Try to Option and back to Try again, so we keep the same object.

https://github.com/playframework/playframework/blob/2.6.12/framework/src/play-netty-server/src/main/scala/play/core/server/netty/PlayRequestHandler.scala#L269-L274

The harder part will be to write a test for this so it doesn't happen again. Perhaps we can mock out the configuration loading code in a test so that every configuration load triggers a counter. That way we can test how many reloads happen when requests occur.

fommil commented 6 years ago

I've written some optimised "fast exit caching equals" methods in the past on domain objects, but I literally just archived the project with all my thoughts written down because I have no time to carry through with it as a hobby... if you care, open up the projects.json file at https://gitlab.com/fommil/attic/tree/master/stalagmite (which is a dump of the issue tracker) and look for "fasterEquals". TL;DR you could destructure your Try and perform identity equality (or even increase your scope to value equality) on a subset of the contents.

richdougherty commented 6 years ago

@fommil, in this case it would probably be fast enough to go with a normal equals method actually, since it would just be a check for Success then I believe Application will do a reference equality check anyway.

For this bug I think the main part of the work is writing a unit test to verify the caching behaviour. The way we choose to do equality is probably not too tricky to fix.

(Another thing to fix up is our performance regression testing infra - it would probably have caught this.)

https://gitlab.com/fommil/attic/tree/master/stalagmite

Cool project btw. :)

fommil commented 6 years ago

Thanks for the fix @richdougherty ... the true cause of the exception was much worse than I originally thought. My team had already migrated to akka-http, which is a small step in the direction I'd like to go, but I'm sure others will be interested in the netty fix.

richdougherty commented 6 years ago

This issue should be fixed now with the release of 2.6.13. The fix for master is not yet merged.

@fommil I saw some config reloading happening with the Akka HTTP backend during unit testing. I fixed that too along with the Netty reloading. On master where we have benchmarking availabile I also verified that there was a performance win for both Netty and Akka HTTP backends.

richdougherty commented 6 years ago

This is now fixed on master with #8335.