paketo-buildpacks / java-azure

A Cloud Native Buildpack with an order definition suitable for Java applications on Azure
Apache License 2.0

Memory-calculator blocks the application to run in container with memory limited to 512Mi #208

Open allxiao opened 3 years ago

allxiao commented 3 years ago

What happened?

If I run a CNI image built by the java-azure buildpacks for a Spring Boot application in a container with only 512Mi of memory, it fails in the memory-calculator step and panics.

Repro steps:

  1. build a CNI image from a Java application with the java-azure buildpacks
  2. run the CNI image in a container with 512Mi memory limit (e.g., Kubernetes pod)
  3. check the container status and logs

dmikusa commented 3 years ago

Sorry, I'm not sure I understand the problem you're reporting. The memory calculator is designed to estimate the memory needs for your application and ensure that the application can run correctly within the given memory limits. If memory limits are not sufficient for the application, it intentionally fails so that you know this and can adjust things accordingly. The behavior that would occur if you ignore the memory calculator is that your application would start and then fail at some unpredictable time because you exceed the memory limits in your container (and your container orchestrator will kill it). The fail fast/early behavior is intentional and by design.

> In our case, we host the user application so we are not able to guess the configurations such as BPL_JVM_HEAD_ROOM, BPL_JVM_LOADED_CLASS_COUNT and BPL_JVM_THREAD_COUNT. If users do have a requirement to customize the JVM configuration, they can specify it through the JAVA_OPTS environment variable.

Can you expand on what you're doing here? How do users deploy their apps? How are you running the application, but unable to know the correct memory settings to run the application? What good would it do a user to set -Xmx if you are setting a max memory limit for the container of 512M? At the end of the day, if you cap the max memory of the container to 512M, everything has to fit within that or the app is going to fail (if not at start up, it'll fail at some later point when it tries to get more memory and cannot).

allxiao commented 3 years ago

Hi @dmikusa-pivotal, we are integrating kpack in the Azure Spring Cloud service. @nebhale has more context about this. I created this item here for reference. We will discuss this issue in the sync-up next week.

In our case, we build the source for the user and host the applications for the user. So we do not know what memory layout a user would need, as that varies from case to case. It's the user's choice to pick the 512Mi container, and they are responsible for whether it runs or fails.

dmikusa commented 3 years ago

Sorry, I still don't understand why removing the memory calculator would be a good thing here. If the user is picking unsuitable values for the memory limit, I'd think you'd want to let them know that as soon as possible. Happy to chat more and try to understand your use case though.

allxiao commented 3 years ago

It's a customer ask to provide the option of a 512Mi memory limit. Some customer applications are relatively simple and they run well in small containers. It will reduce the overall cost of their microservice infrastructure.

As a platform provider, we offer the hosting environment for the user, and draw the boundary of responsibilities:

I think we need to offer the customer the flexibility to opt in to or out of some optional configuration, or to override it.

dmikusa commented 3 years ago

Thanks for explaining further.

The issue here in my mind is that allowing the user to opt out isn't really going to help. Even with an opt-out, your customers are going to pick the 512M limit, which is too low, the memory calculator is going to fail, and they are going to contact you. You could tell them to use a theoretical opt-out config setting. Alternatively, without making any change and without giving them the ability to run in a container that will eventually fail, you could instruct them to tune their memory settings to fit within the 512M limit. That's something they can do now, without changes, and it will not result in a situation where their container could fail down the road. It also allows the user to pick what trade-offs they make to run in the lower memory environment (for example, whether they reduce thread stack size or reduce code cache size).

I can appreciate not wanting to have the support call in the first place, but the only way that's going to happen is by changing the default for the memory calculator to either not run, or to perhaps run in a way where it will WARN when memory settings are not sufficient, rather than fail.

An option I could see working, without having to disable the calculator, would be a more detailed message when there is a failure. Right now it says "fixed memory regions require 631278K which is greater than 512M available for allocation: -XX:MaxDirectMemorySize=10M, -XX:MaxMetaspaceSize=119278K, -XX:ReservedCodeCacheSize=240M, -Xss1M * 250 threads", which is helpful. It tells you the max that your current settings require. It doesn't provide instructions on what to do next though. We could adjust that message to provide further guidance, like "You need to set $JAVA_TOOL_OPTIONS and adjust the JVM memory settings such that they fit within your container memory limit." I could possibly even see adding a feature where we allow a configurable doc link or KB link to be embedded in the message, so that we could point users directly to an article with even more instructions for how to proceed.

Thoughts?
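For reference, the numbers in the failure message quoted above can be checked by hand. The following is a minimal sketch in Python that only redoes the arithmetic from that message; it is not the memory calculator's code, and the variable names are made up for illustration.

```python
# Sizes are in KiB. Every value is copied from the failure message quoted
# in this comment; nothing here is read from a real JVM or buildpack.
KIB_PER_MIB = 1024

fixed_regions_kib = {
    "-XX:MaxDirectMemorySize=10M": 10 * KIB_PER_MIB,
    "-XX:MaxMetaspaceSize=119278K": 119278,
    "-XX:ReservedCodeCacheSize=240M": 240 * KIB_PER_MIB,
    "-Xss1M * 250 threads": 250 * 1 * KIB_PER_MIB,
}

container_limit_kib = 512 * KIB_PER_MIB

# Sum of the regions that are sized before any heap is handed out.
required_kib = sum(fixed_regions_kib.values())

# How far the fixed regions alone overshoot the container limit.
shortfall_kib = required_kib - container_limit_kib

print(f"fixed regions require {required_kib}K")
print(f"container limit is {container_limit_kib}K")
print(f"shortfall before any heap is allocated: {shortfall_kib}K")
```

The sum matches the 631278K in the message, and the thread stacks (250 threads at 1M each) are the largest single term, which is why lowering -Xss is one of the trade-offs mentioned above.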

allxiao commented 3 years ago

Here is feedback from a real customer that drove the feature to support 512Mi containers:

> We have ~20 microservice apps. Currently they run well in an 8-core, 16G space, as most of them deal with simple tasks which are not resource consuming. When we tried to move to Azure Spring Cloud, we had to choose a larger container size, which resulted in a higher cost than is actually required.

In development, customers also tend to choose the basic service tier and small containers to reduce cost.

(At the same time, we have customers deploying a large number of applications which require 4G ~ 8G of memory.)

If it fails with OOM, we show the error in the app logs and they know they need to tweak the memory settings, by enlarging the container memory limit, tweaking the JVM options, or optimizing their code.

We offer two types of deployments: from built JAR or from source built by kpack+buildpacks. We wanted to provide similar behavior for both types. However, it diverges for this 512Mi container:

I understand setting some "reasonable" configuration may help in some cases, however, it's difficult to judge the real user requirements. It should not be mandatory. If it's not going to make a good guess, it may be better to let the user decide what they want rather than forcing them to learn every aspect of the internal details and do it as required.

dmikusa commented 3 years ago

> If I need to do the same for deployments from source, it's not that easy. I see an error in the log saying that memory-calculator failed, but when I search in Google, it's not easy to find any support on this.

The memory calculator understands all of the standard JVM memory arguments (-Xmx, -Xms, -Xss, etc...). If a user sets these JVM arguments, the memory calculator backs off and only adjusts the remaining unset memory arguments. The memory calculator isn't forcing the user to do anything beyond properly addressing their memory concerns upfront, rather than waiting for OOMEs (or worse, for containers to be killed).
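To illustrate the "back off" behavior described above, here is a rough Python sketch. It is a hypothetical model, not the memory calculator's implementation: the function name `split_user_and_calculated` and the flag table are invented for this example, and the table only lists flags that appear in this thread (the real tool may recognize more).

```python
import re

# Memory flags that appear in this thread. This table is an assumption made
# for illustration; it is not taken from the memory calculator's source.
MEMORY_FLAG_PATTERNS = {
    "heap (-Xmx)": re.compile(r"^-Xmx\S+$"),
    "thread stack (-Xss)": re.compile(r"^-Xss\S+$"),
    "direct memory": re.compile(r"^-XX:MaxDirectMemorySize=\S+$"),
    "metaspace": re.compile(r"^-XX:MaxMetaspaceSize=\S+$"),
    "code cache": re.compile(r"^-XX:ReservedCodeCacheSize=\S+$"),
}


def split_user_and_calculated(java_tool_options):
    """Return (flags the user already set, regions left for the calculator)."""
    tokens = java_tool_options.split()
    user_set = {}
    for name, pattern in MEMORY_FLAG_PATTERNS.items():
        for token in tokens:
            if pattern.match(token):
                user_set[name] = token
    left_to_calculator = [n for n in MEMORY_FLAG_PATTERNS if n not in user_set]
    return user_set, left_to_calculator


# Hypothetical JAVA_TOOL_OPTIONS value: one memory flag plus an unrelated one.
user_set, remaining = split_user_and_calculated("-Xss256k -Dexample=1")
print("set by user:", user_set)
print("left to the calculator:", remaining)
```

In this model, setting only -Xss leaves the other four regions to be calculated, so a user does not have to supply every flag to change one of them.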

In the case of running within 512M of RAM, that likely means lowering the heap size (and accepting that your app can run with less heap space), lowering the thread stack size (and recognizing that you might end up with StackOverflowErrors), lowering the code cache size (and recognizing that this might impact performance, as the JVM will have less space to cache JIT'd code), or some combination of the three.

The way a user sets these standard JVM memory arguments is through $JAVA_TOOL_OPTIONS which is a standard environment variable that's supported by the JVM. It's not specific to memory calculator or buildpacks in any way. It is just a convenient way to set these values as opposed to passing them in as arguments (nothing is stopping you from passing them in as arguments, I believe that should work fine too).

The BPL_* env variables you mentioned are seldom required. They are for very specific use cases where you need to do something like telling the JVM to not use the entire container memory limit (that's headroom), or set the estimated thread count if you have an application that is consuming a very high number of threads. I don't think you'd ever set the class count, as that just reflects in the metaspace size which you can easily set with standard JVM options.

> Searching memory-calculator just shows you some random pages. I need to know kpack, buildpacks, java-azure buildpacks, memory-calculator, and how it is invoked before I know why it kicks in my process and blocks it from starting. Eventually, I may get to the page https://paketo.io/docs/howto/java/#configuring-jvm-at-runtime and I need to specify every configuration in JAVA_TOOL_OPTIONS to override what are going to be specified by memory-calculator.

And if the memory calculator's failure message linked directly to this page (or perhaps a configurable page?) as I proposed, would that be sufficient? It certainly removes the need to search for anything, and it still enforces the good practice of addressing memory issues upfront.

> I understand setting some "reasonable" configuration may help in some cases, however, it's difficult to judge the real user requirements. It should not be mandatory. If it's not going to make a good guess, it may be better to let the user decide what they want rather than forcing them to learn every aspect of the internal details and do it as required.

That's what's happening. It's telling the user that user intervention is required. At the 512M container limit, it's not possible to configure the JVM in ways that are not potentially going to impact performance or cause other potential issues (i.e. there are trade-offs if you go that low). Thus it fails and instructs the user that their intervention is required. It is definitely not doing this in a very clear way, no arguments there, which is why I think we should improve the output of the memory calculator.

If we just disable the memory calculator or if we have it WARN but not fail, then when users run with 512M of RAM they are in a precarious situation. If enough load hits and there is enough memory pressure, the app will hit its limit and fail in ways that could be unpredictable. It might fail with an OOME, it might not since the JVM thinks it has access to more memory. What you may end up with is the container going over its limit and causing the OOM killer to be invoked, a fate that isn't as clear in terms of how you should proceed (it doesn't tell you what caused the memory to go over the limit). I've seen investigations along these lines consume quite a bit of time. That's why the memory calculator forces you to do this upfront. It's telling you exactly how much memory you need & how much you have (i.e. your limit), you just need to adjust the memory settings so everything will fit.

allxiao commented 3 years ago

> In the case of running within 512M of RAM, that likely means lowering the heap size (and accepting that your app can run with less heap space), lowering the thread stack size (and recognizing that you might end up with StackOverflowErrors), lowering the code cache size (and recognizing that this might impact performance, as the JVM will have less space to cache JIT'd code), or some combination of the three.

We do not expect the user to run an application that requires 1Gi of memory or more in a 512Mi container. The same applies when an application requires 2Gi or more: it won't fit in a 1Gi container. Memory-calculator won't be able to do static analysis and mitigate such cases. In our production environment, for applications that require more memory, the user chooses 4Gi ~ 8Gi of memory, with multiple replicas.

Some scenarios in which the user needs small containers:

> The way a user sets these standard JVM memory arguments is through $JAVA_TOOL_OPTIONS which is a standard environment variable that's supported by the JVM. It's not specific to memory calculator or buildpacks in any way. It is just a convenient way to set these values as opposed to passing them in as arguments (nothing is stopping you from passing them in as arguments, I believe that should work fine too).

If I only adjust one of the JVM settings, memory-calculator is going to fill in the other settings, which is going to exceed the limit. So "I need to specify every configuration in JAVA_TOOL_OPTIONS to override what are going to be specified by memory-calculator" to make a perfect guess. However, that requires much more investigation of all the JVM arguments, and also an analysis of the real application's JVM memory use, before I can set reasonable values and proceed with testing. In our past experience, a lot of users just walk away and stop evaluating the product.

> If we just disable the memory calculator or if we have it WARN but not fail, then when users run with 512M of RAM they are in a precarious situation.

That's theoretically true. In the scenarios mentioned above, whether it fails (or not) some days later is not the user's top concern. If that happens, they move to larger containers (that's an easy answer both for the user and for our support). For serious cases, they would reserve more memory buffer rather than insist on resource-limited containers. However, in the current situation, we need to explain in detail how they can get the application running on Day 1 (if we get the chance). As the discussion here shows, it's not easy to explain.

dmikusa commented 3 years ago

> If I only adjust one of the JVM settings, memory-calculator is going to fill the other settings, which is going to exceed the limit. So "I need to specify every configuration in JAVA_TOOL_OPTIONS to override what are going to be specified by memory-calculator" to make a perfect guess.

This is not correct. If you adjust the memory settings and lower thread stack size or code cache size (or other listed entry), it'll be sufficient, you just need to ensure you're lowering it sufficiently. The memory limit is the limit.

In your example, you have 631278K (about 617M) required but a limit of 512M. That's a difference of about 105M. If you lower the thread stack size to 256k, that's a reduction of about 187.5M (250 threads at 768K less each), which is plenty for the app to start up.

> docker run -p 8080:8080 -m 512M -e JAVA_TOOL_OPTIONS='-Xss256k' apps/spring-music
Setting Active Processor Count to 6
Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx71036K -XX:MaxMetaspaceSize=133251K -XX:ReservedCodeCacheSize=240M (Total Memory: 512M, Thread Count: 250, Loaded Class Count: 21112, Headroom: 0%)

The heap is dynamically adjusted to what's left, which in this case amounts to -Xmx71036K. That was sufficient in the case of Spring Music, and it could be for other apps. It would be up to the app dev to know. In fact, I was able to run a Spring WebFlux app under the 512M limit without adjusting any settings.
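The "heap is what's left" arithmetic can be reproduced from the calculated configuration printed above. This is a small Python sketch under stated assumptions: the helper `heap_left_kib` is hypothetical, all inputs are copied from that output line, and the real calculator's rounding may differ slightly.

```python
KIB_PER_MIB = 1024


def heap_left_kib(limit_kib, direct_kib, metaspace_kib, code_cache_kib,
                  stack_kib, threads):
    """Heap available once the fixed regions are subtracted from the limit."""
    fixed_kib = direct_kib + metaspace_kib + code_cache_kib + stack_kib * threads
    return limit_kib - fixed_kib


# Inputs taken from the output above: 512M total, 250 threads, and the
# -Xss256k set through JAVA_TOOL_OPTIONS.
heap_kib = heap_left_kib(
    limit_kib=512 * KIB_PER_MIB,
    direct_kib=10 * KIB_PER_MIB,        # -XX:MaxDirectMemorySize=10M
    metaspace_kib=133251,               # -XX:MaxMetaspaceSize=133251K
    code_cache_kib=240 * KIB_PER_MIB,   # -XX:ReservedCodeCacheSize=240M
    stack_kib=256,                      # -Xss256k
    threads=250,
)

# Stack memory saved by going from the 1M default in the thread to 256k.
saved_kib = 250 * (1 * KIB_PER_MIB - 256)

print(f"heap left: {heap_kib}K")
print(f"stack memory saved: {saved_kib}K ({saved_kib / KIB_PER_MIB}M)")
```

The result lands within 1K of the -Xmx71036K shown in the output, and the stack saving works out to 192000K, or 187.5M.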

As always, an app dev can adjust memory settings further if required but the requirement is from their app not from anything the buildpack is doing. If they know what memory settings are required, as you said, it should be easy to set them because we use the same JVM options.

The detailed output of the memory calculator and the protections it offers help users understand what is actually required to have their apps run successfully and where the memory they are paying for is being used. It allows for more informed users and avoids common pitfalls. If a few users/use-cases choose to ignore that because it's inconvenient, I'm not sure that justifies disabling the feature. I'd be happy to invest some time into making the output more helpful and easier for users to act on, but I don't really want to disable what is a useful feature for many.

dmikusa commented 3 years ago

@allxiao - I have proposed an RFC for the Java buildpacks that might help with some of the problems you've described. You can check it out here & comment if you have suggestions/concerns.

https://github.com/paketo-buildpacks/rfcs/pull/111

tigerinus commented 2 years ago

> @allxiao - I have proposed an RFC for the Java buildpacks that might help with some of the problems you've described. You can check it out here & comment if you have suggestions/concerns.
>
> paketo-buildpacks/rfcs#111

I saw that this RFC was merged to master about 2 weeks ago. Is there a timeline we can expect for the actual change for the low-profile scenarios? Thanks.