peterschaer opened this issue 10 months ago
It is true, we have also noticed that the image 3.30 runs less stably in OpenShift than the image 3.29. I performed some local tests but have not managed to reproduce the error-prone behavior, so we still run the image 3.29.
@svamaa You run the image 3.30 and restart the container after each print. Is that correct?
We run the image 3.30.6 with very high resource limits (7 GB) in Kubernetes. We have configured the pods to restart after 8 GB of RAM has been used.
We run the image 3.29 with 500m CPU and 1100Mi memory and restart the pods every 2 hours. My impression is that the image requires quite a lot of resources for its task. Is there something related to this image that might cause unwanted behaviour in OpenShift? How could we proceed, @peterschaer, @jwkaltz?
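For context, the limits above correspond roughly to the following (a hedged sketch; the deployment name mapfish-print is illustrative, and oc can be replaced by kubectl outside OpenShift). The two-hourly restart is configured separately and not shown here.
$ oc set resources deployment/mapfish-print \
    --requests=cpu=500m,memory=1100Mi \
    --limits=cpu=500m,memory=1100Mi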
Sorry for the late reply, I was busy with other things. Anyway: @jwkaltz is there a reason why mapfish_print suddenly needs so many more resources? And is there a recommendation on how many resources (especially RAM) mapfish_print needs? That would help us a lot.
@peterschaer I will discuss it with the main contributors of MapFish-Print. I do note that this phenomenon does not occur in most projects using MapFish-Print: there are reports here from you and Basel-Land, but it seems the other cantons do not experience this with pyramid_oereb, and we also did not notice this effect in the geoportals running MapFish-Print. We do have a report from Basel-Stadt's geoportal about a memory leak; however, that was already the case in older versions of MapFish-Print.
Furthermore, please note that MapFish-Print is currently undergoing major maintenance in order to update the libraries it uses and to make debugging easier. I do hope that this will help us to clarify what is going on in your (and Basel-Land's) use case.
@peterschaer @michmuel can you please confirm that, in your environment, you have more problems with MapFish-Print 3.30 than with 3.29, and can you provide any more details? There is ongoing work to improve MapFish-Print in general, but if you have anything more specific we should open a new issue in https://github.com/mapfish/mapfish-print
@jwkaltz I can confirm that we have more problems with MapFish-Print 3.30 than with 3.29. At the moment I do not have more details beyond what I have already listed in the initial post of this issue.
@jwkaltz we still have problems with the new MFP 3.30.11, configured with -Xmx2G -Dmapfish.maxContentLength=8388608
Allocated RAM has reached almost 4 GB and we observe strange behaviors.
@lopo977 can you please provide, as a file, the print specification that MapFish-Print is processing when you have errors? This should be visible in the MapFish-Print logs. This way, we can try to reproduce the error in a local development environment.
Here it is... this file is sometimes processed correctly in 3.30.11, and at random MFP reports an error: spec.json
After some time something weird happens, see out.log
In this case:
Caused by: java.lang.NullPointerException: Inflater has been closed
931c3587-9cf0-49dc-a11a-f61a547446c3@0b960d38-fd27-445e-890e-d44bde71c413
java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.base/java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1006)
at org.mapfish.print.output.AbstractJasperReportOutputFormat.getJasperPrint(AbstractJasperReportOutputFormat.java:160)
at org.mapfish.print.output.AbstractJasperReportOutputFormat.print(AbstractJasperReportOutputFormat.java:104)
at org.mapfish.print.MapPrinter.print(MapPrinter.java:133)
at org.mapfish.print.servlet.job.PrintJob.lambda$call$0(PrintJob.java:150)
at org.mapfish.print.servlet.job.PrintJob.withOpenOutputStream(PrintJob.java:120)
at org.mapfish.print.servlet.job.PrintJob.call(PrintJob.java:149)
at org.mapfish.print.servlet.job.PrintJob.call(PrintJob.java:56)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException: null
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:603)
... 12 common frames omitted
Caused by: java.lang.NullPointerException: null
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:603)
at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
at java.base/java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:722)
at org.mapfish.print.processor.ProcessorDependencyGraph.tryExecuteNodes(ProcessorDependencyGraph.java:63)
at org.mapfish.print.processor.ProcessorDependencyGraph$ProcessorGraphForkJoinTask.lambda$compute$0(ProcessorDependencyGraph.java:212)
at org.mapfish.print.processor.AbstractProcessor$Context.mdcContext(AbstractProcessor.java:248)
at org.mapfish.print.processor.ProcessorDependencyGraph$ProcessorGraphForkJoinTask.compute(ProcessorDependencyGraph.java:208)
at org.mapfish.print.processor.ProcessorDependencyGraph$ProcessorGraphForkJoinTask.compute(ProcessorDependencyGraph.java:185)
at java.base/java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:94)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.NullPointerException: Inflater has been closed
at java.base/java.util.zip.Inflater.ensureOpen(Inflater.java:740)
at java.base/java.util.zip.Inflater.inflate(Inflater.java:377)
at java.base/java.util.zip.InflaterInputStream.read(InflaterInputStream.java:153)
at java.base/java.util.zip.GZIPInputStream.read(GZIPInputStream.java:118)
at java.base/java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.http.client.entity.LazyDecompressingInputStream.read(LazyDecompressingInputStream.java:64)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1309)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:978)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1282)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:953)
at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpResponse.<init>(HttpRequestFetcher.java:96)
at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpResponse.<init>(HttpRequestFetcher.java:81)
at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpRequest.lambda$call$0(HttpRequestFetcher.java:212)
at org.mapfish.print.processor.AbstractProcessor$Context.mdcContextEx(AbstractProcessor.java:269)
at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpRequest.call(HttpRequestFetcher.java:204)
at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpRequest.call(HttpRequestFetcher.java:138)
at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1448)
... 5 common frames omitted
@jwkaltz
I tested MFP 3.30.11 in OpenShift by repeatedly calling the PDF for the same real estate. After some successful requests, MFP runs into a timeout. I did not observe this behavior with MFP 3.29.
The anomalous behaviors of MFP appear as time passes and the allocated memory keeps growing. At some point, I think when it can no longer allocate more memory, the strange behaviors start. They are not related to a specific EGRID. With particularly complex parcels you may reach a critical memory allocation sooner. The anomalies are the same as those reported by the other authors in this issue. I suspect a memory leak.
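If it helps to narrow this down, here is a hedged sketch of commands that could be used to check whether the Java heap or the overall container memory keeps growing (it assumes a JDK with jcmd is available inside the container, that the Tomcat JVM runs as PID 1, and that metrics-server is installed; the pod name is illustrative):
$ kubectl top pod mapfish-print-0                                   # container memory as seen by Kubernetes
$ kubectl exec mapfish-print-0 -- jcmd 1 GC.heap_info               # current Java heap usage
$ kubectl exec mapfish-print-0 -- jcmd 1 GC.class_histogram | head -n 30   # largest object classes on the heap
Comparing the heap figures with the container's resident memory over time should show whether the growth is on the Java heap or elsewhere.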
@jwkaltz Is any new information available concerning the image 3.30.11? Are there known issues? Does it run stably at Swisstopo?
@lopo977 @michmuel can you please retest with the latest release, v3.31.1 ?
[UPDATE] Please hold off on testing. When I test it locally, I get an error; I am checking with the developers.
@lopo977 @michmuel the version v3.31.1 is broken, but this morning it was fixed in a new release, v3.31.2. I have tested it locally with the development data of pyramid_oereb, and PDF generation via MapFish-Print works with this release. Can you please test this release in your environment? Please note that this release includes significant enhancements in metrics gathering and in healthcheck status visibility. Notably, you can include this healthcheck reporting in your Kubernetes supervision if you like. This data is available at (you may need to adapt the localhost:8680 part depending on your setup):
http://localhost:8680/metrics/metrics
http://localhost:8680/metrics/healthcheck
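For reference, a quick way to pull those endpoints from a shell, for example before wiring the healthcheck into a Kubernetes liveness or readiness probe (adjust host and port to your setup; jq is optional and only used for pretty-printing):
$ curl -s http://localhost:8680/metrics/healthcheck | jq .   # overall health status of the print service
$ curl -s http://localhost:8680/metrics/metrics | jq .       # detailed runtime metrics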
Running MPF v3.31.2, our system is experiencing slower response times and a high number of timeouts.
[UPDATE] It seems there was an issue with the Geodienste WMS. I have now set the version to test.
Running MPF v3.31.2 appears to be unstable. After less than 6 hours it no longer works, with the following error:
16:29:23.996 [ForkJoinPool-1-worker-251] INFO o.m.p.processor.ProcessorGraphNode - Time taken to run processor: 'class org.mapfish.print.processor.map.AddStaticLayersProcessor' was 0 ms
16:29:23.997 [ForkJoinPool-1-worker-251] INFO o.m.p.processor.ProcessorGraphNode - Time taken to run processor: 'class org.mapfish.print.processor.map.AddStaticLayersProcessor' was 0 ms
16:29:23.997 [ForkJoinPool-1-worker-251] INFO o.m.p.processor.ProcessorGraphNode - Time taken to run processor: 'class org.mapfish.print.processor.map.SetStyleProcessor' was 0 ms
16:29:23.997 [ForkJoinPool-1-worker-251] INFO o.m.p.processor.ProcessorGraphNode - Time taken to run processor: 'class org.mapfish.print.processor.map.SetFeaturesProcessor' was 0 ms
16:29:24.003 [ForkJoinPool-1-worker-251] INFO o.m.print.map.image.wms.WmsLayer - Query the WMS image https://geodienste.ch/db/av_situationsplan_oereb_0/deu?TRANSPARENT=true&FORMAT=image%2Fpng&STYLES=&WIDTH=1740&HEIGHT=988&LAYERS=daten&REQUEST=GetMap&SRS=EPSG%3A2056&BBOX=2721674.0670634615%2C1116438.5887307692%2C2721948.2869365388%2C1116594.3322692309&VERSION=1.1.1&SERVICE=WMS.
16:29:24.616 [http-nio-8080-exec-1] WARN o.m.print.servlet.BaseMapServlet - Error while processing request: task canceled (timeout)
16:29:27.164 [http-nio-8080-exec-1] INFO o.m.p.s.j.impl.ThreadPoolJobManager - Submitted print job 6a10e91f-6034-4eb3-9267-f0066234dae3@2b8a841d-bd06-4408-a1cf-4f206a81ae14
@lopo977 In 3.31.x there were quite a lot of changes in the way the logs are written. In your example, the logger level is INFO and WARN, not ERROR. The warning you mention is about something that went wrong, but it is not an error that stops MFP from running as usual. Among the 3.31.x changes: MFP no longer hides exceptions and the root cause of problems, so the sheer number of log entries is not meaningful on its own. Could you clarify what you mean by more unstable? It would also help a lot if you shared the output of the metrics and healthcheck endpoints mentioned by @jwkaltz.
@sebr72 @jwkaltz Running MPF v3.31.2... the status after some time (not that long): CPU load near 0, MEM 1.46 GB, NET load near 0.
Request/response cycles (each returns 504 Gateway Time-out, the server didn't respond in time):
https://cloud.geo.ti.ch/oereb2/extract/pdf/?EGRID=CH568802070726
https://cloud.geo.ti.ch/oereb2/extract/pdf/?EGRID=CH540702893564
https://cloud.geo.ti.ch/oereb2/extract/pdf/?EGRID=CH769002071095
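A minimal sketch of a loop that could be used to reproduce these request/response cycles (substitute your own base URL and EGRIDs):
$ # prints the EGRID, HTTP status code and total request time for each call
$ for egrid in CH568802070726 CH540702893564 CH769002071095; do
    curl -s -o /dev/null -w "$egrid %{http_code} %{time_total}s\n" \
      "https://cloud.geo.ti.ch/oereb2/extract/pdf/?EGRID=$egrid"
  done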
[metrics] metrics.json
[log] mfp.log
I cannot put this version in production. Version 3.29 works very well.
I have to revert the production system to MFP 3.29, I am sorry. I have provided all the feedback I have at this moment.
It seems that jobs are queued (looking at healthcheck.json) "WIP: Number of print jobs queued is above threshold: 4. Number of print jobs waiting is 13"
Could you help us please?
Regards.
Good afternoon
Mr Riollet (@sebr72), as agreed, I am sending the MFP code and 3 spec files that you can paste into the form at http://localhost:8680/.
If you then push the “create and get print” button a few times, on 3 different tabs with the 3 specs in 3 forms, the program stalls after a few attempts.
To start MFP, type: $ make serve
Thank you for supporting us. attachments.zip
Regards, Luca
We have upgraded our application to pyramid_oereb 2.4.8. pyramid_oereb_mfp uses the latest commit (8d9da5ab3a8d1a8469fd693e4f86c5870ee9f6dc). At the same time, we have upgraded mapfish_print, as recommended in https://github.com/openoereb/pyramid_oereb_mfp/blob/8d9da5ab3a8d1a8469fd693e4f86c5870ee9f6dc/docker-compose.yml#L8, to version 3.30.2. However, this combination causes problems in our test environment. Everything actually works perfectly, with the exception of PDF creation, which does not run stably. On the one hand, certain complex extracts (large geometries, many PLRs) are not created at all or only very rarely (HTTP error 500); on the other hand, a less complex call sometimes results in an HTTP error 500 if it is repeated several times in succession. In the application log file, we observe errors such as
12:41:41.866 [http-nio-8080-exec-9] WARN o.m.print.servlet.BaseMapServlet - Error while processing request: java.lang.NullPointerException: Inflater has been closed
I have attached all log messages of an erroneous request: single_failed_request.log and here is a log of multiple successful and failed requests: reporting_oereb2_test.zip
Our application runs in a Kubernetes cluster. The mapfish_print pod has 1.8 CPU and 2.2 GB RAM assigned to it. The image is configured with
-Xmx512M -Dmapfish.maxContentLength=8388608
In the production environment (pyramid_oereb 2.4.4 and mapfish_print 3.29.4) the pods run with exactly the same resources and these problems do not occur. My question now: have other cantons also observed this behavior? I assume that at least @michmuel has observed something similar (https://github.com/openoereb/pyramid_oereb/pull/1814#issuecomment-1711359398). Does anyone know what these errors mean? Lack of resources? Other causes?
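As a side note, here is a hedged sketch of how these JVM options can be passed when running the image standalone for local testing (it assumes the camptocamp/mapfish_print image honors CATALINA_OPTS like a standard Tomcat image; the image tag and port mapping are illustrative):
$ # expose Tomcat's port 8080 locally on 8680 and pass the JVM options via CATALINA_OPTS
$ docker run --rm -p 8680:8080 \
    -e CATALINA_OPTS="-Xmx512M -Dmapfish.maxContentLength=8388608" \
    camptocamp/mapfish_print:3.30.2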