openoereb / pyramid_oereb_mfp

Implementation of a static extract (PDF export) for the pyramid_oereb server with MapFish-Print

mapfish_print 3.30.2 not running stable #145

Open peterschaer opened 10 months ago

peterschaer commented 10 months ago

We have upgraded our application to pyramid_oereb 2.4.8, with pyramid_oereb_mfp at the latest commit (8d9da5ab3a8d1a8469fd693e4f86c5870ee9f6dc). At the same time, we upgraded mapfish_print to version 3.30.2, as recommended in https://github.com/openoereb/pyramid_oereb_mfp/blob/8d9da5ab3a8d1a8469fd693e4f86c5870ee9f6dc/docker-compose.yml#L8. However, this combination causes problems in our test environment. Everything works perfectly except PDF creation, which does not run stably: certain complex extracts (large geometries, many PLRs) are not created at all or only very rarely (HTTP error 500), and even a less complex call sometimes results in an HTTP error 500 when it is repeated several times in succession. In the application log file, we observe errors such as

12:41:41.866 [http-nio-8080-exec-9] WARN o.m.print.servlet.BaseMapServlet - Error while processing request: java.lang.NullPointerException: Inflater has been closed

I have attached all log messages of a failing request (single_failed_request.log) and a log of multiple successful and failed requests (reporting_oereb2_test.zip).

Our application runs in a Kubernetes cluster. The mapfish_print pod has 1.8 CPU and 2.2 GB RAM assigned to it. The image is configured with -Xmx512M -Dmapfish.maxContentLength=8388608
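
For reference, here is a minimal sketch of how these flags and limits could be wired up when running the image directly with Docker (the CATALINA_OPTS variable name, image tag, and port mapping are assumptions and may differ per deployment; in Kubernetes the CPU/RAM budget would go under resources.limits instead):

# Run camptocamp/mapfish_print with the JVM flags and the CPU/RAM budget
# described above. Adjust names, tag and ports to your environment.
docker run --rm \
  --cpus=1.8 --memory=2200m \
  -e CATALINA_OPTS="-Xmx512M -Dmapfish.maxContentLength=8388608" \
  -p 8680:8080 \
  camptocamp/mapfish_print:3.30.2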

In the production environment (pyramid_oereb 2.4.4 and mapfish_print 3.29.4) the pods run with exactly the same resources and these problems do not occur. My question now: have other cantons also observed this behavior? I assume that at least @michmuel has observed something similar (https://github.com/openoereb/pyramid_oereb/pull/1814#issuecomment-1711359398). Does anyone know what these errors mean? Lack of resources? Other causes?

michmuel commented 9 months ago

It is true, we have also noticed that image 3.30 runs less stably in OpenShift than image 3.29. I performed some local tests but have not managed to reproduce the error-prone behavior, so we still run image 3.29.

@svamaa You run image 3.30 and restart the container after each print. Is that correct?

ghost commented 9 months ago

We run image 3.30.6 with very high resource limits (7G) in Kubernetes and have configured the pods to restart after 8G of RAM has been used.

michmuel commented 9 months ago

We run image 3.29 with 500m CPU and 1100Mi memory and restart the pods every 2 hours. My impression is that the image requires quite a lot of resources for its task. Is there something about this image that might cause unwanted behaviour in OpenShift? How could we proceed, @peterschaer, @jwkaltz?
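
For reference, one minimal way to implement such a periodic restart (a sketch; the namespace and deployment name are placeholders, and the command would be triggered every 2 hours by a CronJob or an external scheduler):

# Trigger a rolling restart of the MapFish-Print deployment.
# "oereb" and "mapfish-print" are placeholder names.
kubectl -n oereb rollout restart deployment/mapfish-print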

peterschaer commented 8 months ago

Sorry for the late reply, I was busy with other things. Anyway: @jwkaltz is there a reason why mapfish_print suddenly needs so many more resources? And is there a recommendation on how many resources (especially RAM) mapfish_print needs? That would help us a lot.

jwkaltz commented 8 months ago

@peterschaer I will discuss it with the main contributors of MapFish-Print. I do note that this phenomenon does not occur in most projects using MapFish-Print: there are reports here from you and Basel-Land, but the other cantons do not seem to experience this with pyramid_oereb, and we have not noticed this effect in the geoportals running MapFish-Print either. We do have a report from Basel-Stadt's geoportal about a memory leak, but that is already present in older versions of MapFish-Print.

Furthermore, please note that MapFish-Print is currently undergoing major maintenance work to update the libraries it uses and to make debugging easier. I hope this will help us clarify what is going on in your (and Basel-Land's) use case.

jwkaltz commented 3 months ago

@peterschaer @michmuel can you please confirm that, in your environment, you have more problems with MapFish-Print 3.30 than with 3.29, and can you provide any more details? There is ongoing work to improve MapFish-Print in general, but if you have anything more specific we should open a new issue in https://github.com/mapfish/mapfish-print

peterschaer commented 2 months ago

@jwkaltz I can confirm that we have more problems with MapFish-Print 3.30 than with 3.29. At the moment I do not have more details than those I have already listed in the initial post of this issue.

lopo977 commented 2 months ago

@jwkaltz we still have problems with the new MFP 3.30.11, configured with -Xmx2G -Dmapfish.maxContentLength=8388608.

Allocated RAM has reached almost 4 GB and we see strange behaviors:

CH657302880741-22.pdf

jwkaltz commented 2 months ago

@lopo977 can you please provide, as a file, the print specification that MapFish-Print is processing when you have errors? This should be visible in the MapFish-Print logs. This way, we can try to reproduce the error in a local development environment.

lopo977 commented 2 months ago


Here we are... this file is sometimes processed correctly in 3.30.11, and sometimes MFP randomly reports an error: spec.json

lopo977 commented 2 months ago

After some time something weird happens, see for example out.log.

In this case:

Caused by: java.lang.NullPointerException: Inflater has been closed

931c3587-9cf0-49dc-a11a-f61a547446c3@0b960d38-fd27-445e-890e-d44bde71c413
java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.base/java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1006)
    at org.mapfish.print.output.AbstractJasperReportOutputFormat.getJasperPrint(AbstractJasperReportOutputFormat.java:160)
    at org.mapfish.print.output.AbstractJasperReportOutputFormat.print(AbstractJasperReportOutputFormat.java:104)
    at org.mapfish.print.MapPrinter.print(MapPrinter.java:133)
    at org.mapfish.print.servlet.job.PrintJob.lambda$call$0(PrintJob.java:150)
    at org.mapfish.print.servlet.job.PrintJob.withOpenOutputStream(PrintJob.java:120)
    at org.mapfish.print.servlet.job.PrintJob.call(PrintJob.java:149)
    at org.mapfish.print.servlet.job.PrintJob.call(PrintJob.java:56)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException: null
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:603)
    ... 12 common frames omitted
Caused by: java.lang.NullPointerException: null
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:603)
    at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
    at java.base/java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:722)
    at org.mapfish.print.processor.ProcessorDependencyGraph.tryExecuteNodes(ProcessorDependencyGraph.java:63)
    at org.mapfish.print.processor.ProcessorDependencyGraph$ProcessorGraphForkJoinTask.lambda$compute$0(ProcessorDependencyGraph.java:212)
    at org.mapfish.print.processor.AbstractProcessor$Context.mdcContext(AbstractProcessor.java:248)
    at org.mapfish.print.processor.ProcessorDependencyGraph$ProcessorGraphForkJoinTask.compute(ProcessorDependencyGraph.java:208)
    at org.mapfish.print.processor.ProcessorDependencyGraph$ProcessorGraphForkJoinTask.compute(ProcessorDependencyGraph.java:185)
    at java.base/java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:94)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.NullPointerException: Inflater has been closed
    at java.base/java.util.zip.Inflater.ensureOpen(Inflater.java:740)
    at java.base/java.util.zip.Inflater.inflate(Inflater.java:377)
    at java.base/java.util.zip.InflaterInputStream.read(InflaterInputStream.java:153)
    at java.base/java.util.zip.GZIPInputStream.read(GZIPInputStream.java:118)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.http.client.entity.LazyDecompressingInputStream.read(LazyDecompressingInputStream.java:64)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1309)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:978)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1282)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:953)
    at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpResponse.<init>(HttpRequestFetcher.java:96)
    at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpResponse.<init>(HttpRequestFetcher.java:81)
    at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpRequest.lambda$call$0(HttpRequestFetcher.java:212)
    at org.mapfish.print.processor.AbstractProcessor$Context.mdcContextEx(AbstractProcessor.java:269)
    at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpRequest.call(HttpRequestFetcher.java:204)
    at org.mapfish.print.http.HttpRequestFetcher$CachedClientHttpRequest.call(HttpRequestFetcher.java:138)
    at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1448)
    ... 5 common frames omitted
michmuel commented 2 months ago

@jwkaltz

I tested MFP 3.30.11 in OpenShift by repeatedly calling the PDF for the same real estate. After some successful requests MFP runs into a timeout. I did not observe this behavior with MFP 3.29.

log-27-Sep-2024-edited.txt
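
For reference, a minimal sketch of such a repeated-request test as a script (the base URL and EGRID are placeholders; adapt them to your environment):

# Request the static extract for one EGRID in a loop and log the HTTP status
# of each attempt, to see after how many requests the errors or timeouts start.
BASE="https://example.org/oereb/extract/pdf/"   # placeholder
EGRID="CH000000000000"                          # placeholder
for i in $(seq 1 20); do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 300 "${BASE}?EGRID=${EGRID}")
  echo "$(date +%T) request ${i}: HTTP ${code}"
done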

lopo977 commented 2 months ago

@jwkaltz


The anomalous behaviors of MFP appear as time passes and the allocated memory keeps growing. At some point, I think when it can no longer allocate more memory, the strange behaviors start. It is not related to a specific EGRID; with particularly complex parcels you may simply reach a critical memory allocation sooner. The anomalies are the same as those reported by the other authors in this issue. I suspect a memory leak.
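
One way to substantiate the memory-leak suspicion would be to watch the JVM heap inside the running container, for example with jstat (a sketch; it assumes a JDK with jstat and pgrep is available in the image, and the container name is a placeholder):

# Print GC/heap utilisation every 10 seconds; an old generation (O) and
# full-GC count (FGC) that keep growing across many print jobs would
# support the memory-leak hypothesis.
docker exec mapfish_print sh -c 'jstat -gcutil "$(pgrep -o java)" 10000'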

michmuel commented 1 month ago

@jwkaltz Is new information available concerning image 3.30.11? Are there known issues? Does it run stably at Swisstopo?

jwkaltz commented 1 month ago

@lopo977 @michmuel can you please retest with the latest release, v3.31.1?

[UPDATE] Please hold off on testing for now. When I test it locally I get an error; I am checking with the developers.

jwkaltz commented 1 month ago

@lopo977 @michmuel the version v3.31.1 is broken, but it was fixed this morning in a new release, v3.31.2. I have tested v3.31.2 locally with the development data of pyramid_oereb, and PDF generation via MapFish-Print works with this release. Can you please test it in your environment? Please note that this release also includes significant enhancements in metrics gathering and in healthcheck status visibility; you can include this healthcheck reporting in your Kubernetes supervision if you like. The data is available at the following URLs (you may need to adapt the localhost:8680 part depending on your setup):

http://localhost:8680/metrics/metrics
http://localhost:8680/metrics/healthcheck
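
For example, both endpoints can be polled with curl (a minimal sketch; jq is optional and only pretty-prints the JSON, and the host/port must match your setup):

# Pull the current health status and a metrics snapshot from MapFish-Print.
curl -s http://localhost:8680/metrics/healthcheck | jq .
curl -s http://localhost:8680/metrics/metrics | jq .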

lopo977 commented 1 month ago

Running MFP v3.31.2, our system is experiencing slower response times and a high number of timeouts.

[UPDATE] It seems there was an issue with the Geodienste WMS. I have now put this version back into testing.

lopo977 commented 4 weeks ago

Running MFP v3.31.2 appears to be unstable. After less than 6 h it doesn't work anymore, with the following error:

16:29:23.996 [ForkJoinPool-1-worker-251] INFO  o.m.p.processor.ProcessorGraphNode - Time taken to run processor: 'class org.mapfish.print.processor.map.AddStaticLayersProcessor' was 0 ms
16:29:23.997 [ForkJoinPool-1-worker-251] INFO  o.m.p.processor.ProcessorGraphNode - Time taken to run processor: 'class org.mapfish.print.processor.map.AddStaticLayersProcessor' was 0 ms
16:29:23.997 [ForkJoinPool-1-worker-251] INFO  o.m.p.processor.ProcessorGraphNode - Time taken to run processor: 'class org.mapfish.print.processor.map.SetStyleProcessor' was 0 ms
16:29:23.997 [ForkJoinPool-1-worker-251] INFO  o.m.p.processor.ProcessorGraphNode - Time taken to run processor: 'class org.mapfish.print.processor.map.SetFeaturesProcessor' was 0 ms
16:29:24.003 [ForkJoinPool-1-worker-251] INFO  o.m.print.map.image.wms.WmsLayer - Query the WMS image https://geodienste.ch/db/av_situationsplan_oereb_0/deu?TRANSPARENT=true&FORMAT=image%2Fpng&STYLES=&WIDTH=1740&HEIGHT=988&LAYERS=daten&REQUEST=GetMap&SRS=EPSG%3A2056&BBOX=2721674.0670634615%2C1116438.5887307692%2C2721948.2869365388%2C1116594.3322692309&VERSION=1.1.1&SERVICE=WMS.
16:29:24.616 [http-nio-8080-exec-1] WARN  o.m.print.servlet.BaseMapServlet - Error while processing request: task canceled (timeout)
16:29:27.164 [http-nio-8080-exec-1] INFO  o.m.p.s.j.impl.ThreadPoolJobManager - Submitted print job 6a10e91f-6034-4eb3-9267-f0066234dae3@2b8a841d-bd06-4408-a1cf-4f206a81ae14
sebr72 commented 4 weeks ago

@lopo977 In 3.31.x there were quite a lot of changes in the way the logs are written. In your example, the logger level is INFO and WARN, not ERROR. The warning you mentioned indicates that something went wrong, but not an error that stops MFP from running as usual. Among the 3.31.x changes, MFP no longer hides exceptions and the root causes of problems, so the number of log entries is not a relevant indicator. Could you clarify what you mean by more unstable? It would also help a lot if you shared the output of the metrics and healthcheck endpoints mentioned by @jwkaltz.

lopo977 commented 4 weeks ago

@sebr72 @jwkaltz Running MFP v3.31.2, the status after some time (not very long): CPU load near 0, MEM 1.46 GB, NET load near 0.

Request/response cycles: https://cloud.geo.ti.ch/oereb2/extract/pdf/?EGRID=CH568802070726 504 Gateway Time-out The server didn't respond in time.

https://cloud.geo.ti.ch/oereb2/extract/pdf/?EGRID=CH540702893564 504 Gateway Time-out The server didn't respond in time.

https://cloud.geo.ti.ch/oereb2/extract/pdf/?EGRID=CH769002071095 504 Gateway Time-out The server didn't respond in time.

[metrics] metrics.json

healthcheck.json

[log] mfp.log

I cannot put this version in production. Version 3.29 works very well.

I have to revert the production system to MFP 3.29. I am sorry; I have provided all the feedback I have at the moment.

Looking at healthcheck.json, it seems that jobs are queuing up: "WIP: Number of print jobs queued is above threshold: 4. Number of print jobs waiting is 13"

Could you help us please?

Regards.

lopo977 commented 2 weeks ago

Good afternoon Mr Riollet (@sebr72),

As agreed, I am sending the MFP code and 3 spec files that you can paste into the form at http://localhost:8680/.

If you then push the “create and get print” button a few times, in 3 different tabs with the 3 specs in 3 forms, the program stalls after a few attempts.

To start MFP, type: $ make serve
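
For reference, a scripted alternative to clicking the form, using MapFish-Print's standard REST endpoints (a sketch; the /print base path and the spec file names are assumptions, so check the URL the form at http://localhost:8680/ actually posts to; jq is used to pull the job reference out of the JSON response):

# Submit each spec as a print job and remember the returned job reference.
BASE="http://localhost:8680/print"                # base path is an assumption
for spec in spec1.json spec2.json spec3.json; do  # placeholder file names
  ref=$(curl -s -X POST -H 'Content-Type: application/json' \
        --data @"${spec}" "${BASE}/report.pdf" | jq -r '.ref')
  echo "submitted ${spec} -> job ${ref}"
done
# Then poll the job status and fetch the PDF once it is done:
#   curl -s "${BASE}/status/${ref}.json"
#   curl -s -o out.pdf "${BASE}/report/${ref}"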

Thank you for supporting us. attachments.zip

Regards, Luca