kartoza / geosafe

InaSAFE package for Geonode
GNU General Public License v3.0

Add diagnostic details for Analysis Task #523

Closed lucernae closed 5 years ago

lucernae commented 5 years ago

Summary

This PR is meant to address #520 and #518. It is not meant to solve the problem outright (that requires a great deal of exception handling architecture between InaSAFE Headless and InaSAFE); it is instead an intermediary step towards fixing it.

The motive:

This in no way solves the problem completely as of now, but the aim is to publish it to the prod server so we can replicate the issue with more context about what kind of analysis was being processed on the server.

lucernae commented 5 years ago

Screenshot

There is a new column in the Analysis List, which links to the diagnostic details:

[Screenshot: Analysis List with the new diagnostic details column]

If we click it, the page will look like this:

[Screenshot: Analysis Detail page with the diagnostic information]

The analysis detail page has a series of cards on the right with more info.

We also have a map in the center, because we obviously want to see the layer itself and the analysis extent.

Below is an example of a successful Analysis Detail:

[Screenshot: successful Analysis Detail page]
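
For anyone curious about the wiring behind the screenshots: the new column is essentially just a link from each Analysis List row to a detail view. A rough sketch of that idea, with hypothetical URL, template, and view names (the actual code may differ):

```python
# Hypothetical sketch of the wiring behind the new column: each row in the
# Analysis List links to a diagnostic detail view. Real URL names, templates,
# and the import path of the Analysis model may differ.
from django.conf.urls import url
from django.views.generic import DetailView

from geosafe.models import Analysis  # assumed import path


class AnalysisDetailView(DetailView):
    """Render the diagnostic detail page for a single analysis."""
    model = Analysis
    template_name = 'geosafe/analysis/detail.html'
    context_object_name = 'analysis'


urlpatterns = [
    url(r'^analysis/(?P<pk>\d+)/detail$',
        AnalysisDetailView.as_view(),
        name='analysis-detail'),
]
```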

lucernae commented 5 years ago

The layout and the entire suggestion text are open to feedback (they can easily be changed in the code).
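
For example, the suggestion text could sit in a plain lookup like the hypothetical sketch below; the real structure of helpers/suggestions/error_suggestions.py may differ, and the example messages are made up for illustration:

```python
# Hypothetical sketch of how error text could map to a human-readable
# suggestion; the real helpers/suggestions/error_suggestions.py may differ,
# and these example entries are illustrative only.
ERROR_SUGGESTIONS = {
    'MemoryError': (
        'The worker may have run out of memory. Try a smaller analysis '
        'extent or a lower-resolution layer.'),
    'TimeLimitExceeded': (
        'The analysis did not finish before the worker was restarted. '
        'The source layer may be too big to process in one run.'),
}

DEFAULT_SUGGESTION = 'No suggestion is available for this error yet.'


def suggestion_for(error_text):
    """Return the first suggestion whose key appears in the error text."""
    for key, suggestion in ERROR_SUGGESTIONS.items():
        if key in (error_text or ''):
            return suggestion
    return DEFAULT_SUGGESTION
```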

codecov[bot] commented 5 years ago

Codecov Report

Merging #523 into 2.8.x will increase coverage by 0.39%. The diff coverage is 66.41%.


@@            Coverage Diff            @@
##            2.8.x    #523      +/-   ##
=========================================
+ Coverage   56.21%   56.6%   +0.39%     
=========================================
  Files          48      49       +1     
  Lines        2576    2703     +127     
  Branches      284     298      +14     
=========================================
+ Hits         1448    1530      +82     
- Misses       1051    1089      +38     
- Partials       77      84       +7
Impacted Files Coverage Δ
celery_app.py 100% <ø> (ø)
__init__.py 100% <100%> (ø) ↑
default_settings.py 100% <100%> (ø) ↑
admin/admin.py 100% <100%> (ø) ↑
tasks/analysis.py 57.92% <29.41%> (-1.42%) ↓
models.py 65.11% <60.86%> (-2.01%) ↓
helpers/suggestions/error_suggestions.py 70.58% <70.58%> (ø)
signals.py 85.93% <0%> (-3.13%) ↓

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4ca115e...cfae77c.

gubuntu commented 5 years ago

nice work @lucernae LGTM

Your inputs, @cgiovando and @timlinux?

cgiovando commented 5 years ago

Thanks @lucernae - this looks good and is definitely an improvement. I'm wondering if we could also try to address the root issue, and better understand and mitigate why all these analyses have been failing. Is it hardware limitations, errors in the source data, or something with the InaSAFE functions?

lucernae commented 5 years ago

@cgiovando This is in no way the fix, but we need to deploy it to get at the root issue. What this PR does is collect, on the prod instance, the relevant information about what is causing the analyses to fail. When I checked the analyses earlier this month, all I could tell was that they were running but being restarted every day for some reason; that's why they never complete. This could have many causes, such as the hardware limitations you mention (memory constraints), or raster data that is too big to process in a day. We have no way of knowing, since the diagnostic data and the worker are purged every day for efficiency and reliability (so that if a memory constraint occurs, the task can still complete by restarting automatically). In other words, we have to restart the analyses with this PR deployed in prod to answer this:

Is it hardware limitations, errors in the source data or something with the InaSAFE functions?

My initial guess is a hardware limitation (the source layer is just too big to process). But the system doesn't have a way to propagate the error if the task is killed while it is still processing (because if it is still processing, it hasn't encountered an error yet). However, if the cause really comes from the code, the task will stop at some point, and we now have a way to persist the error it reports.
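
To illustrate the distinction, here is a minimal, hypothetical sketch (the field and helper names are made up, not the actual geosafe code): an exception raised by the analysis code can be caught and persisted, but a hard kill of the worker never reaches the except block.

```python
# Hypothetical illustration (not the actual geosafe code): a failure raised
# by the analysis code itself can be caught and persisted, but if the worker
# process is killed (OOM killer, daily restart), none of this ever runs.
import traceback


def run_with_diagnostics(analysis, run_func):
    try:
        run_func(analysis)  # raises if the analysis code itself fails
    except Exception:
        # Persist the traceback so it shows up on the detail page,
        # instead of vanishing when the worker is purged.
        analysis.error_traceback = traceback.format_exc()
        analysis.save()
        raise
    # If the process is SIGKILLed mid-run, neither branch executes, so the
    # analysis simply stays "running" with no recorded error.
```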

lucernae commented 5 years ago

In this PR, I was asking for feedback on the layout/UI side of things, since I had to deploy this to prod first anyway. If the layout/UI are OK, the next PR will focus on actually addressing the issue, depending on the cause we find.