kartoza / geosafe

InaSAFE package for Geonode
GNU General Public License v3.0

Add diagnostic details for Analysis Task #523

Closed lucernae closed 5 years ago

lucernae commented 5 years ago

Summary

This PR is meant to address #520 and #518. It is not meant to solve the problem outright (that requires a great deal of exception handling architecture between InaSAFE Headless and InaSAFE); it is instead an intermediary step towards fixing it.

The motive:

This in no way solves the problem completely as of now, but the aim is to publish it to the prod server so we can replicate the issue with more context about what kind of analysis was being processed on the server.

lucernae commented 5 years ago

Screenshot

There is a new column in the Analysis List, which links to the diagnostic details:

[Screenshot: Analysis List with the new diagnostic details column]

If we click it, the page will look like this:

[Screenshot: Analysis Detail page with the diagnostic information]

The analysis detail page has a series of cards on the right with more info.

We also have a map in the center, because we obviously want to see the layer itself and the analysis extent.

Below is an example of a successful Analysis Detail:

[Screenshot: successful Analysis Detail page]
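
For anyone curious about the wiring behind the screenshots: the new column is essentially just a link from each Analysis List row to a detail view. A rough sketch of that idea, with hypothetical URL, template, and view names (the actual code may differ):

```python
# Hypothetical sketch of the wiring behind the new column: each row in the
# Analysis List links to a diagnostic detail view. Real URL names, templates,
# and the import path of the Analysis model may differ.
from django.conf.urls import url
from django.views.generic import DetailView

from geosafe.models import Analysis  # assumed import path


class AnalysisDetailView(DetailView):
    """Render the diagnostic detail page for a single analysis."""
    model = Analysis
    template_name = 'geosafe/analysis/detail.html'
    context_object_name = 'analysis'


urlpatterns = [
    url(r'^analysis/(?P<pk>\d+)/detail$',
        AnalysisDetailView.as_view(),
        name='analysis-detail'),
]
```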

lucernae commented 5 years ago

The layout and the entire suggestion text are open to feedback (they can easily be changed in the code).
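
For example, the suggestion text could sit in a plain lookup like the hypothetical sketch below; the real structure of helpers/suggestions/error_suggestions.py may differ, and the example messages are made up for illustration:

```python
# Hypothetical sketch of how error text could map to a human-readable
# suggestion; the real helpers/suggestions/error_suggestions.py may differ,
# and these example entries are illustrative only.
ERROR_SUGGESTIONS = {
    'MemoryError': (
        'The worker may have run out of memory. Try a smaller analysis '
        'extent or a lower-resolution layer.'),
    'TimeLimitExceeded': (
        'The analysis did not finish before the worker was restarted. '
        'The source layer may be too big to process in one run.'),
}

DEFAULT_SUGGESTION = 'No suggestion is available for this error yet.'


def suggestion_for(error_text):
    """Return the first suggestion whose key appears in the error text."""
    for key, suggestion in ERROR_SUGGESTIONS.items():
        if key in (error_text or ''):
            return suggestion
    return DEFAULT_SUGGESTION
```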

codecov[bot] commented 5 years ago

Codecov Report

Merging #523 into 2.8.x will increase coverage by 0.39%. The diff coverage is 66.41%.


@@            Coverage Diff            @@
##            2.8.x    #523      +/-   ##
=========================================
+ Coverage   56.21%   56.6%   +0.39%     
=========================================
  Files          48      49       +1     
  Lines        2576    2703     +127     
  Branches      284     298      +14     
=========================================
+ Hits         1448    1530      +82     
- Misses       1051    1089      +38     
- Partials       77      84       +7
Impacted Files Coverage Δ
celery_app.py 100% <ø> (ø)
__init__.py 100% <100%> (ø) ↑
default_settings.py 100% <100%> (ø) ↑
admin/admin.py 100% <100%> (ø) ↑
tasks/analysis.py 57.92% <29.41%> (-1.42%) ↓
models.py 65.11% <60.86%> (-2.01%) ↓
helpers/suggestions/error_suggestions.py 70.58% <70.58%> (ø)
signals.py 85.93% <0%> (-3.13%) ↓

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4ca115e...cfae77c.

gubuntu commented 5 years ago

nice work @lucernae LGTM

Your inputs, @cgiovando and @timlinux?

cgiovando commented 5 years ago

Thanks @lucernae - this looks good and is definitely an improvement. I'm wondering if we could also try to address the root issue, and better understand and mitigate why all these analyses have been failing. Is it hardware limitations, errors in the source data, or something with the InaSAFE functions?

lucernae commented 5 years ago

@cgiovando This is in no way the fix, but we need to deploy it to get at the root issue. What this PR does is collect, on the prod instance, the relevant information about what is causing the analyses to fail. When I checked the analyses earlier this month, all I could tell was that they were running but being restarted every day for some reason; that's why they never complete. This could have many causes, such as the hardware limitations you mention (memory constraints), or raster data that is too big to process in a day. We have no way of knowing, since the diagnostic data and the worker are purged every day for efficiency and reliability (so that if a memory constraint occurs, the task can still complete by restarting automatically). In other words, we have to restart the analyses with this PR deployed in prod to answer this:

Is it hardware limitations, errors in the source data or something with the InaSAFE functions?

My initial guess is a hardware limitation (the source layer is just too big to process). But the system doesn't have a way to propagate the error if the task is killed while it is still processing (because if it is still processing, it hasn't encountered an error yet). However, if the cause really comes from the code, the task will stop at some point, and we now have a way to persist the error it reports.
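
To illustrate the distinction, here is a minimal, hypothetical sketch (the field and helper names are made up, not the actual geosafe code): an exception raised by the analysis code can be caught and persisted, but a hard kill of the worker never reaches the except block.

```python
# Hypothetical illustration (not the actual geosafe code): a failure raised
# by the analysis code itself can be caught and persisted, but if the worker
# process is killed (OOM killer, daily restart), none of this ever runs.
import traceback


def run_with_diagnostics(analysis, run_func):
    try:
        run_func(analysis)  # raises if the analysis code itself fails
    except Exception:
        # Persist the traceback so it shows up on the detail page,
        # instead of vanishing when the worker is purged.
        analysis.error_traceback = traceback.format_exc()
        analysis.save()
        raise
    # If the process is SIGKILLed mid-run, neither branch executes, so the
    # analysis simply stays "running" with no recorded error.
```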

lucernae commented 5 years ago

In this PR, I was asking for feedback on the layout/UI side of things, since I had to deploy this to prod first anyway. If the layout/UI are OK, the next PR will focus on actually addressing the issue, depending on the cause we find.