hotosm / osma-health

HOT Analytics for Health

Infrastructure outline #2

Closed geohacker closed 6 years ago

geohacker commented 6 years ago

We decided to use AWS Batch as a way to easily manage and schedule TileReduce jobs. @kamicut and @geohacker are working on building an outline.


geohacker commented 6 years ago

@kamicut and I pulled together a full CloudFormation example of using AWS Batch with spot instances and an osmlint task that pulls OSM QA tiles and pushes results to S3. https://github.com/developmentseed/aws-batch-example

We'll base our current work on this. The worker will live in osma-health-worker.

geohacker commented 6 years ago

To outline why we're diverging from the current OSM Analytics cruncher infrastructure:

awright commented 6 years ago

May I ask a question here? Can you explain (in our meeting, which is like 2 minutes away) how and when OSMesa will come into the picture? I am trying to get a sense of the entire flow and architecture of things.

Thanks!

On Mon, Mar 26, 2018 at 3:20 AM, Sajjad Anwar notifications@github.com wrote:

Closed #2 https://github.com/hotosm/osma-health/issues/2.


kamicut commented 6 years ago

@smit1678 @awright @moradology here's the outline for the infrastructure. cc @geohacker Where would you like to add this for documentation? Let us know what you think!

OSM Analytics For Health Infrastructure

Background

OSM Analytics for Health aims to help field-based, academic and governmental organizations to improve their prevention strategies by tracking where the map is incomplete. hotosm/osma-health is a web application developed by HOT and Development Seed to assess the quality and accuracy of OpenStreetMap data.

By combining WorldPop and completely mapped areas in OSM, we can train a model to estimate gaps in building density. We overlay this with other metrics to provide a report of coverage area.

The purpose of this document is to outline the metrics required by the application and the underlying infrastructure that produces them. At a high level our approach involves a periodic generation of vector spatial datasets from primary sources.

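Infrastructure Requirements

  • Periodic updating of metrics
  • Use of AWS technologies
  • Static frontend with minimal API infrastructure

Data requirements

Data sources
Sources of data that will be used to generate the metrics and the map layers

Derived metrics
Metrics displayed alongside an area of interest

  • Overall quality indicator: A qualitative measure of the completeness of the area of interest.
  • Last time of update: When was the report last generated?
  • Estimated population: Population in the area of interest
  • Relative completeness: A quantitative measure of completeness based on WorldPop and building density
  • Attribute completeness: A measure of what percentage of OSM building data is missing tags such as ‘residential building’
  • Recency of edits: A histogram of how fresh the data in the area of interest is
  • Number of duplicate buildings: The number of buildings that were mapped multiple times
  • Logical consistency errors: The number of features that are misaligned or overlapping illogically with other features

Map layers

  • Area of interest geometry: A bounding perimeter around the report area
  • Recency layer: A spatial gradient layer that displays recency of data
  • Completeness layer: A spatial category layer that displays relative completeness

Implementation Overview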

Our approach will be two-fold, using a one-time job to generate the “relative completeness” metric and associated tile layer, alongside periodic jobs to generate the other metrics. The output of these jobs is spatial vector data stored in AWS S3, either in GeoJSON or Mapbox Vector Tile format.

[image: osma-health architecture]

One-time ML job

Azavea is leading the task of building a relative completeness metric for a given area of interest. Given WorldPop and the OSM QA tiles, a machine learning training process will generate a model that can fit population counts to OSM building coverage. It will then output GeoJSON for each tile at zoom 12. These tiles will contain:

  • Estimated population
  • Actual OSM building coverage
  • Expected building coverage
  • A ratio of projected population to the WorldPop estimate

The last ratio is the measure of relative completeness. In perfectly mapped areas, it tends to 1, and in poor coverage areas it is less than 1. This 0 to 1 scale can be used for a heatmap layer.
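To make the tile output above concrete, here is a minimal Python sketch of one zoom-12 tile as a GeoJSON feature. It is an illustration only, not Azavea's implementation: the property names, the `tile_feature` helper, and the clamping of the ratio to the 0 to 1 range are all assumptions. The tile bounds use the standard Web Mercator (slippy map) tile formula.

```python
import math


def tile_bounds(x, y, z):
    """Return (west, south, east, north) in degrees for a Web Mercator tile."""
    n = 2 ** z

    def lon(col):
        return col / n * 360.0 - 180.0

    def lat(row):
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    return lon(x), lat(y + 1), lon(x + 1), lat(y)


def relative_completeness(projected_population, worldpop_estimate):
    """Ratio of projected population to the WorldPop estimate, clamped to [0, 1].

    Clamping is an assumption made here so the value fits a heatmap scale.
    """
    if worldpop_estimate <= 0:
        return None  # no population expected, so the ratio is undefined
    return max(0.0, min(1.0, projected_population / worldpop_estimate))


def tile_feature(x, y, z, projected_population, worldpop_estimate,
                 actual_coverage, expected_coverage):
    """Build a GeoJSON Feature for one tile (property names are hypothetical)."""
    west, south, east, north = tile_bounds(x, y, z)
    ring = [[west, south], [east, south], [east, north],
            [west, north], [west, south]]
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {
            "estimated_population": worldpop_estimate,
            "actual_building_coverage": actual_coverage,
            "expected_building_coverage": expected_coverage,
            "relative_completeness": relative_completeness(
                projected_population, worldpop_estimate),
        },
    }
```

A frontend could then style the heatmap layer directly from the `relative_completeness` property, treating `None` as "no data".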

Periodic batch jobs

For the other metrics, Development Seed is building an AWS Batch pipeline that takes in WorldPop and the OSM QA tiles and generates vector data. The AWS Batch pipeline is triggered weekly using a scheduled AWS Lambda function. At that time, a job will be scheduled for each country that covers the areas of interest. The underlying cluster for the jobs is made up of spot instances that scale up to meet the demands of the batch, then terminate once all jobs in that batch have finished.
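The weekly trigger described above could look roughly like the following Lambda handler sketch. It is not the project's actual code: the country list, the environment variable names, the default queue and job definition names, and the `COUNTRY` container variable are hypothetical placeholders. The only AWS call used is the boto3 Batch client's `submit_job`; the client can be injected so the logic can be exercised without AWS credentials.

```python
import os
from datetime import date

# Hypothetical placeholders; the real list would cover the areas of interest.
COUNTRIES = ["country-a", "country-b"]


def build_job_requests(countries, job_queue, job_definition, run_date):
    """Build one AWS Batch submit_job request per country."""
    return [
        {
            "jobName": f"osma-health-{country}-{run_date}",
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            # Tell the worker container which country to process.
            "containerOverrides": {
                "environment": [{"name": "COUNTRY", "value": country}],
            },
        }
        for country in countries
    ]


def handler(event, context, batch_client=None):
    """Entry point for the scheduled Lambda; returns the submitted job IDs."""
    if batch_client is None:
        import boto3  # imported lazily so the sketch can be tested without it
        batch_client = boto3.client("batch")

    # Scheduled events carry an ISO 8601 "time" field; fall back to today.
    run_date = (event.get("time") or date.today().isoformat())[:10]
    requests = build_job_requests(
        COUNTRIES,
        os.environ.get("JOB_QUEUE", "osma-health-queue"),
        os.environ.get("JOB_DEFINITION", "osma-health-worker"),
        run_date,
    )
    return [batch_client.submit_job(**request)["jobId"] for request in requests]
```

Submitting one job per country keeps a failure in one country from blocking the others, and lets the spot cluster scale with the number of queued jobs.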

The batch jobs will each trigger a series of OSM Lint and aggregation tasks. The HOT organization has forked the osmlint repository to add additional tasks that suit osma-health's purpose.
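As an illustration of what one aggregation task might compute, the sketch below builds a "recency of edits" histogram from building features. This is Python for illustration only; the actual osmlint tasks run on TileReduce in Node.js. It assumes each feature carries a `building` tag and an `@timestamp` property holding the last edit time in seconds since the Unix epoch, which is how OSM QA tiles are understood to expose it; the function name and the per-year bucketing are choices made for this example.

```python
from collections import Counter
from datetime import datetime, timezone


def recency_histogram(features):
    """Count building features by the calendar year (UTC) of their last edit."""
    counts = Counter()
    for feature in features:
        props = feature.get("properties", {})
        # Skip non-buildings and features without an edit timestamp.
        if "building" not in props or "@timestamp" not in props:
            continue
        edited = datetime.fromtimestamp(props["@timestamp"], tz=timezone.utc)
        counts[edited.year] += 1
    return dict(sorted(counts.items()))
```

The per-tile results would then be summed per area of interest before being written to S3.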

Measuring Success

Our measure for success is that this low-maintenance infrastructure allows osma-health to expand beyond the initial metrics. Additionally, the flexibility of this approach allows for integration with the existing OSM Analytics code. By benchmarking the cost and performance of the osma-health infrastructure, we hope to show that it is suitable as an underlying batch framework for the metrics in OSM Analytics.

awright commented 6 years ago

This is great! (I especially like the isometric implementation overview.) I'll defer on (but also think about) where best to put the doc too. Thanks!

On Thu, Mar 29, 2018 at 5:37 PM, Marc Farra notifications@github.com wrote:

@smit1678 @awright @moradology here's the outline for the infrastructure. cc @geohacker Where would you like to add this for documentation? Let us know what you think!

OSM Analytics For Health Infrastructure

Background

OSM Analytics for Health aims to help field-based, academic and governmental organizations to improve their prevention strategies by tracking where the map is incomplete. hotosm/osma-health is a web application developed by HOT and Development Seed to assess the quality and accuracy of OpenStreetMap data.

By combining WorldPop and completely mapped areas in OSM, we can train a model to estimate gaps in building density. We overlay this with other metrics to provide a report of coverage area.

The purpose of this document is to outline the metrics required by the application and the underlying infrastructure that produces them. At a high level our approach involves a periodic generation of vector spatial datasets from primary sources.

Infrastructure Requirements

  • Periodic updating of metrics
  • Use of AWS technologies
  • Static frontend with minimal API infrastructure

Data requirements

Data sources
Sources of data that will be used to generate the metrics and the map layers

Derived metrics
Metrics displayed alongside an area of interest

  • Overall quality indicator: A qualitative measure of the completeness of the area of interest.
  • Last time of update: When was the report last generated?
  • Estimated population: Population in the area of interest
  • Relative completeness: A quantitative measure of completeness based on WorldPop and building density
  • Attribute completeness: A measure of what percentage of OSM building data is missing tags such as ‘residential building’
  • Recency of edits: A histogram of how fresh the data in the area of interest is
  • Number of duplicate buildings: The number of buildings that were mapped multiple times
  • Logical consistency errors: The number of features that are misaligned or overlapping illogically with other features

Map layers

  • Area of interest geometry: A bounding perimeter around the report area
  • Recency layer: A spatial gradient layer that displays recency of data
  • Completeness layer: A spatial category layer that displays relative completeness

Implementation Overview

Our approach will be two-fold, using a one-time job to generate the “relative completeness” metric and associated tile layer, alongside periodic jobs to generate the other metrics. The output of these jobs is spatial vector data stored in AWS S3, either in GeoJSON or Mapbox Vector Tile format.

[image: osma-health architecture] https://camo.githubusercontent.com/61e807c415e5294910a9c0efd6ab64ff8993377e/68747470733a2f2f6b616d696375742d6d6f6e6f736e61702e73332e616d617a6f6e6177732e636f6d2f312e5f626173685f323031382d30332d32385f31372d33342d30392e706e67

One-time ML job

Azavea is leading the task of building a relative completeness metric for a given area of interest. Given WorldPop and the OSM QA tiles, a machine learning training process will generate a model that can fit population counts to OSM building coverage. It will then output GeoJSON for each tile at zoom 12. These tiles will contain:

  • Estimated population
  • Actual OSM building coverage
  • Expected building coverage
  • A ratio of projected population to the WorldPop estimate

The last ratio is the measure of relative completeness. In perfectly mapped areas, it tends to 1, and in poor coverage areas it is less than 1. This 0 to 1 scale can be used for a heatmap layer.

Periodic batch jobs

For the other metrics, Development Seed is building an AWS Batch pipeline that takes in WorldPop and the OSM QA tiles and generates vector data. The AWS Batch pipeline is triggered weekly using a scheduled AWS Lambda function. At that time, a job will be scheduled for each country that covers the areas of interest. The underlying cluster for the jobs is made up of spot instances that scale up to meet the demands of the batch, then terminate once all jobs in that batch have finished.

The batch jobs will each trigger a series of OSM Lint https://github.com/osmlab/osmlint and aggregation tasks. The HOT organization has forked the osmlint repository to add additional tasks that suit osma-health's purpose.

Measuring Success

Our measure for success is that this low-maintenance infrastructure allows osma-health to expand beyond the initial metrics. Additionally, the flexibility of this approach allows for integration with the existing OSM Analytics code. By benchmarking the cost and performance of the osma-health infrastructure, we hope to show that it is suitable as an underlying batch framework for the metrics in OSM Analytics.


geohacker commented 6 years ago

Thank you for outlining this clearly @kamicut!

geohacker commented 6 years ago

Moved this to Readme https://github.com/hotosm/osma-health/blob/master/README.md

kamicut commented 6 years ago

@geohacker @awright I'd love to somehow keep the "Measuring Success" somewhere so that we don't forget about it. It's what will allow for community adoption of this type of infrastructure 😃