New suggestions for labelled ML examples

jlstevens commented 5 years ago

In a recent meeting we (@ebo @jsignell @jbednar) came up with some new ideas for public labelled data that can be applied to public satellite imagery (which mostly implies LANDSAT data).

Good criteria for a task are that 1) all the data can be made public, 2) the labelled features are big enough to spot with LANDSAT, and 3) the features can be easily spotted by a human to evaluate the ML performance. The two most promising suggestions were:

- Using the [National Inventory of Dams Database](https://nid.sec.usace.army.mil/ords/f?p=105:19:30889210318018::NO:::) to mark dams on US imagery. This data has latitude/longitude data so the labels are points. There is one Excel file per state and there are > 90k dams total.
- Labelling lakes using the [Global Lakes and Wetlands Database](https://www.worldwildlife.org/pages/global-lakes-and-wetlands-database) which is polygon data. The GLWD-2 dataset has > 250,000 polygons though this is a global database so I don't know how many fall in the US if we want to focus on that.

Another nice thing about these two datasets is that there is a good chance they are correlated with each other!
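One way to check that correlation would be to test which dam points fall inside a lake polygon. Below is a minimal sketch in plain Python using a ray-casting point-in-polygon test. The dam and lake records are made up for illustration and do not come from either database, and the names `point_in_polygon` and `match_dams_to_lakes` are hypothetical helpers, not part of any library.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon ring."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def match_dams_to_lakes(
    dams: Dict[str, Point], lakes: Dict[str, List[Point]]
) -> Dict[str, Optional[str]]:
    """Map each dam ID to the first lake whose polygon contains it, else None."""
    matches: Dict[str, Optional[str]] = {}
    for dam_id, location in dams.items():
        matches[dam_id] = next(
            (lake_id for lake_id, ring in lakes.items()
             if point_in_polygon(location, ring)),
            None,
        )
    return matches


# Made-up example data (assumption: NOT real NID or GLWD records).
lakes = {
    "lake_a": [(-100.0, 40.0), (-99.9, 40.0), (-99.9, 40.1), (-100.0, 40.1)],
    "lake_b": [(-90.0, 35.0), (-89.8, 35.0), (-89.9, 35.2)],
}
dams = {
    "dam_1": (-99.95, 40.05),  # placed inside lake_a
    "dam_2": (-89.9, 35.05),   # placed inside lake_b
    "dam_3": (-80.0, 30.0),    # far from any lake
}

matches = match_dams_to_lakes(dams, lakes)
matched_fraction = sum(m is not None for m in matches.values()) / len(matches)
print(matches, matched_fraction)
```

Real dams usually sit on the shoreline rather than inside the water body, so an actual comparison would probably need a distance tolerance around each polygon, plus a spatial index to cope with 90k points against 250k polygons.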

ebo commented 5 years ago

Thank you Jean-Luc for the post.

A couple of notes about the GLWD: it has been about 5 years since I worked with this dataset. The DB was a bit dated then, but it was very good and fairly exhaustively captured every lake over about 0.1 km^2 (if I recall correctly). You should find all but the smallest dams in the database, but the areal extent can be off for lakes which are shrinking or growing. I also thought it had a field that identifies a lake as a dam, so I hope that you would get a near-perfect correlation in the IDs.
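To relate that size threshold to the "big enough to spot with LANDSAT" criterion, here is a rough back-of-the-envelope sketch. It assumes the 30 m pixel size of the Landsat multispectral bands, and the 0.1 km^2 figure is the recalled value from above, not a checked one. `landsat_pixels` is a hypothetical helper name.

```python
LANDSAT_PIXEL_SIZE_M = 30.0  # assumption: 30 m multispectral bands


def landsat_pixels(area_km2: float, pixel_size_m: float = LANDSAT_PIXEL_SIZE_M) -> float:
    """Approximate number of pixels covered by a feature of the given area."""
    area_m2 = area_km2 * 1_000_000   # 1 km^2 = 1,000,000 m^2
    return area_m2 / pixel_size_m ** 2  # each pixel covers 900 m^2 at 30 m


smallest_lake_pixels = landsat_pixels(0.1)
print(f"A 0.1 km^2 lake covers roughly {smallest_lake_pixels:.0f} pixels")
```

That works out to a little over 100 pixels, so even the smallest lakes in the database should be visible as a patch of roughly 10 x 10 pixels if they are compact.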

Hope that helps.

EBo --

On Jul 24 2019 12:38 PM, Jean-Luc Stevens wrote:

> In a recent meeting we (@ebo @jsignell @jbednar) came up with some new ideas for public labelled data that can be applied to public satellite imagery (which mostly implies LANDSAT data).
>
> Good criteria for a task are that 1) all the data can be made public 2) the labelled features are big enough to spot with LANDSAT 3) the features can be easily spotted by a human to evaluate the ML performance. The two most promising suggestions were:
>
> Using the [National Inventory of Dams Database](https://nid.sec.usace.army.mil/ords/f?p=105:19:30889210318018::NO:::) to mark dams on US imagery. This data has latitude/longitude data so the labels are points. There is one excel file per state and there are > 90k dams total.
>
> Labelling lakes using the [Global Lakes and Wetlands Database](https://www.worldwildlife.org/pages/global-lakes-and-wetlands-database) which is polygon data. The GLWD-2 dataset has > 250,000 polygons though this is a global database so I don't know how many fall in the US if we want to focus on that.
>
> Another nice thing about these two datasets is that there is a good chance they are correlated with each other!

jbednar commented 3 years ago

We ended up with a simple example; see https://examples.pyviz.org/landuse_classification/Image_Classification.html
