The data and code in this repository allow users to generate the figures appearing in the main text of the paper Combining satellite imagery and machine learning to predict poverty (except for Figure 2, which is constructed from specific satellite images). Paper figures may differ aesthetically due to post-processing.
Code was written in R 3.2.4 and Python 2.7.
Users of these data should cite Jean, Burke, et al. (2016). If you find an error or have a question, please submit an issue.
We are no longer maintaining this project, but will link to related projects as we learn of them.
PyTorch implementation: https://github.com/jmather625/predicting-poverty-replication
R
The user can run the following command to automatically install the required R packages:
install.packages(c('R.utils', 'magrittr', 'foreign', 'raster', 'readstata13', 'plyr', 'RColorBrewer', 'sp', 'lattice', 'ggplot2', 'grid', 'gridExtra'), dependencies = T)
Python
Caffe and pycaffe
We recommend using the open data science platform Anaconda.
Due to data access agreements, users need to independently download data files from the World Bank's Living Standards Measurement Surveys and the Demographic and Health Surveys websites. These two data sources require the user to fill in a Data User Agreement form. In the case of the DHS data, the user is also required to register for an account.
For all data processing scripts, the user needs to set the working directory to the repository root folder.
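For example, a Python script or interactive session can be pointed at the repository root before any processing (a minimal sketch; the path below is a placeholder, not part of the repository):

```python
import os

# Placeholder path: replace with the location of your clone of this repository.
os.chdir('/path/to/predicting-poverty')
print(os.getcwd())
```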
UPDATE (08/02/2017): The LSMS website appears to have recently removed two files from its database that contain crucial consumption aggregates for Uganda 2011-12 and Malawi 2013. Since we are not at liberty to share those files ourselves, this inhibits replication of the consumption analysis in those countries. We have reached out to the LSMS and will update this page according to their response.
UPDATE (08/03/2017): The LSMS has informed us these files were inadvertently removed and will be restored unchanged as soon as possible.
3. Unzip these files so that **data/input/LSMS** contains the following folders of data:
1. UGA_2011_UNPS_v01_M_STATA
2. TZA_2012_LSMS_v01_M_STATA_English_labels
3. DATA (formerly NGA_2012_LSMS_v03_M_STATA before a re-upload in January 2016)
4. MWI_2013_IHPS_v01_M_STATA
Download DHS data
Unzip these files so that data/input/DHS contains the following folders of data:
(Note that the names of these folders may vary slightly depending on the download date.)
Download the parameters of the trained CNN model here and save them in the model directory.
Generate candidate locations to download using get_image_download_locations.py. This generates locations for 1x1 km RGB satellite images of size 400x400 pixels. For most countries, about 100 image locations are generated in a 10x10 km area around each cluster; for Nigeria and Tanzania, we generate about 25 evenly spaced points in the 10x10 km area. Running this produces, for each country and each dataset, a file named candidate_download_locs.txt in which every line has the format:
[image_lat] [image_long] [cluster_lat] [cluster_long]
For example, a line in this file may be
4.163456 6.083456 4.123456 6.123456
Note that this step requires GDAL and that DownloadPublicData.R must have been run first.
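For illustration only, here is a minimal Python sketch of how evenly spaced candidate locations around a cluster could be generated and written in the format above. This is not the actual logic of get_image_download_locations.py (which relies on GDAL); the helper name and the flat-earth degree conversion are assumptions.

```python
import numpy as np

def candidate_locations(cluster_lat, cluster_lon, n_per_side=10, box_km=10.0):
    """Hypothetical helper: evenly spaced (lat, lon) image centers in a
    box_km x box_km box around the cluster. Uses the rough approximation
    1 degree of latitude ~ 111 km and ignores longitude convergence."""
    half_deg = (box_km / 2.0) / 111.0
    lats = np.linspace(cluster_lat - half_deg, cluster_lat + half_deg, n_per_side)
    lons = np.linspace(cluster_lon - half_deg, cluster_lon + half_deg, n_per_side)
    return [(lat, lon) for lat in lats for lon in lons]

# n_per_side=10 gives ~100 locations per cluster; n_per_side=5 gives the
# ~25-point grids described above for Nigeria and Tanzania.
with open('candidate_download_locs.txt', 'w') as f:
    for lat, lon in candidate_locations(4.123456, 6.123456):
        f.write('%.6f %.6f %.6f %.6f\n' % (lat, lon, 4.123456, 6.123456))
```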
Download imagery from locations of interest (e.g., cluster locations from the Nigeria DHS survey). In this process, each successfully downloaded image must get a corresponding line in an output metadata file named downloaded_locs.txt (e.g., data/output/DHS/nigeria/downloaded_locs.txt). There is one such metadata file per country. Each line of the metadata file must have the format:
[absolute path to image] [image_lat] [image_long] [cluster_lat] [cluster_long]
For example, a line in this file may be
/abs/path/to/img.jpg 4.163456 6.083456 4.123456 6.123456
Note that the last 4 fields in each line should be copied from the candidate_download_locs.txt file for each (country, dataset) pair.
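As a hedged sketch of this bookkeeping (the image naming scheme below is a made-up assumption, not the repository's), the four coordinate fields are carried over verbatim from the candidate file for every image that actually downloaded:

```python
import os

def record_downloads(candidate_path, image_dir, out_path):
    """Append one metadata line per successfully downloaded image:
    [absolute path to image] [image_lat] [image_long] [cluster_lat] [cluster_long]
    Assumes a hypothetical <image_lat>_<image_long>.jpg naming scheme."""
    with open(candidate_path) as cands, open(out_path, 'w') as out:
        for line in cands:
            image_lat, image_lon, cluster_lat, cluster_lon = line.split()
            img = os.path.join(image_dir, '%s_%s.jpg' % (image_lat, image_lon))
            if os.path.exists(img):  # record only images that downloaded
                out.write('%s %s %s %s %s\n' % (
                    os.path.abspath(img), image_lat, image_lon,
                    cluster_lat, cluster_lon))
```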
Extract cluster features from satellite images using extract_features.py. This requires installing Caffe and pycaffe (see Caffe Installation) and may also require adding pycaffe to your PYTHONPATH. In each country's data folder (e.g., data/output/DHS/nigeria/) we save two NumPy arrays: conv_features.npy and image_counts.npy. This process is much faster on a capable GPU, with GPU=True set in extract_features.py.
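Once extraction finishes, the saved arrays can be inspected with NumPy. A minimal example, assuming the Nigeria DHS output folder from above; the comments reflect our reading of these outputs, which is an assumption rather than something stated in the step itself:

```python
import numpy as np

features = np.load('data/output/DHS/nigeria/conv_features.npy')
counts = np.load('data/output/DHS/nigeria/image_counts.npy')

# Assumed layout: one row of CNN features per cluster, with image_counts
# recording how many downloaded images contributed to each cluster.
print(features.shape, counts.shape)
```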
For all data processing scripts, the user needs to set the working directory to the repository root folder. To reproduce the figures, the user does not need to rerun the data processing scripts or the image feature extraction process (steps 1-2 for Fig. 1, steps 1-6 for Figs. 3-5).
To generate Figure 1, the user needs to run
To generate Figure 3, the user needs to run
To generate Figure 4, the user needs to run
To generate Figure 5, the user needs to run