petewarden / dstk

A collection of the best open data sets and open-source tools for data science
http://www.datasciencetoolkit.org/
1.13k stars 184 forks source link

New ec2setup.txt suggestion and hints... #60

Open pjm4github opened 9 years ago

pjm4github commented 9 years ago

This is my version of ec2setup.txt that I modified to work on my own home grown Ubuntu 12.04 LTS instance.

Start with AMI # ami-3fec7956 (Ubuntu 12.04), 32GB (ec2-run-instances ami-3fec7956 -t m1.large --region us-east-1 -z us-east-1d --block-device-mapping /dev/sda1=:32:false -k )

sudo apt-add-repository -y ppa:olivier-berten/geo sudo add-apt-repository -y ppa:webupd8team/java sudo aptitude update sudo aptitude safe-upgrade -y sudo aptitude full-upgrade -y sudo aptitude install -y build-essential apache2 apache2.2-common apache2-mpm-prefork apache2-utils libexpat1 ssl-cert postgresql libpq-dev ruby1.8-dev ruby1.8 ri1.8 rdoc1.8 irb1.8 libreadline-ruby1.8 libruby1.8 libopenssl-ruby sqlite3 libsqlite3-ruby1.8 git-core libcurl4-openssl-dev apache2-prefork-dev libapr1-dev libaprutil1-dev subversion postgresql-9.1-postgis autoconf libtool libxml2-dev libbz2-1.0 libbz2-dev libgeos-dev proj-bin libproj-dev ocropus pdftohtml catdoc unzip ant openjdk-6-jdk lftp php5-cli rubygems flex postgresql-server-dev-9.1 proj libjson0-dev xsltproc docbook-xsl docbook-mathml gettext postgresql-contrib-9.1 pgadmin3 python-software-properties bison dos2unix sudo aptitude install -y oracle-java7-installer sudo aptitude install -y libgdal-dev sudo aptitude install -y libgeos++-dev sudo bash -c 'echo "/usr/lib/jvm/java-7-oracle/jre/lib/amd64/server" > /etc/ld.so.conf.d/jvm.conf' sudo ldconfig

Note that here you should create a new user called ubuntu (I used my own user and had to modify various scripts and config files which is described below)

mkdir ~/sources cd ~/sources wget http://download.osgeo.org/postgis/source/postgis-2.0.3.tar.gz tar xfvz postgis-2.0.3.tar.gz cd postgis-2.0.3 ./configure --with-gui

./configure --with-gui --without-topology

If the GEO version is incorrect then perform the following steps:

wget http://download.osgeo.org/geos/geos-3.3.8.tar.bz2 tar xjf geos-3.3.8.tar.bz2 cd geos-3.3.8 ./configure make sudo make install cd ~/sources/postgis-2.0.3 ./configure --with-gui

Note that the above steps didnt work. It appears that there should be a way to setup the load libraries correctly but I gave up.

otherwise continue here:

make sudo make install sudo ldconfig sudo make comments-install

sudo sed -i "s/ident/trust/" /etc/postgresql/9.1/main/pg_hba.conf sudo sed -i "s/md5/trust/" /etc/postgresql/9.1/main/pg_hba.conf sudo sed -i "s/peer/trust/" /etc/postgresql/9.1/main/pg_hba.conf sudo /etc/init.d/postgresql restart createdb -U postgres geodict

sudo -u postgres createdb template_postgis sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/postgis.sql sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/spatial_ref_sys.sql sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/postgis_comments.sql sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/rtpostgis.sql sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/raster_comments.sql sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/topology.sql sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/topology_comments.sql sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/legacy.sql sudo -u postgres psql -d template_postgis -f /usr/share/postgresql/9.1/contrib/postgis-2.0/legacy_gist.sql

cd ~/sources git clone git://github.com/petewarden/dstk.git git clone git://github.com/petewarden/dstkdata.git cd dstk sudo gem install bundler sudo bundle install

cd ~/sources/dstkdata

If you want to save disk space and don't need geo-statistics, you can skip everything

up until the comment indicating the end of the geostats loading.

I SKIPPED TO %%%%%%%%% BELOW

createdb -U postgres -T template_postgis statistics

tar xzf statistics/gl_gpwfe_pdens_15_bil_25.tar.gz export PATH=$PATH:/usr/lib/postgresql/9.1/bin/ /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I gl_gpwfe_pdens_15_bil_25/glds15ag.bil public.population_density | psql -U postgres -d statistics rm -rf gl_gpwfe_pdens_15_bil_25 unzip statistics/glc2000_v1_1_Tiff.zip /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I Tiff/glc2000_v1_1.tif public.land_cover | psql -U postgres -d statistics rm -rf Tiff

sudo mkdir /mnt/data sudo chown pjm /mnt/data cd /mnt/data

The zip files are here: http://gis-lab.info/data/srtm-tif/, or here http://srtm.csi.cgiar.org/ or here https://hc.app.box.com/shared/1yidaheouv password = ThanksCSI!

sudo curl -O "http://static.datasciencetoolkit.org.s3-website-us-east-1.amazonaws.com/SRTM_NE_250m.tif.zip"

unzip SRTM_NE_250m.tif.zip

I got the TIF files from here instead!

sudo curl -O "https://hc.box.net/shared/1yidaheouv/SRTM_SE_250m_TIF.rar" unrar SRTM_NE_250m_TIF.rar /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 SRTM_NE_250m.tif public.elevation | psql -U postgres -d statistics rm -rf SRTM_NE_250m curl -O "http://static.datasciencetoolkit.org.s3-website-us-east-1.amazonaws.com/SRTM_W_250m.tif.zip" unzip SRTM_W_250m.tif.zip /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -a SRTM_W_250m.tif public.elevation | psql -U postgres -d statistics rm -rf unzip SRTM_W_250m curl -O "http://static.datasciencetoolkit.org.s3-website-us-east-1.amazonaws.com/SRTM_SE_250m.tif.zip" unzip SRTM_SE_250m.tif.zip /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -a -I SRTM_SE_250m.tif public.elevation | psql -U postgres -d statistics rm -rf SRTM_SE_250m*

curl -O "http://static.datasciencetoolkit.org.s3-website-us-east-1.amazonaws.com/tmean_30s_bil.zip" unzip tmean_30s_bil.zip /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_1.bil public.mean_temperature_01 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_2.bil public.mean_temperature_02 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_3.bil public.mean_temperature_03 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_4.bil public.mean_temperature_04 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_5.bil public.mean_temperature_05 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_6.bil public.mean_temperature_06 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_7.bil public.mean_temperature_07 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_8.bil public.mean_temperature_08 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_9.bil public.mean_temperature_09 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_10.bil public.mean_temperature_10 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_11.bil public.mean_temperature_11 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I tmean_12.bil public.mean_temperature12 | psql -U postgres -d statistics rm -rf tmean*

curl -O "http://static.datasciencetoolkit.org.s3-website-us-east-1.amazonaws.com/prec_30s_bil.zip" unzip prec_30s_bil.zip /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_1.bil public.precipitation_01 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_2.bil public.precipitation_02 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_3.bil public.precipitation_03 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_4.bil public.precipitation_04 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_5.bil public.precipitation_05 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_6.bil public.precipitation_06 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_7.bil public.precipitation_07 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_8.bil public.precipitation_08 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_9.bil public.precipitation_09 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_10.bil public.precipitation_10 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_11.bil public.precipitation_11 | psql -U postgres -d statistics /usr/lib/postgresql/9.1/bin/raster2pgsql -s 4236 -t 32x32 -I prec_12.bil public.precipitation12 | psql -U postgres -d statistics rm -rf prec*

unzip /home/pjm/sources/dstkdata/statistics/us_statisticsrasters.zip -d . for f in .tif; do raster2pgsql -s 4236 -t 32x32 -I $f basename $f .tif | psql -U postgres -d statistics; done rm -rf us_ rm -rf metadata

This is the end of the geostats loading, continue from here if you decide to skip that part.

%%%%%%%% START HERE AGAIN

sudo gem install passenger sudo passenger-install-apache2-module

You'll need to update the version number below to match whichever actual passenger version was installed

This is what the build said:

LoadModule passenger_module /var/lib/gems/1.8/gems/passenger-5.0.18/buildout/apache2/mod_passenger.so

PassengerRoot /var/lib/gems/1.8/gems/passenger-5.0.18

PassengerDefaultRuby /usr/bin/ruby1.8

I changed the passenger version in the lines below to match what was found from the lines above:

sudo bash -c 'echo "LoadModule passenger_module /var/lib/gems/1.8/gems/passenger-5.0.18/buildout/apache2/modpassenger.so" > /etc/apache2/mods-enabled/passenger.load' sudo bash -c 'echo "PassengerRoot /var/lib/gems/1.8/gems/passenger-5.0.18" > /etc/apache2/mods-enabled/passenger.conf' sudo bash -c 'echo "PassengerRuby /usr/bin/ruby1.8" >> /etc/apache2/mods-enabled/passenger.conf' sudo bash -c 'echo "PassengerMaxPoolSize 3" >> /etc/apache2/mods-enabled/passenger.conf' sudo sed -i "s/MaxRequestsPerChild[ \t][ \t][0-9][0-9]_/MaxRequestsPerChild 20/" /etc/apache2/apache2.conf

I needed to change the DocumentRoot to match the actual location where the data was installed. In my case the sources directory was /home/pjm/sources instead of /home/ubuntu/sources.

Ideally there should have been a new user called ubuntu but I didnt know about this until I was too far into the process.

sudo bash -c 'echo " <VirtualHost *:8000> ServerName 127.0.1.1 DocumentRoot /home/pjm/sources/dstk/public RewriteEngine On RewriteCond %{HTTPHOST} ^datasciencetoolkit.org$ [NC] RewriteRule ^(.)$ http://www.datasciencetoolkit.org$1 [R=301,L] RewriteCond %{HTTPHOST} ^datasciencetoolkit.com$ [NC] RewriteRule ^(.)$ http://www.datasciencetoolkit.com$1 [R=301,L] <Directory /home/pjm/sources/dstk/public> AllowOverride all Options -MultiViews " > /etc/apache2/sites-enabled/000-default' sudo ln -s /etc/apache2/mods-available/rewrite.load /etc/apache2/mods-enabled/rewrite.load

sudo /etc/init.d/apache2 restart

sudo gem install postgres -v '0.7.9.2008.01.28'

cd ~/sources/dstk ./populate_database.rb

cd ~/sources mkdir maxmind cd maxmind wget "http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz" gunzip GeoLiteCity.dat.gz wget "http://geolite.maxmind.com/download/geoip/api/c/GeoIP.tar.gz" tar xzvf GeoIP.tar.gz cd GeoIP-1.4.8/ libtoolize -f ./configure make sudo make install cd .. svn checkout svn://rubyforge.org/var/svn/net-geoip/trunk net-geoip cd net-geoip/ ruby ext/extconf.rb make sudo make install

cd ~/sources wget http://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.11.tar.gz tar -xvzf libiconv-1.11.tar.gz cd libiconv-1.11 ./configure --prefix=/usr/local/libiconv make sudo make install sudo ln -s /usr/local/libiconv/lib/libiconv.so.2 /usr/lib/libiconv.so.2

createdb -U postgres -T template_postgis reversegeo

cd ~/sources git clone git://github.com/petewarden/osm2pgsql cd osm2pgsql/ ./autogen.sh sed -i 's/version = BZ2_bzlibVersion();//' configure sed -i 's/version = zlibVersion();//' configure ./configure make sudo make install cd ..

osm2pgsql -U postgres -d reversegeo -p world_countries -S osm2pgsql/styles/world_countries.style dstkdata/world_countries.osm -l osm2pgsql -U postgres -d reversegeo -p admin_areas -S osm2pgsql/styles/admin_areas.style dstkdata/admin_areas.osm -l osm2pgsql -U postgres -d reversegeo -p neighborhoods -S osm2pgsql/styles/neighborhoods.style dstkdata/neighborhoods.osm -l

The above commands take several hours to complete

I started the next set of commands in a new window...

cd ~/sources git clone git://github.com/petewarden/boilerpipe cd boilerpipe/boilerpipe-core/ ant cd src javac -cp ../dist/boilerpipe-1.1-dev.jar boilerpipe.java

cd ~/sources/dstk/ psql -U postgres -d reversegeo -f sql/loadukpostcodes.sql

osm2pgsql -U postgres -d reversegeo -p uk_osm -S ../osm2pgsql/default.style ../dstkdata/uk_osm.osm.bz2 -l

psql -U postgres -d reversegeo -f sql/buildukindexes.sql

cd ~/sources git clone git://github.com/geocommons/geocoder.git cd geocoder make sudo make install

Build the latest Tiger/Line data for US address lookups

cd /mnt/data mkdir tigerdata cd tigerdata lftp ftp2.census.gov:/geo/tiger/TIGER2012/EDGES mirror --parallel=5 . cd ../FEATNAMES mirror --parallel=5 . cd ../ADDR mirror --parallel=5 . exit cd ~/sources/geocoder/build/ mkdir ../../geocoderdata/ ./tiger_import ../../geocoderdata/geocoder2012.db /mnt/data/tigerdata/

Completed to here

cd ~/sources git clone git://github.com/luislavena/sqlite3-ruby.git cd sqlite3-ruby ruby setup.rb config ruby setup.rb setup sudo ruby setup.rb install

cd ~/sources/geocoder bin/rebuild_metaphones ../geocoderdata/geocoder2012.db chmod +x build/build_indexes build/build_indexes ../geocoderdata/geocoder2012.db rm -rf /mnt/data/tigerdata

createdb -U postgres names cd /mnt/data curl -O "http://www.ssa.gov/oact/babynames/names.zip" dos2unix yob*.txt ~/sources/dstk/dataconversion/analyzebabynames.rb . > babynames.csv psql -U postgres -d names -f ~/sources/dstk/sql/loadnames.sql

Fix for postgres crashes,

sudo sed -i "s/shared_buffers = [0-9A-Za-z]*/shared_buffers = 512MB/" /etc/postgresql/9.1/main/postgresql.conf sudo sysctl -w kernel.shmmax=576798720 sudo bash -c 'echo "kernel.shmmax=576798720" >> /etc/sysctl.conf' sudo bash -c 'echo "vm.overcommit_memory=2" >> /etc/sysctl.conf' sudo sed -i "s/max_connections = 100/max_connections = 200/" /etc/postgresql/9.1/main/postgresql.conf sudo /etc/init.d/postgresql restart

Remove files not needed at runtime

rm -rf /mnt/data/* rm -rf ~/sources/libiconv-1.11.tar.gz rm -rf ~/sources/postgis-2.0.3.tar.gz cd ~/sources/ mkdir dstkdata_runtime mv dstkdata/ethnicityofsurnames.csv dstkdata_runtime/ mv dstkdata/GeoLiteCity.dat dstkdata_runtime/ rm -rf dstkdata mv dstkdata_runtime dstkdata

Up to this point, you'll have a 0.50 version of the toolkit.

The following will upgrade you to a 0.51 version

cd ~/sources/dstk git pull origin master

I found that the toolkit wass already uptodate

TwoFishes geocoder

cd ~/sources mkdir twofishes cd twofishes mkdir bin curl "http://www.twofishes.net/binaries/latest.jar" > bin/twofishes.jar mkdir data

The source link above is obsolete

curl "http://www.twofishes.net/indexes/revgeo/latest.zip" > data/twofishesdata.zip

This one might work... its unknown what was in latest.zip versus 2015-03-05.zip

curl "http://www.twofishes.net/indexes/revgeo/2015-03-05.zip" > data/twofishesdata.zip

The ~/sources/dstk/twofishd.sh must be edited to point to the new directory.

change

java -Xmx1500M -jar /home/ubuntu/sources/twofishes/bin/twofishes.jar --hfile_basepath /home/ubuntu/sources/twofishes/data/latest/

to this

java -Xmx1500M -jar /home/pjm/sources/twofishes/bin/twofishes.jar --hfile_basepath /home/pjm/sources/twofishes/data/2015-03-05-20-05-30.753698/

The entire ~/sources/dstk/ directory should be check to see if there is any reference to /home/ubuntu and renamed to point to /home/pjm instead

I looked through the dstk and found several instances like this:

cd ~/sources/dstk

grep '/home/ubuntu' *

geodict_daemon.rb:Daemons.run('/home/ubuntu/sources/dstk/dstk_server.rb', {

twofishes.conf:exec start-stop-daemon --start -c root --exec /home/ubuntu/sources/dstk/twofishesd.sh

twofishesd.sh:java -Xmx1500M -jar /home/ubuntu/sources/twofishes/bin/twofishes.jar --hfile_basepath /home/ubuntu/sources/twofishes/data/latest/

cd data unzip twofishesdata.zip

sudo cp ~/sources/dstk/twofishes.conf /etc/init/twofishes.conf sudo service twofishes start

Here is what the VirtualHost field looks like already

sudo bash -c 'echo "

<VirtualHost *:8000>

ServerName 127.0.1.1

DocumentRoot /home/pjm/sources/dstk/public

RewriteEngine On

RewriteCond %{HTTP_HOST} ^datasciencetoolkit.org$ [NC]

RewriteRule ^(.*)$ http://www.datasciencetoolkit.org$1 [R=301,L]

RewriteCond %{HTTP_HOST} ^datasciencetoolkit.com$ [NC]

RewriteRule ^(.*)$ http://www.datasciencetoolkit.com$1 [R=301,L]

<Directory /home/pjm/sources/dstk/public>

AllowOverride all

Options -MultiViews

" > /etc/apache2/sites-enabled/000-default'

This will be changed to this now:

sudo bash -c 'echo " <VirtualHost *:8000> ServerName 127.0.1.1 DocumentRoot /home/pjm/sources/dstk/public RewriteEngine On RewriteCond %{HTTPHOST} ^datasciencetoolkit.org$ [NC] RewriteRule ^(.)$ http://www.datasciencetoolkit.org$1 [R=301,L] RewriteCond %{HTTPHOST} ^datasciencetoolkit.com$ [NC] RewriteRule ^(.)$ http://www.datasciencetoolkit.com$1 [R=301,L]

We have an internal TwoFishes server running on port 8081, so redirect

  # requests that look like they belong to its API
  ProxyPass /twofishes http://localhost:8081  
  <Directory /home/pjm/sources/dstk/public>
     AllowOverride all
     Options -MultiViews
     Header set Access-Control-Allow-Origin "*"
     Header set Cache-Control "max-age=86400"
  </Directory>

" > /etc/apache2/sites-enabled/000-default' sudo ln -s /etc/apache2/mods-available/rewrite.load /etc/apache2/mods-enabled/rewrite.load sudo ln -s /etc/apache2/mods-available/proxy.load /etc/apache2/mods-enabled/proxy.load sudo ln -s /etc/apache2/mods-available/proxy_http.load /etc/apache2/mods-enabled/proxy_http.load sudo ln -s /etc/apache2/mods-available/headers.load /etc/apache2/mods-enabled/headers.load

sudo /etc/init.d/apache2 restart

I now go to http://192.168.0.5:8000 and I get the datasciencetoolkit webpage along with all the tools!! Nice!!

chapmanjacobd commented 5 years ago

I think something may be wrong with DSTK including this guide...

The SRID of all the imported datasets is 4236 but it should probably be 4326. The 3 and the 2 are switched.