To use nppes-postgres in my environment I had to make a few changes. I don't think they are broadly useful enough for a pull request, so I am sharing them as a comment. Thank you @semerj for sharing the basic technique.
1) Modified the shell script to run against a remote Postgres instance. Note: this is specific to my environment, and it is not a good idea to load remotely unless you have to! You have to transfer all 2.8GB of uncompressed NPPES data over the wire, which takes some time even on a fast network.
2) Added some indexes. Note that these are upper() expression indexes. The npi column is indexed both as the numeric primary key and as a varchar; my use case has me joining to another table where NPI data is stored as a varchar.
1) -
#!/bin/sh
#TODO set this so that it just replicates the psql commands
# ./create_npi_db.sh [USERNAME] [DATABASE] [NPI-DATA] [TAXONOMY-DATA] [HOST]
#wget the nppes fulls to ./data
#wget the nucc taxonomy fulls to ./data
# clean data (per Readme.md)
# Replace empty "" integer fields in NPI CSV file
# Interesting fact: this takes the NPPES file from 5.5GB down to 2.8GB
sed 's/""//g' "$3" > cleaned_npi.csv
# change taxonomy per readme.md
# Convert taxonomy data to utf-8 and delimiter to "|"
# csvformat took some extra effort on OS X: I had to install pip and then csvkit
iconv -c -t utf8 "$4" | csvformat -T > taxonomy.tab
psql -U "$1" -d "$2" -h "$5" -f nppes_tables.sql
# Had to change these to pipe the data over stdin:
# when you run COPY remotely, it executes on the server, so a client-local file isn't readable there
cat taxonomy.tab | psql -U "$1" -d "$2" -h "$5" -c "COPY taxonomy FROM STDIN WITH CSV HEADER DELIMITER AS E'\t';"
# Doing this remotely is SLOW!
cat cleaned_npi.csv | psql -U "$1" -d "$2" -h "$5" -c "COPY npi FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS '';"
psql -U "$1" -d "$2" -h "$5" -f nppes_indexes.sql
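An alternative that avoids the cat pipes: psql's \copy meta-command reads the file on the client and streams it over the connection, so it also works against a remote server. A minimal sketch, reusing the same positional arguments as above:

# \copy behaves like COPY FROM STDIN but lets psql open the local file itself
psql -U "$1" -d "$2" -h "$5" -c "\copy taxonomy FROM 'taxonomy.tab' WITH (FORMAT csv, HEADER, DELIMITER E'\t')"
psql -U "$1" -d "$2" -h "$5" -c "\copy npi FROM 'cleaned_npi.csv' WITH (FORMAT csv, HEADER, NULL '')"

Either way the data crosses the network once per load, so the speed caveat above still applies.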
2) -
--constraints
--make npi PK
ALTER TABLE npi ADD CONSTRAINT npi_pkey PRIMARY KEY (npi);
--indexes on npi
--an index on npi itself is implicit: the primary key creates one
--index npi as a varchar since other systems represent npis as strings
CREATE INDEX npi_npi_as_string ON npi (cast(npi as varchar));
--replacement npi
CREATE INDEX npi_replacement ON npi (replacement_npi);
--last name (upper)
CREATE INDEX npi_upper_last ON npi (upper(provider_last_name));
--first and last name (composite, upper)
CREATE INDEX npi_upper_first_last ON npi (upper(provider_last_name), upper(provider_first_name));
--entity_type
CREATE INDEX npi_entity_type ON npi (entity_type_code);
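One caveat with expression indexes: Postgres will only use them when the query repeats the indexed expression. A minimal sketch of the kind of join I had in mind, where claims is a hypothetical table (not part of nppes-postgres) that stores NPIs as varchar:

-- The join condition matches cast(npi as varchar), so npi_npi_as_string applies;
-- the upper(...) predicate matches npi_upper_last the same way.
EXPLAIN
SELECT c.claim_id, n.provider_last_name
FROM claims c
JOIN npi n ON cast(n.npi AS varchar) = c.npi
WHERE upper(n.provider_last_name) = 'SMITH';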