ETL to retrieve 'Sexual Health Information and Support Services', and 'Chlamydia Screening for Under 25s' data from syndication and store as JSON
Clone the repo: git clone https://github.com/nhsuk/sexual-health-information-etl.git
and review the scripts
to get up and running.
The application uses Docker to run in containers. Development is typically done on the host machine. Files are loaded into the container and changes are automatically updated.
Use the test
script for continuous testing during development.
In order for the process to access the syndication feed an API key is required.
Details of registration are available on
NHS Choices.
The application needs the API key available within the environment as the variable SYNDICATION_API_KEY
.
The output is uploaded to Azure Blob Storage, a suitable connection string should be set in the AZURE_STORAGE_CONNECTION_STRING
variable.
For further details see Azure Blob Storage.
The ETL calls the modifiedsince
end point to determine if the ETL needs to be run.
If the endpoint returns a 404 error, i.e page not found, there is no updated data and the ETL will not run.
If a 200
code is returned the data is considered to have changed, and all data will be reloaded via the all
end point.
All the data is reloaded when a change is made to the Sexual Health Services data, and the ID for a record may change when reloaded.
An incremental update approach is not possible due to IDs not being consistent.
The date of the most recently modified file in blob storage beginning shis-data
* is used as the last run date.
If no file is present the last known good run date of 20/02/2018
will be used.
This date may be changed by the setting enviroment variable INITIAL_LAST_RUN_DATE
.
The ETL version from the package.json is included along with a datestamp to enable a full rescan if the data structure changes, as this will be reflected in the application version.
Once the initial scan is complete, failed records will be revisited. IDs for records still failing after the second attempt
are listed in a shis-data-summary.json
* file.
Running scripts/start
will bring up a docker container and initiate the scrape at a scheduled time, GMT. The default is
11pm. The time of the scrape can be overridden by setting the environment variable ETL_SCHEDULE
.
e.g. export ETL_SCHEDULE='25 15 * * *'
will start the processing at 3:25pm.
Note: the container time is GMT and does not take account of daylight saving, you may need to subtract an hour from the time if
it is currently BST.
During local development it is useful to run the scrape at any time. This is possible by running node app.js
(with the appropriate env vars set).
Further details on node-schedule available here
A successful scrape will result in the file shis-data.json
* being written to the output
folder and to the Azure Blob Storage
location specified in the environmental variables.
The files uploaded to Azure Blob Storage are:
shis-data-summary-YYYYMMDD-VERSION.json
*
shis-data-YYYYMMDD-VERSION.json
*
shis-data.json
*
where YYYYMMDD
is the current year, month and date, and VERSION
is the current major version of the ETL as defined in the package.json
.
The ETL may be configured to collect data from the 'Sexual Health Information and Support Services', or the 'Chlamydia Screening for Under 25s' end point in Syndication. The details above describe the operation when configured to retrieve 'Sexual Health Information and Support Services' data.
The provided docker-compose.yml runs two containers, one for each possible ETL as described
below. The output files have been set to dev-shis-data
and dev-csu25-data
respectively.
This ensures that during development the output files will all include a dev-
prefix so as not overwrite the production shis-data
and csu25-data
files in Azure.
The output JSON will be an array of objects in the format shown in the Sample SHIS Data file.
Most of the fields are self-explanatory, the three that require further explanation are id
, location
, and venueType
.
id
is only used internally. This is an ID retrieved from Syndication and may change when data is updated.
gsdId
is the Generic Service Directory ID at time of running. This is the ID used in the NHS Services Directory, i.e. 9413748
would be accessed at the URL http://www.nhs.uk/ServiceDirectories/Pages/GenericServiceDetails.aspx?id=9413748.
Location
is in GeoJSON format.
venueType
can be either Pharmacy
, Clinic
, Community
, Other
, ClinicCommunity
or ClinicPharmacy
As the application is being developed, every Pull Request has its own test environment automatically built and deployed to.
Environment variables are expected to be managed by the environment in which the application is being run. This is best practice as described by twelve-factor.
In order to protect the application from starting up without the required env vars in place require-environment-variables is used to check for the env vars that are required for the application to run successfully. This happens during the application start-up. If an env var is not found the application will fail to start and an appropriate message will be displayed.
Environment variables are used to set application level settings for each environment.
Variable | Description | Default | Required |
---|---|---|---|
INITIAL_LAST_RUN_DATE |
Initial run date in YYYY-MM-DD format to use if no previous run detected |
2018-02-20 | |
AZURE_STORAGE_CONNECTION_STRING |
Azure storage connection string | yes | |
CONTAINER_NAME |
Azure storage container name | etl-output | |
DISABLE_SCHEDULER |
set to 'true' to disable the scheduler | false | |
ETL_SCHEDULE |
Time of day to run the upgrade. Syntax | 0 23 * (11:00 pm) | |
LOG_LEVEL |
log level | Depends on NODE_ENV |
|
NODE_ENV |
node environment | development | |
SYNDICATION_API_KEY |
API key to access syndication | yes | |
SYNDICATION_SERVICE_END_POINT |
URL to Syndicate service to scrape | http://v1.syndication.nhschoices.nhs.uk/services/types/sexualhealthinformationandsupport | no |
OUTPUT_FILE |
Filename saved to azure | shis-data | |
ETL_NAME |
ETL name for logging purposes | set to shis-data-etl in the docker compose file |
yes |
The docker-compose.yml
used for development and deployment via Rancher have a similar structure.
A stack is run with two sexual-health-information-etl
images with different configurations.
The convention for environment variables used in the Rancher configuration is to add a SHIS_
or CSU25_
prefix to the
ETL_SCHEDULE
, INITIAL_LAST_RUN_DATE
, OUTPUT_FILE
, and SYNDICATION_SERVICE_END_POINT
environment variables.
These are then mapped to the appropriate suffix-less variable in the container, i.e. for the
'Sexual Health Information and Support Services' container SHIS_ETL_SCHEDULE
is mapped to ETL_SCHEDULE
, and so on.
LOG_LEVEL
must be a number and one of the defined log levelsThis repo uses Architecture Decision Records to record architectural decisions for this project. They are stored in doc/adr.