nhsuk / sexual-health-information-etl

ETL to retrieve sexual health information and support services from syndication and store as JSON
MIT License
0 stars 1 forks source link
docker-image

Sexual Health Information ETL

ETL to retrieve 'Sexual Health Information and Support Services', and 'Chlamydia Screening for Under 25s' data from syndication and store as JSON

GitHub Release Greenkeeper badge Build Status Coverage Status

Installation

Clone the repo: git clone https://github.com/nhsuk/sexual-health-information-etl.git and review the scripts to get up and running.

Testing

The application uses Docker to run in containers. Development is typically done on the host machine. Files are loaded into the container and changes are automatically updated.

Use the test script for continuous testing during development.

Run process

In order for the process to access the syndication feed an API key is required. Details of registration are available on NHS Choices. The application needs the API key available within the environment as the variable SYNDICATION_API_KEY.

The output is uploaded to Azure Blob Storage, a suitable connection string should be set in the AZURE_STORAGE_CONNECTION_STRING variable. For further details see Azure Blob Storage.

The ETL calls the modifiedsince end point to determine if the ETL needs to be run. If the endpoint returns a 404 error, i.e page not found, there is no updated data and the ETL will not run.

If a 200 code is returned the data is considered to have changed, and all data will be reloaded via the all end point. All the data is reloaded when a change is made to the Sexual Health Services data, and the ID for a record may change when reloaded. An incremental update approach is not possible due to IDs not being consistent.

The date of the most recently modified file in blob storage beginning shis-data* is used as the last run date.

If no file is present the last known good run date of 20/02/2018 will be used. This date may be changed by the setting enviroment variable INITIAL_LAST_RUN_DATE.

The ETL version from the package.json is included along with a datestamp to enable a full rescan if the data structure changes, as this will be reflected in the application version.

Once the initial scan is complete, failed records will be revisited. IDs for records still failing after the second attempt are listed in a shis-data-summary.json* file.

Running scripts/start will bring up a docker container and initiate the scrape at a scheduled time, GMT. The default is 11pm. The time of the scrape can be overridden by setting the environment variable ETL_SCHEDULE. e.g. export ETL_SCHEDULE='25 15 * * *' will start the processing at 3:25pm. Note: the container time is GMT and does not take account of daylight saving, you may need to subtract an hour from the time if it is currently BST.

During local development it is useful to run the scrape at any time. This is possible by running node app.js (with the appropriate env vars set).

Further details on node-schedule available here

A successful scrape will result in the file shis-data.json* being written to the output folder and to the Azure Blob Storage location specified in the environmental variables.

The files uploaded to Azure Blob Storage are:

shis-data-summary-YYYYMMDD-VERSION.json*

shis-data-YYYYMMDD-VERSION.json*

shis-data.json*

where YYYYMMDD is the current year, month and date, and VERSION is the current major version of the ETL as defined in the package.json.

Development Notes

The ETL may be configured to collect data from the 'Sexual Health Information and Support Services', or the 'Chlamydia Screening for Under 25s' end point in Syndication. The details above describe the operation when configured to retrieve 'Sexual Health Information and Support Services' data.

The provided docker-compose.yml runs two containers, one for each possible ETL as described below. The output files have been set to dev-shis-data and dev-csu25-data respectively.

This ensures that during development the output files will all include a dev- prefix so as not overwrite the production shis-data and csu25-data files in Azure.

Structure of JSON Data

The output JSON will be an array of objects in the format shown in the Sample SHIS Data file. Most of the fields are self-explanatory, the three that require further explanation are id, location, and venueType.

id is only used internally. This is an ID retrieved from Syndication and may change when data is updated.

gsdId is the Generic Service Directory ID at time of running. This is the ID used in the NHS Services Directory, i.e. 9413748 would be accessed at the URL http://www.nhs.uk/ServiceDirectories/Pages/GenericServiceDetails.aspx?id=9413748.

Location is in GeoJSON format.

venueType can be either Pharmacy, Clinic, Community, Other, ClinicCommunity or ClinicPharmacy

Test environments

As the application is being developed, every Pull Request has its own test environment automatically built and deployed to.

Environment variables

Environment variables are expected to be managed by the environment in which the application is being run. This is best practice as described by twelve-factor.

In order to protect the application from starting up without the required env vars in place require-environment-variables is used to check for the env vars that are required for the application to run successfully. This happens during the application start-up. If an env var is not found the application will fail to start and an appropriate message will be displayed.

Environment variables are used to set application level settings for each environment.

Variable Description Default Required
INITIAL_LAST_RUN_DATE Initial run date in YYYY-MM-DD format to use if no previous run detected 2018-02-20
AZURE_STORAGE_CONNECTION_STRING Azure storage connection string yes
CONTAINER_NAME Azure storage container name etl-output
DISABLE_SCHEDULER set to 'true' to disable the scheduler false
ETL_SCHEDULE Time of day to run the upgrade. Syntax 0 23 * (11:00 pm)
LOG_LEVEL log level Depends on NODE_ENV
NODE_ENV node environment development
SYNDICATION_API_KEY API key to access syndication yes
SYNDICATION_SERVICE_END_POINT URL to Syndicate service to scrape http://v1.syndication.nhschoices.nhs.uk/services/types/sexualhealthinformationandsupport no
OUTPUT_FILE Filename saved to azure shis-data
ETL_NAME ETL name for logging purposes set to shis-data-etl in the docker compose file yes

Docker Compose Structure for Deployment and Development

The docker-compose.yml used for development and deployment via Rancher have a similar structure. A stack is run with two sexual-health-information-etl images with different configurations.

The convention for environment variables used in the Rancher configuration is to add a SHIS_ or CSU25_ prefix to the ETL_SCHEDULE, INITIAL_LAST_RUN_DATE, OUTPUT_FILE, and SYNDICATION_SERVICE_END_POINT environment variables. These are then mapped to the appropriate suffix-less variable in the container, i.e. for the 'Sexual Health Information and Support Services' container SHIS_ETL_SCHEDULE is mapped to ETL_SCHEDULE, and so on.

FAQ

Architecture Decision Records

This repo uses Architecture Decision Records to record architectural decisions for this project. They are stored in doc/adr.