ripl-org / historical-ndc

A data-processing framework for downloading and collating historical archives of the National Drug Code Directory.

Assembling a Historical National Drug Code Directory from the Internet Archive

This repository provides a data-processing pipeline for downloading and collating historical archives of the National Drug Code (NDC) Directory from the Internet Archive. The NDC Directory is published only as a current snapshot, which omits previously registered drugs. The drug classification provided by the Directory has also changed over time. As a result, in research with historical pharmacy claims data, the share of claims that cannot be matched by NDC to drug data in the current Directory grows the further back in time the claims go. The goal of this framework is to build an open-source, comprehensive list of historical NDCs linked to their last known drug data and classification. As an example, we use the historical list of active ingredients generated by the pipeline to create a classification of opioid drugs and of recovery drugs used in medication-assisted treatment for opioid use disorder.
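To illustrate the matching problem described above, here is a minimal sketch of joining claims to an NDC list with pandas and measuring how many claims go unmatched. The column names (`claim_id`, `ndc`, `ingredient`) and all values are made-up placeholders, not the pipeline's actual output schema.

```python
import pandas as pd

# Hypothetical claims data. NDCs are stored as strings so that
# leading zeros are preserved.
claims = pd.DataFrame({
    "claim_id": [1, 2, 3],
    "ndc": ["00000000001", "00000000002", "00000000003"],
})

# Hypothetical excerpt of an NDC list (placeholder values only).
ndc_list = pd.DataFrame({
    "ndc": ["00000000001", "00000000002"],
    "ingredient": ["INGREDIENT A", "INGREDIENT B"],
})

# A left join keeps every claim; the indicator column records
# whether a matching NDC was found in the list.
matched = claims.merge(ndc_list, on="ndc", how="left", indicator=True)

# Claims whose NDC is absent from the list.
unmatched = matched[matched["_merge"] == "left_only"]

# Fraction of claims that matched an NDC in the list.
match_rate = (matched["_merge"] == "both").mean()
```

Running the same join against the current Directory and against a historical list would show how much of the unmatched share the historical list recovers.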

The pipeline was developed to support research projects at Research Improving People's Lives.

Installation

Requires:

Note: we have patched scons to support Python 3.6+. The patched version is available in a fork at https://github.com/mhowison/scons/releases as release "3.0.1-hotfix1".

We recommend installing the required packages using Anaconda Python. First download and install an Anaconda or Miniconda distribution. Then run the command:

conda create -n historical-ndc -c ripl-org python=3.7 pandas requests scons

This will create a new environment historical-ndc with the patched version of scons available from our Anaconda channel. Load the environment with:

source activate historical-ndc

Run

The run order and dependencies of the scripts are specified in the SConstruct file. The entire pipeline can be run by executing the scons command from the root directory of the repo.
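For readers unfamiliar with scons, the following is a hedged sketch of how an SConstruct file can declare run order through dependencies. It is not the repository's actual SConstruct: the script names and file paths are hypothetical placeholders. The file is read by the `scons` command, which supplies `Environment`, so it is not meant to be run directly with `python`.

```python
# SConstruct (illustrative sketch only; all names below are hypothetical)
import os

# Pass the shell environment through so the conda environment's
# python is found when scons runs each command.
env = Environment(ENV=os.environ)

# Step 1: a hypothetical download script writes an intermediate file.
env.Command(
    target="scratch/raw.csv",
    source="step1_download.py",
    action="python $SOURCE $TARGET",
)

# Step 2: a hypothetical collation script reads the intermediate file.
# Listing scratch/raw.csv as a source makes scons run step 1 first.
env.Command(
    target="output/collated.csv",
    source=["step2_collate.py", "scratch/raw.csv"],
    action="python $SOURCES $TARGET",
)
```

With declarations like these, scons rebuilds a target only when one of its sources has changed, so rerunning `scons` after editing one script reruns only the affected steps.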

Output

The pipeline will generate three output files in the output/ subdirectory:

Organization