Data processing and microsite for Leeds 2023
NEW: This repo contains a `velociraptor.yaml` file to capture scripts. Take a look at https://velociraptor.run/. More documentation to come...
The repo contains a series of pipelines which are used to collect and process data.
If you are running the Python scripts, you will need to install the dependencies listed in `requirements.txt`.

You will also need to set `PYTHONPATH` in your environment to include `scripts`. On a Mac, this can be achieved with the following command: `export PYTHONPATH=scripts`. Without that, the scripts will not run and will throw an error similar to this: `ModuleNotFoundError: No module named 'metrics'`.
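Putting those steps together, a session might look like the following sketch (the script name `scripts/example_script.py` is hypothetical; substitute any script in the repo):

```sh
# Make the shared modules under scripts/ (e.g. metrics) importable
export PYTHONPATH=scripts

# Run one of the pipeline scripts (hypothetical name; use any script in scripts/)
python3 scripts/example_script.py
```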
Some of the scripts and data are managed in a DVC pipeline. DVC has been added to the `requirements.txt` file, so ensure that your Python environment has the required dependencies installed. This could be as simple as running `pip3 install -r requirements.txt`. It's recommended to use a virtual environment tool such as `virtualenv` to avoid clashing requirements.
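For example, a minimal setup from a fresh checkout might look like this (the `.venv` directory name is just a convention, and `virtualenv` is assumed to be installed already):

```sh
# Create and activate an isolated environment
virtualenv .venv
source .venv/bin/activate

# Install the project dependencies, including DVC
pip3 install -r requirements.txt
```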
The repo uses data held in AWS S3 buckets. To access this, make sure `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are set in your environment.
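For example, in a shell session (the values are placeholders; use your own credentials):

```sh
# Credentials for the S3 buckets (placeholder values)
export AWS_ACCESS_KEY_ID=<your access key id>
export AWS_SECRET_ACCESS_KEY=<your secret access key>
```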
Here are some useful DVC commands:

- `dvc status`: show which stages have changed and need to be reproduced.
- `dvc update -R working`: update the imported data under the `working` directory.
- `dvc repro -P`: reproduce all pipelines. If no stage dependencies (input files or code) have changed, nothing will be executed.
- `dvc stage list --all`: list all stages.
- `dvc dag`: see the dependency graph.
- `dvc repro --force <stage name>`: force a stage to run, even if its dependencies haven't changed.
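As a sketch of a typical update cycle combining the commands above:

```sh
# See which stages are out of date
dvc status

# Refresh imported data under the working directory
dvc update -R working

# Re-run any pipeline stages whose dependencies have changed
dvc repro -P
```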