ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Add Airflow DAG to pull Nominet data and upload to HDFS #104

Closed anjackson closed 1 year ago

anjackson commented 1 year ago

This Docker command can be used to fetch recent Nominet data and upload it to HDFS; it should be run monthly from an Airflow DAG:

docker run ukwa/ukwa-manage python -m lib.store.nominet

But this needs to run somewhere that can SFTP to the outside world and talk to the H3 HDFS API, so it needs to run on e.g. a crawler machine rather than directly on the Docker Swarms. It's not clear how best to do this. Some options:

  1. It could use an SSH connection to log into a crawler (i.e. using a passwordless key-pair or some kind of Airflow secret) and then run a bash command that happens to run the Docker command we need.
  2. It could talk to the Docker daemon on the crawler and run the job directly, which is simpler to code up but means exposing the Docker daemon port (which we've not really done before).
  3. Ask for the Prod Swarm servers to be allowed to make this outgoing secure connection directly.
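Option 1 above can be sketched with stdlib Python. This is a minimal illustration, not the actual implementation: the `ingest` user and `crawler06` host are taken from the discussion below, but the function and flags are otherwise assumptions.

```python
# Sketch of option 1: build the ssh invocation an Airflow task would run
# to trigger the Nominet transfer on a crawler machine.
import shlex

REMOTE_COMMAND = "docker run ukwa/ukwa-manage python -m lib.store.nominet"

def build_ssh_command(user="ingest", host="crawler06"):
    # BatchMode=yes forces non-interactive, key-based auth, matching the
    # passwordless key-pair approach described in option 1.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", REMOTE_COMMAND]

print(shlex.join(build_ssh_command()))
```

The remote user would need to be in the `docker` group so the `docker run` part succeeds without sudo.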
anjackson commented 1 year ago

Using an SSHOperator to run the Docker command is likely the best bet. I've created an ingest user in the docker group on crawler06 for this purpose, but it still needs a password/certificate setting up and adding as an Airflow secret.

anjackson commented 1 year ago

Implemented and working.