ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License

Scraping MOJ publications #151

Open seanprivett opened 3 months ago

seanprivett commented 3 months ago

Value

So that we can evaluate the need for cataloguing public data, we want to try scraping MOJ's data publications from GOV.UK. In the process, we will gain experience of working with custom ingestion sources in DataHub.

Hypothesis

  1. If we have publications in the catalogue, data consumers will search for them and make use of them.
  2. Later down the line, when we have charts and lineage, data consumers will have more confidence in the charts if they know where the data came from.

Proposal

  1. Write a custom ingestion that can be run locally
  2. Investigate how to run that ingestion process on CP. This can be split into another ticket if it will involve significant additional work.

How to create the custom ingestion source

There is an existing repo for ingesting from MOJ APIs: https://github.com/ministryofjustice/datahub-custom-api-source

We can either repurpose this or define a separate repo in Terraform at https://github.com/ministryofjustice/data-platform/blob/main/terraform/github/data-catalogue.tf (Note: each ingestion source should be a separate Python package.)

Follow https://datahubproject.io/docs/how/add-custom-ingestion-source/ to create the ingestion source. See https://github.com/ministryofjustice/datahub-custom-api-source/pull/1 for an example

Metadata to include

How to scrape the metadata

The publications we want to pull in are those listed on https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics

There are two views here:

  1. The searchable publications are rendered by the statistics finder with the organisations filter preselected to ministry-of-justice and the document type preselected to "Published statistics".
  2. Publications are grouped into document collections. For example, Offender management statistics quarterly encompasses many releases of the same publication, such as https://www.gov.uk/government/statistics/offender-management-statistics-quarterly-july-to-september-2023

The statistics finder gets its data from the GOV.UK search API and renders it using GOV.UK finder frontend. We can go directly to the search API to get the same metadata (there is also an RSS feed but I don't think this will be good enough as it only includes the most recent publications).

The following URL gets everything returned by the finder:

https://www.gov.uk/api/search.json?filter_organisations=ministry-of-justice&filter_content_store_document_type=national_statistics&filter_content_store_document_type=official_statistics&filter_content_store_document_type=statistical_data_set&filter_content_store_document_type=statistics&fields=document_collections

(1030 publications)
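A minimal sketch of assembling the same URL in Python with the standard library, so the filters are easier to read and change. `build_search_url` and the constant names are hypothetical; the parameter names and values are copied from the URL above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.gov.uk/api/search.json"

# Document types behind the finder's "Published statistics" option,
# copied from the URL in this issue.
PUBLISHED_STATISTICS_TYPES = [
    "national_statistics",
    "official_statistics",
    "statistical_data_set",
    "statistics",
]


def build_search_url(organisation="ministry-of-justice", field="document_collections"):
    """Return the search API URL for an organisation's published statistics."""
    # A list of pairs (rather than a dict) lets the same key repeat,
    # which is how the finder passes several document types.
    params = [("filter_organisations", organisation)]
    params += [
        ("filter_content_store_document_type", doc_type)
        for doc_type in PUBLISHED_STATISTICS_TYPES
    ]
    params.append(("fields", field))
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url())
```

Called with its defaults, this reproduces the URL above character for character.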

The search API is documented at https://www.api.gov.uk/gds/gov-uk-search/#gov-uk-search

Use the pagination options documented here: https://docs.publishing.service.gov.uk/repos/search-api/using-the-search-api.html#pagination
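A rough sketch of paging through the results, assuming the `start` and `count` request parameters and the `results` and `total` response keys behave as described in the pagination docs. `iter_results` and `make_govuk_fetcher` are hypothetical helpers, and the page size of 100 is an arbitrary choice. The HTTP call is passed in as a function so the loop can be tried without hitting GOV.UK.

```python
import json
from typing import Callable, Dict, Iterator
from urllib.request import urlopen


def iter_results(fetch_page: Callable[[int, int], Dict], page_size: int = 100) -> Iterator[Dict]:
    """Yield every search result, requesting one page at a time.

    fetch_page(start, count) is expected to return the decoded JSON body
    of one search API response.
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        results = page.get("results") or []
        yield from results
        start += len(results)
        total = page.get("total")
        # Stop on an empty page, or once the reported total has been reached.
        if not results or (total is not None and start >= total):
            break


def make_govuk_fetcher(search_url: str) -> Callable[[int, int], Dict]:
    """Build a fetch_page function for a search URL that already has a query string."""

    def fetch_page(start: int, count: int) -> Dict:
        with urlopen(f"{search_url}&start={start}&count={count}") as response:
            return json.load(response)

    return fetch_page
```

Something like `list(iter_results(make_govuk_fetcher(search_url)))` would then collect every publication, where `search_url` is the search URL given earlier.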

For each result, the document_collections field links the publication to any document collection it belongs to.
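If we decide to catalogue collections rather than individual releases, the grouping could look roughly like this. It assumes each `document_collections` entry is an object with a `link` key (plain strings are accepted too, in case the field comes back unexpanded), so check that against a real response first. `collection_links` and `group_by_collection` are hypothetical names.

```python
from collections import defaultdict


def collection_links(result):
    """Return the links of the document collections a search result belongs to."""
    links = []
    for collection in result.get("document_collections") or []:
        # Assumption: entries are objects with a "link" key; fall back to
        # treating the entry itself as the link if it is a plain string.
        link = collection.get("link") if isinstance(collection, dict) else collection
        if link:
            links.append(link)
    return links


def group_by_collection(results):
    """Group search results by collection link.

    Results that belong to no collection are grouped under the key None.
    """
    grouped = defaultdict(list)
    for result in results:
        for link in collection_links(result) or [None]:
            grouped[link].append(result)
    return dict(grouped)
```

A publication that belongs to more than one collection appears under each of them, which may or may not be what we want in the catalogue.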

If we need to, each link can be looked up in the GOV.UK content API to get more granular information.

Example content API representation for /government/statistics/offender-management-statistics-quarterly-july-to-september-2023: https://www.gov.uk/api/content/government/statistics/offender-management-statistics-quarterly-july-to-september-2023
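Going by the example above, the content API URL is the page path appended to `https://www.gov.uk/api/content`. A small sketch of that lookup follows; `content_api_url` and `fetch_content` are hypothetical helpers, and the shape of the JSON returned is not assumed here.

```python
import json
from urllib.request import urlopen

CONTENT_API_ROOT = "https://www.gov.uk/api/content"


def content_api_url(base_path):
    """Return the content API URL for a GOV.UK page path such as /government/statistics/..."""
    if not base_path.startswith("/"):
        base_path = "/" + base_path
    return CONTENT_API_ROOT + base_path


def fetch_content(base_path):
    """Fetch and decode the content API representation of a page (makes a network call)."""
    with urlopen(content_api_url(base_path)) as response:
        return json.load(response)
```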

Note: the logic for mapping "published statistics" to a list of document types can be found here: https://github.com/alphagov/finder-frontend/blob/86632663013338ab86cbf66a39088e4adc6c852d/app/models/filters.rb#L9

Definition of done

To be discussed

jemnery commented 3 months ago

Root page for statistics: https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics

jemnery commented 3 months ago

Very thorough :+1: A couple of observations:

Is this the behaviour we want? Or do we want OMSQ to be one catalogue entry, with all of its quarterly releases listed under it as datasets? I think to do that we need to scrape this page.

MatMoore commented 3 months ago

Will bring up the question of granularity in tomorrow's standup.

I've updated the description to include details of the GOV.UK content API, which we could use in conjunction with search if we need metadata not included in the search response, i.e. anything not declared here. Anything that is rendered on the page will be exposed via that API in JSON format.

Also included some information about the linkage between individual releases of statistical publications and their "document collections" in case we decide to catalogue the collection rather than every release.

alex-vonfeldmann commented 3 months ago

sorry, late to the game. i got a bit lost understanding what we mean by 'scraping' since that could mean any and all content, or just some. my gut reaction is that we should have 1 entry per unique publication - telling users 'this publication exists and it gets issued quarterly and it's at this address [url] and it's of statistics quality' - and ideally it should show users what datapoints/measures are being presented there - e.g. prison population, remand prison population, number of recalls, etc. and if users are interested in any of these they click through to the browser and take it from there.