Open seanprivett opened 3 months ago
Root page for statistics: https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics
Very thorough :+1: A couple of observations:
Is this the behaviour we want? Or do we want OMSQ to be one catalogue entry, with all of its quarterly releases listed under it as datasets? I think to do that we need to scrape this page.
Will bring up the question of granularity in tomorrow's standup.
I've updated the description to include details of the GOV.UK content API, which we could use in conjunction with search if we need metadata not included in the search response, i.e. anything not declared here. Anything that is rendered on the page will be exposed via that API in JSON format.
Also included some information about the linkage between individual releases of statistical publications and their "document collections" in case we decide to catalogue the collection rather than every release.
sorry, late to the game. i got a bit lost understanding what we mean by 'scraping', since that could mean any and all content, or just some. my gut reaction is that we should have 1 entry per unique publication, telling users "this publication exists, it gets issued quarterly, it's at this address [url] and it's of statistics quality". ideally it should also show users what datapoints/measures are being presented there, e.g. prison population, remand prison population, number of recalls, etc. if users are interested in any of these they click through to the browser and take it from there.
Value
So that we can evaluate the need for cataloguing public data, we want to try scraping MOJ's data publications from GOV.UK. In the process we will gain experience of working with custom ingestion sources in DataHub.
Hypothesis
Proposal
How to create the custom ingestion source
There is an existing repo for ingesting from MOJ APIs https://github.com/ministryofjustice/datahub-custom-api-source
We can either repurpose this or terraform a separate repo in https://github.com/ministryofjustice/data-platform/blob/main/terraform/github/data-catalogue.tf (Note: each ingestion source should be a separate Python package)
Follow https://datahubproject.io/docs/how/add-custom-ingestion-source/ to create the ingestion source. See https://github.com/ministryofjustice/datahub-custom-api-source/pull/1 for an example
Metadata to include
{ condition: EQUAL, field: "typeNames", values:"publication"}
How to scrape the metadata
The publications we want to pull in are those listed on https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics
There are two views here:
organisations filter preselected to ministry-of-justice and the document type preselected to "Published statistics".
The statistics finder gets its data from the GOV.UK search API and renders it using GOV.UK finder frontend. We can go directly to the search API to get the same metadata (there is also an RSS feed, but I don't think this will be good enough as it only includes the most recent publications).
The following URL gets everything returned by the finder:
https://www.gov.uk/api/search.json?filter_organisations=ministry-of-justice&filter_content_store_document_type=national_statistics&filter_content_store_document_type=official_statistics&filter_content_store_document_type=statistical_data_set&filter_content_store_document_type=statistics&fields=document_collections
(1030 publications)
The search API is documented at https://www.api.gov.uk/gds/gov-uk-search/#gov-uk-search
Use the pagination options documented here: https://docs.publishing.service.gov.uk/repos/search-api/using-the-search-api.html#pagination
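A minimal sketch of paging through the search results, assuming the start and count query parameters from the pagination docs and a response body with "results" and "total" keys. The function names and the injectable fetch argument are hypothetical and only there to illustrate the approach; the filter parameters are copied from the URL above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.gov.uk/api/search.json"

# Same filters and fields as the finder URL quoted in this issue.
BASE_PARAMS = [
    ("filter_organisations", "ministry-of-justice"),
    ("filter_content_store_document_type", "national_statistics"),
    ("filter_content_store_document_type", "official_statistics"),
    ("filter_content_store_document_type", "statistical_data_set"),
    ("filter_content_store_document_type", "statistics"),
    ("fields", "document_collections"),
]


def build_search_url(start, count):
    """Build the search URL for one page (start/count are assumed parameter names)."""
    return SEARCH_URL + "?" + urlencode(BASE_PARAMS + [("start", start), ("count", count)])


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_search_results(fetch=fetch_json, page_size=100):
    """Yield every search result, requesting one page at a time.

    Assumes each response looks like {"results": [...], "total": N}.
    """
    start = 0
    while True:
        page = fetch(build_search_url(start, page_size))
        results = page.get("results", [])
        if not results:
            break
        yield from results
        start += len(results)
        if start >= page.get("total", 0):
            break
```

Calling list(iter_search_results()) would make live requests to GOV.UK; passing a stub as fetch keeps the paging logic testable offline.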
For each result, the document_collections field links the publication to any document collection it belongs to.
If we need to, then each link can be looked up in the GOV.UK content API to get more granular information such as
Example content API representation for /government/statistics/offender-management-statistics-quarterly-july-to-september-2023: https://www.gov.uk/api/content/government/statistics/offender-management-statistics-quarterly-july-to-september-2023
Note: the logic for mapping "published statistics" to a list of document types can be found here: https://github.com/alphagov/finder-frontend/blob/86632663013338ab86cbf66a39088e4adc6c852d/app/models/filters.rb#L9
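A rough sketch of the content API lookup, following the URL pattern in the example above (the page path appended to https://www.gov.uk/api/content). The helper names are hypothetical, and the JSON keys read here (title, description, document_type, first_published_at, links.document_collections, base_path) are assumptions about the response shape that should be checked against a real response before relying on them.

```python
import json
from urllib.request import urlopen

CONTENT_API_ROOT = "https://www.gov.uk/api/content"


def content_api_url(base_path):
    """Map a GOV.UK page path to its content API URL."""
    return CONTENT_API_ROOT + "/" + base_path.lstrip("/")


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def summarise_content_item(item):
    """Pick out a few fields from a content item.

    The key names are assumptions; missing keys come back as None
    or an empty list rather than raising.
    """
    links = item.get("links", {})
    return {
        "title": item.get("title"),
        "description": item.get("description"),
        "document_type": item.get("document_type"),
        "first_published_at": item.get("first_published_at"),
        "collections": [
            collection.get("base_path")
            for collection in links.get("document_collections", [])
        ],
    }


def lookup(base_path, fetch=fetch_json):
    """Fetch one page's content item and summarise it."""
    return summarise_content_item(fetch(content_api_url(base_path)))
```

Keeping the summarising step separate from the fetch means the field mapping can be adjusted in one place if we decide to catalogue collections rather than individual releases.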
Definition of done
To be discussed