Spike: explore value of scraping MOJ publications

seanprivett commented 8 months ago

Value

So that we can evaluate the need for cataloguing public data, we want to try scraping MOJ's data publications from GOV.UK. In the process we will gain experience of working with custom ingestion sources in DataHub.

Hypothesis

If we have publications in the catalogue, data consumers will search for them and make use of them
Later down the line, when we have charts & lineage, data consumers will have more confidence in the charts if they know where the data came from

Proposal

Write a custom ingestion that can be run locally
Investigate how to run that ingestion process on CP. This can be split into another ticket if it will involve significant additional work.

How to create the custom ingestion source

There is an existing repo for ingesting from MOJ APIs https://github.com/ministryofjustice/datahub-custom-api-source

Can either repurpose this or terraform a separate repo in https://github.com/ministryofjustice/data-platform/blob/main/terraform/github/data-catalogue.tf (Note: each ingestion source should be a separate python package)

Follow https://datahubproject.io/docs/how/add-custom-ingestion-source/ to create the ingestion source. See https://github.com/ministryofjustice/datahub-custom-api-source/pull/1 for an example

Metadata to include

Use the Dataset entity type
Tag as: "publicly available"
The description for each publication will match the text displayed in the finder
See if we can use dataset subTypes to model publications (or publication formats) as a specialisation of dataset
- make sure we can filter on sub type: { condition: EQUAL, field: "typeNames", values:"publication"}
- if not, we can instead add a "publication" tag to distinguish publications and regular datasets
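A minimal sketch of the metadata shape described above, using plain dictionaries rather than the DataHub SDK. The platform name `govuk`, the helper names and the dictionary layout are assumptions for illustration; only the dataset URN layout (`urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`) follows DataHub's documented format.

```python
# Assumed platform name, for illustration only; nothing in this ticket fixes it.
PLATFORM = "govuk"


def dataset_urn(path: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN from a GOV.UK path such as '/government/statistics/foo'."""
    name = path.strip("/").replace("/", ".")
    return f"urn:li:dataset:(urn:li:dataPlatform:{PLATFORM},{name},{env})"


def publication_metadata(result: dict) -> dict:
    """Draft the metadata for one publication from a search API result."""
    return {
        "urn": dataset_urn(result["link"]),
        # Description matches the text displayed in the finder.
        "description": result.get("description", ""),
        # Sub type used to distinguish publications from regular datasets.
        "subTypes": {"typeNames": ["publication"]},
        "tags": ["publicly available"],
    }
```

If filtering on `typeNames` turns out not to work, the same draft could carry an extra "publication" tag instead of the sub type.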

How to scrape the metadata

The publications we want to pull in are those listed on https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics

There are two views here:

The searchable publications are rendered by the statistics finder with the organisations filter preselected to ministry-of-justice and the document type preselected to "Published statistics".
Publications are grouped into document collections, e.g. Offender management statistics quarterly encompasses many releases of the same publication, e.g. https://www.gov.uk/government/statistics/offender-management-statistics-quarterly-july-to-september-2023

The statistics finder gets its data from the GOV.UK search API and renders it using GOV.UK finder frontend. We can go directly to the search API to get the same metadata (there is also an RSS feed but I don't think this will be good enough as it only includes the most recent publications).

The following URL gets everything returned by the finder:

https://www.gov.uk/api/search.json?filter_organisations=ministry-of-justice&filter_content_store_document_type=national_statistics&filter_content_store_document_type=official_statistics&filter_content_store_document_type=statistical_data_set&filter_content_store_document_type=statistics&fields=document_collections

(1030 publications)

The search API is documented at https://www.api.gov.uk/gds/gov-uk-search/#gov-uk-search

Use the pagination options documented here: https://docs.publishing.service.gov.uk/repos/search-api/using-the-search-api.html#pagination
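As a rough sketch of how the pagination could be driven, the loop below advances `start` by the number of results received until the reported `total` is reached. The function names and the default page size are assumptions; the `results` and `total` keys are taken from the search API's JSON response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_API = "https://www.gov.uk/api/search.json"


def fetch_page(params: dict) -> dict:
    """Fetch one page of results from the search API."""
    # doseq=True turns a list value into one key=value pair per item.
    query = urlencode(params, doseq=True)
    with urlopen(f"{SEARCH_API}?{query}") as response:
        return json.load(response)


def fetch_all(base_params: dict, page_size: int = 500, fetch=fetch_page) -> list:
    """Collect every result by advancing `start` until `total` is reached."""
    results = []
    start = 0
    while True:
        page = fetch({**base_params, "count": page_size, "start": start})
        batch = page.get("results", [])
        results.extend(batch)
        start += len(batch)
        # Also stop on an empty page, so a shrinking total cannot loop forever.
        if not batch or start >= page.get("total", 0):
            return results
```

The `fetch` argument exists so the loop can be exercised with a stub instead of calling GOV.UK.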

For each result, the document_collections field links the publication to any document collection it belongs to.

If we need to, each link can be looked up in the GOV.UK content API to get more granular information, such as:

Links and content types of all "attachments" shown on the page, such as CSV, Excel spreadsheets
The full body of the page, rendered in html (after "govspeak" syntax has been translated to html)

Example content API representation for /government/statistics/offender-management-statistics-quarterly-july-to-september-2023: https://www.gov.uk/api/content/government/statistics/offender-management-statistics-quarterly-july-to-september-2023
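A small sketch of pulling attachment details out of a content API response. It assumes the attachments sit under `details.attachments` with `title`, `url` and `content_type` keys; that layout should be checked against the example response before relying on it. Both function names are hypothetical.

```python
CONTENT_API = "https://www.gov.uk/api/content"


def content_api_url(path: str) -> str:
    """Map a GOV.UK path (as returned by search) to its content API URL."""
    return CONTENT_API + path


def list_attachments(content_item: dict) -> list:
    """Return (title, url, content_type) for each attachment on a page.

    Assumes attachments are listed under details.attachments.
    """
    attachments = content_item.get("details", {}).get("attachments", [])
    return [
        (a.get("title"), a.get("url"), a.get("content_type"))
        for a in attachments
    ]
```

Missing keys come back as `None` rather than raising, since not every page is guaranteed to have attachments.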

Note: the logic for mapping "published statistics" to a list of document types can be found here: https://github.com/alphagov/finder-frontend/blob/86632663013338ab86cbf66a39088e4adc6c852d/app/models/filters.rb#L9

Definition of done

[ ] 1030 publications ingested into dev/test datahub instances
[ ] We can retrigger the ingestion at any time
[ ] We understand how we could configure/schedule the ingestion from the datahub UI
[ ] If we decide to proceed, then raise a ticket for the implementation

To be discussed

[ ] Do we want one catalogue entry per publication, or one catalogue entry per release of a publication? E.g. https://www.gov.uk/government/collections/proven-reoffending-statistics vs https://www.gov.uk/government/statistics/offender-management-statistics-quarterly-july-to-september-2023
[ ] Should each format of a published release (HTML, CSV, PDF, XLSX, ODF) be a separate catalogue entry? Do we make one format the canonical entry? Or do we just link out to the GOV.UK page that includes all formats?

jemnery commented 8 months ago

Root page for statistics: https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics

jemnery commented 8 months ago

Very thorough :+1: A couple of observations:

The search API is definitely more convenient and readable than creating a web scraper. But the drawback is it only lists the latest instance of a publication (e.g. "Offender management statistics quarterly: July to September 2023") not the "parent" publication or collection. So for example OMSQ - we'd be cataloguing this URL instead of the "home" URL

Is this the behaviour we want? Or do we want OMSQ to be one catalogue entry, with all of its quarterly releases listed under it as datasets? I think to do that we need to scrape this page.

If we do want to catalogue each release of a publication separately, is a single publication strictly speaking a dataset? A release of something like the reoffending stats has a number of assets (HTML doc, PDF bulletin, various Excel and CSV files)

MatMoore commented 8 months ago

Will bring up the question of granularity in tomorrow's standup.

I've updated description to include details of the GOV.UK content API, which we could use in conjunction with search if we need metadata not included in the search response, i.e. anything not declared here. Anything that is rendered on the page will be exposed via that API in a JSON format.

Also included some information about the linkage between individual releases of statistical publications and their "document collections" in case we decide to catalogue the collection rather than every release.

alex-vonfeldmann commented 8 months ago

sorry, late to the game. i got a bit lost understanding what we mean by 'scraping' since that could mean any and all content, or just some. my gut reaction is that we should have 1 entry per unique publication - telling users 'this publication exists and it gets issued quarterly and it's at this address [url] and it's of statistics quality' - and ideally it should show users what datapoints/measures are being presented there - eg prison population, remand prison population, number of recalls, etc. and if users are interested in any of these they click through to the browser and take it from there.

jemnery commented 4 days ago

Question from refinement - are there any publications not listed in Justice Data?

murdo-moj commented 1 day ago

Some publications are arranged in document collections, some aren't.

To split the files into a more digestible number, here are the collections, followed by the documents with no collection.

Collections (51): {'Accredited programmes annual bulletin', 'Ad hoc justice statistics', 'Alcohol and drug misuse and treatment statistics', 'Antisocial behaviour', 'Civil justice statistics', 'Civil justice statistics quarterly', 'Compendium of re-offending', 'Coroners and burials statistics', 'Court statistics (quarterly)', 'Crime statistics', 'Criminal court statistics', 'Criminal justice statistics', 'Criminal justice statistics quarterly', 'Death of offenders in the community', 'Electronic Monitoring Statistics Publication', 'Ethnicity and the criminal justice system', 'Family Court Statistics Quarterly', 'Freedom of Information statistics', 'Gender Recognition Certificate statistics', 'HM Prison and Probation Service COVID-19 statistics monthly', 'HM Prison and Probation Service workforce statistics', 'HMPPS COVID-19 weekly data', 'HMPPS annual offender equalities report', 'HMPPS annual staff equalities report', 'Hate crime statistics', 'Judicial and court statistics', 'Judicial diversity statistics', 'Justice Data Lab statistics', 'Knife and offensive weapon sentencing statistics', 'Legal aid statistics', 'Legal aid statistics data files', 'Local adult reoffending', 'Mortgage and landlord possession statistics', 'Multi-agency public protection arrangements (MAPPA) annual report', 'New criminal offences statistics', 'Offender management statistics quarterly', 'Payment by results statistics', 'Prison and Probation Performance Statistics', 'Prison population statistics', 'Prisons and probation statistics', 'Probation Service workforce quarterly reports', 'Proven reoffending statistics', 'Restricted Patients Statistics, England and Wales', 'Safety in custody statistics', 'Statistics on privacy injunctions', 'Statistics on public disorder of 6-9 August 2011', 'Topical criminal justice publications', 'Tribunals statistics', 'Use of language interpreter and translation services in courts and tribunals statistics', 'Women and the criminal justice system', 'Youth Justice Statistics'}

Lone indexes (73): ['Ministry of justice', 'Tribunals statistics quarterly april to june 2024', 'Mortgage and landlord possession statistics april to june 2024', 'Hmpps annual digest april 2023 to march 2024', 'Prison education and accredited programme statistics 2023 to 2024', 'Diversity of the judiciary 2024 statistics', 'Hmpps offender equalities annual report 2022 to 2023', 'Estimates of children with a parent in prison', 'Prison population projections 2023 to 2028', 'Unpaid work management information', 'Prison education and accredited programme statistics 2022 to 2023', 'Offender accommodation outcomes update to march 2024', 'Prison education statistics 2019 2020', 'Justice in numbers summary tables', 'Offender employment outcomes update to march 2024', 'Prison population projections 2022 to 2027', 'Story of the prison population 1993 to 2020', 'Annual hm prison and probation service digest 2017 to 2018', 'Hmpps annual digest april 2020 to march 2021', 'Ministry of justice statistics policy and procedures', 'Prison education and accredited programme statistics 2021 2022', 'Prison population projections 2021 to 2026', 'Statistics of mentally disordered offenders ns', 'Licence recalls and returns to custody', 'Story of the adjudications 2011 to 2018', 'Community rehabilitation companies workforce information report quarter 3 2014 to 2015', 'Judicial and court statistics 2006', 'Prison population projections 2020 to 2026', 'Ad hoc accommodation following release from custody', 'Average time from arrest to sentence for persistent young offenders ns', 'Conviction histories statistics and data', 'Motoring offences and breath test statistics ns', 'Statistics policy and procedures 2', 'Community performance quarterly mi update to june 2016', 'Judicial diversity statistics 2019', 'Judicial selection and recommendations for appointment statistics april 2016 to march 2017', 'Review on quality in probation supervision', 'Coroners and burials statistics and data eighth summary', 'Time intervals for criminal proceedings in magistrates courts ns', 'Information rights tracker surveys', 'Statistics archive', 'Burial grounds the results of a survey of burial grounds in england and wales', 'Company winding up and bankruptcy petition statistics', 'End of custody licence releases and recalls statistics', 'Local variations in sentencing in england and wales', 'Her majestys courts service court user survey', 'Multi agency public protection arrangements annual report', 'Topics for comment', 'Ninth summary of coroners reports to prevent future deaths', 'Interim re conviction figures for the peterborough and doncaster payment by results pilots', 'Exceptional case funding in legal aid statistics', 'Statistical notice criminal legal aid statistics oct 2012 to sep 2013', 'Electronic monitoring statistics publication december 2023', 'Ad hoc statistical release covering exceptional case funding in legal aid statistics', 'Intention to publish further breakdowns of reoffences by type of reoffence', 'Future publication legal aid statistics 2013 to 2014', 'Future publication exceptional case funding statistics april to june 2014', 'Future publication legal aid statistics quarterly from april to june 2014', 'Future publication legal aid statistics quarterly july to september 2014 incorporating barrister fee payments from public sources in 2013 to 2014', 'Ad hoc community payback on rapid deployment projects', 'Judicial diversity statistics 2016', 'Multi agency public protection arrangements mappa annual 2023 to 2024', 'Changes to moj statistics case progression', 'Judicial diversity statistics 2018', 'Judicial selection and recommendations for appointment statistics april 2017 to march 2018', 'Judicial diversity statistics 2017 2', 'Crown court sentencing survey annual publication january to december 2014', 'Judicial selection and recommendations for appointment statistics april 2015 to march 2016', 'Mmpr data collection april to september 2014', 'Intention to publish re conviction results for pbr pilots', 'Crown court sentencing survey annual publication january to december 2013', 'Employment rates following release from custody ad hoc', 'Judicial diversity statistics 2015']
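One way a split like the one above could be produced from search results is sketched here. It is not necessarily how these lists were generated: it assumes each result carries a `link` path and a `document_collections` list whose entries have a `title`, and the helper names are hypothetical.

```python
def humanise(link: str) -> str:
    """Turn '/government/statistics/some-slug-2019' into 'Some slug 2019'."""
    slug = link.rstrip("/").rsplit("/", 1)[-1]
    return slug.replace("-", " ").capitalize()


def split_by_collection(results: list) -> tuple:
    """Return (set of collection titles, list of documents with no collection)."""
    collections = set()
    lone = []
    for result in results:
        linked = result.get("document_collections") or []
        if linked:
            collections.update(c["title"] for c in linked)
        else:
            # No collection: fall back to a readable name derived from the slug.
            lone.append(humanise(result["link"]))
    return collections, lone
```

A document that belongs to several collections contributes each of their titles to the set.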

murdo-moj commented 1 day ago

This is my parameter slug:

url = "https://www.gov.uk/api/search.json"
params = {
    "filter_organisations": "ministry-of-justice",
    "filter_content_store_document_type": [
        "national_statistics",
        "official_statistics",
        # These options don't add anything useful
        # "statistical_data_set",
        # "statistics"
    ],
    "fields": "document_collections",
    # Maximum results per API call is 1500
    "count": 1500,
    "start": 0
}

murdo-moj commented 1 day ago

Note CJS stats are in this list, which we already ingest from the CJS dashboard.

murdo-moj commented 10 hours ago

There's a lot of good data we do not have - we should ingest all of these publications to DataHub. Contacts and Domains for these items have imperfect solutions.

Caveats

The subjects of some publications overlap with Justice Data and the CJS Dashboard. The distinction should be made clear by setting one entity as a dataset and the other as a chart/dashboard.
Some of the publications are not tagged to the document collection they should be in. For example, "Judicial diversity statistics 2019" is a lone document with no relationship to its collection, which is presumably "Judicial diversity statistics". This should be fixed upstream; in the meantime we can either just be aware of the issue or flag it with data.gov.uk.
If the lone publications become an issue for users we have the option of only ingesting publications which are part of a collection. This would run the risk of obscuring useful data for users.

Domains

Mapping the CaDeT domain model onto the publications could practically be done with a manually maintained file that maps collections to domains. There are 51 collections, so this is somewhat manageable.
Lone documents unassigned to a collection would remain unassigned.
However, some publications span our current domains, e.g. "HMPPS COVID-19 weekly data" spans Prison and Probation. Some are about subjects we don't have domains for, e.g. Legal aid statistics.
An alternative is to create a new subject area "MoJ Publications" to throw them all into. This would be unhelpful for users wanting to browse via subject area, but our existing domain model is only partly suitable to categorise the publications.
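A sketch of what the collection-to-domain lookup could look like. The collection titles come from the list in this thread, but the domain assignments shown are placeholders rather than an agreed mapping, and `domain_for` is a hypothetical helper.

```python
# Placeholder mapping: in practice this would live in a manually maintained file.
COLLECTION_DOMAINS = {
    "Prison population statistics": "Prison",
    "Probation Service workforce quarterly reports": "Probation",
}


def domain_for(collection_titles, mapping=COLLECTION_DOMAINS, fallback=None):
    """Return the domain of the first mapped collection, else the fallback.

    Lone documents (no collections) and unmapped collections get the fallback,
    which stays None unless a catch-all such as "MoJ Publications" is agreed.
    """
    for title in collection_titles:
        if title in mapping:
            return mapping[title]
    return fallback
```

Taking the first match is a simplification: a publication that spans two domains would need either a list of domains or an explicit override.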

Contacts / Custodians

Contact information is not included in the search API. Individual publications sometimes list a contact email for a member of staff, or a team email, in the HTML release. Some publications are PDFs with no contact information at all.
We could write code to scrape available HTML releases for team emails, or create another mapping file.
A team email at the collection level (propagated to publication level) feels like a reasonable middle ground for population.
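A rough sketch of the scraping option, using a deliberately simple regular expression to pick email addresses out of an HTML body. The pattern will not cover every valid address, and the preference for a collection-level contact over a scraped one is an assumption based on the middle ground suggested above. Function names are hypothetical.

```python
import re

# Simple pattern for illustration; not a full RFC 5322 matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_team_emails(html: str) -> list:
    """Return unique, lower-cased email addresses in order of appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(html):
        email = match.lower()
        if email not in seen:
            seen.append(email)
    return seen


def resolve_contact(scraped_emails: list, collection_contact=None):
    """Prefer the collection-level contact; fall back to the first scraped email."""
    if collection_contact:
        return collection_contact
    return scraped_emails[0] if scraped_emails else None
```

PDF-only releases would return no scraped emails, so they would rely entirely on the collection-level contact.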

See: How the search API works; Search API fields

Implementation

Each document collection can be an instance of DatasetContainerSubTypes.FOLDER
Each document index (mostly publication editions) can be a generic datahub table.
We can then populate metadata fields for each document:

find-moj-data field	API field
Title	title
Description	description
External link	link
Data last updated	public_timestamp
Metadata last updated	Source from DataHub
Provider	data.gov.uk
Update frequency	(This only applies to collections, where it is in the title)
Contact
Domain
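The mapping in the table could be sketched as below. It assumes search results return `link` as a path relative to https://www.gov.uk, and it guesses the update frequency from keywords in a collection title; both helper names and the keyword list are assumptions. Contact and Domain are left out because the search API does not supply them.

```python
GOV_UK = "https://www.gov.uk"
# Keywords looked for in collection titles; an assumed, incomplete list.
FREQUENCIES = ("weekly", "monthly", "quarterly", "annual")


def to_catalogue_fields(result: dict) -> dict:
    """Map one search API result to the find-moj-data fields in the table."""
    link = result["link"]
    return {
        "Title": result.get("title"),
        "Description": result.get("description"),
        # Prefix relative paths; leave absolute URLs untouched.
        "External link": GOV_UK + link if link.startswith("/") else link,
        "Data last updated": result.get("public_timestamp"),
        "Provider": "data.gov.uk",
    }


def update_frequency(collection_title: str):
    """Guess the update frequency from a collection title, or return None."""
    lowered = collection_title.lower()
    for frequency in FREQUENCIES:
        if frequency in lowered:
            return frequency
    return None
```

For example, `update_frequency("Offender management statistics quarterly")` gives `"quarterly"`, while a title with no frequency keyword gives `None`.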

search_api = "https://www.gov.uk/api/search.json"
params = {
    "filter_organisations": "ministry-of-justice",
    "filter_content_store_document_type": [
        "national_statistics",
        "official_statistics"
    ],
    "fields": [
        "description",
        "document_collections",
        "link",
        "public_timestamp",
        "title",
        "first_published_at"
    ],
    # Maximum results per API call is 1500
    "count": 1500,
    "start": 0
}

ministryofjustice / find-moj-data