ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform

Spike: explore value of scraping MOJ publications #151

Open seanprivett opened 8 months ago

seanprivett commented 8 months ago

Value

So that we can evaluate the need for cataloguing public data, we want to try scraping MOJ's data publications from GOV.UK. In the process we will gain experience of working with custom ingestion sources in DataHub.

Hypothesis

  1. If we have publications in the catalogue, data consumers will search for them and make use of them.
  2. Later down the line, when we have charts & lineage, data consumers will have more confidence in the charts if they know where the data came from.

Proposal

  1. Write a custom ingestion that can be run locally
  2. Investigate how to run that ingestion process on CP. This can be split into another ticket if it will involve significant additional work.

How to create the custom ingestion source

There is an existing repo for ingesting from MOJ APIs https://github.com/ministryofjustice/datahub-custom-api-source

We can either repurpose this or terraform a separate repo in https://github.com/ministryofjustice/data-platform/blob/main/terraform/github/data-catalogue.tf (Note: each ingestion source should be a separate Python package)

Follow https://datahubproject.io/docs/how/add-custom-ingestion-source/ to create the ingestion source. See https://github.com/ministryofjustice/datahub-custom-api-source/pull/1 for an example

Metadata to include

How to scrape the metadata

The publications we want to pull in are those listed on https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics

There are two views here:

  1. The searchable publications are rendered by the statistics finder with the organisations filter preselected to ministry-of-justice and the document type preselected to "Published statistics".
  2. Publications are grouped into document collections, e.g. Offender management statistics quarterly encompasses many releases of the same publication, e.g. https://www.gov.uk/government/statistics/offender-management-statistics-quarterly-july-to-september-2023

The statistics finder gets its data from the GOV.UK search API and renders it using GOV.UK finder frontend. We can go directly to the search API to get the same metadata (there is also an RSS feed but I don't think this will be good enough as it only includes the most recent publications).

The following URL gets everything returned by the finder:

https://www.gov.uk/api/search.json?filter_organisations=ministry-of-justice&filter_content_store_document_type=national_statistics&filter_content_store_document_type=official_statistics&filter_content_store_document_type=statistical_data_set&filter_content_store_document_type=statistics&fields=document_collections

(1030 publications)

The search API is documented at https://www.api.gov.uk/gds/gov-uk-search/#gov-uk-search

Use the pagination options documented here: https://docs.publishing.service.gov.uk/repos/search-api/using-the-search-api.html#pagination
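A minimal sketch of paging through the search API, assuming the JSON response carries a `results` list and a `total` count, and that `start` and `count` behave as the pagination docs describe. The function names and the page size of 500 are illustrative choices, not part of any existing code. The page fetcher is passed in so the loop can be exercised without a network call.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_API = "https://www.gov.uk/api/search.json"


def fetch_page(start: int, count: int) -> dict:
    """Fetch one page of results from the search API (makes a network call)."""
    query = urlencode(
        {
            "filter_organisations": "ministry-of-justice",
            "filter_content_store_document_type": [
                "national_statistics",
                "official_statistics",
                "statistical_data_set",
                "statistics",
            ],
            "fields": "document_collections",
            "start": start,
            "count": count,
        },
        doseq=True,  # repeat the key once per list value
    )
    with urlopen(f"{SEARCH_API}?{query}") as response:
        return json.load(response)


def iter_search_results(
    get_page: Callable[[int, int], dict] = fetch_page,
    page_size: int = 500,
) -> Iterator[dict]:
    """Yield every result, advancing `start` until `total` is reached."""
    start = 0
    while True:
        page = get_page(start, page_size)
        results = page.get("results", [])  # assumed response key
        yield from results
        start += len(results)
        # Stop on an empty page or once the reported total is covered.
        if not results or start >= page.get("total", 0):
            break


def fake_page(start: int, count: int) -> dict:
    """Offline stand-in for the API, using made-up links."""
    data = [{"link": f"/government/statistics/example-{i}"} for i in range(7)]
    return {"results": data[start:start + count], "total": len(data)}


links = [r["link"] for r in iter_search_results(fake_page, page_size=3)]
print(len(links))
```

Calling `iter_search_results()` with no arguments would use the real `fetch_page`; the offline `fake_page` is only there to show the loop terminating.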

For each result, the document_collections field links the publication to any document collection it belongs to.
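As a rough illustration of using that field, the sketch below splits results into those that belong to a collection and those that do not. It assumes each entry in `document_collections` is an object with a `title` key; the function name and the sample records are hypothetical and hand-written, not real API output.

```python
from collections import defaultdict


def group_by_collection(results):
    """Return ({collection title: [results]}, [results with no collection])."""
    collections = defaultdict(list)
    lone = []
    for result in results:
        parents = result.get("document_collections") or []
        if not parents:
            lone.append(result)
        for parent in parents:
            # Assumed shape: each parent is a dict with a "title" key.
            collections[parent["title"]].append(result)
    return dict(collections), lone


# Hand-written sample records for illustration only.
sample = [
    {
        "link": "/government/statistics/offender-management-statistics-quarterly-july-to-september-2023",
        "document_collections": [{"title": "Offender management statistics quarterly"}],
    },
    {"link": "/government/statistics/example-lone-publication"},
]

grouped, lone = group_by_collection(sample)
print(sorted(grouped), len(lone))
```

A publication that belongs to more than one collection would appear under each of them, which may or may not be the behaviour we want in the catalogue.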

If we need to, then each link can be looked up in the GOV.UK content API to get more granular information such as

Example content API representation for /government/statistics/offender-management-statistics-quarterly-july-to-september-2023: https://www.gov.uk/api/content/government/statistics/offender-management-statistics-quarterly-july-to-september-2023
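A small sketch of turning a search result's `link` into a content API URL, following the pattern of the example above. It assumes `link` is a site-relative path such as `/government/statistics/...`; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

CONTENT_API = "https://www.gov.uk/api/content"


def content_api_url(link: str) -> str:
    """Build the content API URL for a site-relative GOV.UK path."""
    return f"{CONTENT_API}/{link.lstrip('/')}"


def fetch_content(link: str) -> dict:
    """Fetch the content API JSON for a path (makes a network call)."""
    with urlopen(content_api_url(link)) as response:
        return json.load(response)


example = content_api_url(
    "/government/statistics/offender-management-statistics-quarterly-july-to-september-2023"
)
print(example)
```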

Note: the logic for mapping "published statistics" to a list of document types can be found here: https://github.com/alphagov/finder-frontend/blob/86632663013338ab86cbf66a39088e4adc6c852d/app/models/filters.rb#L9

Definition of done

To be discussed

jemnery commented 8 months ago

Root page for statistics: https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics

jemnery commented 8 months ago

Very thorough :+1: A couple of observations:

Is this the behaviour we want? Or do we want OMSQ to be one catalogue entry, with all of its quarterly releases listed under it as datasets? I think to do that we need to scrape this page.

MatMoore commented 8 months ago

Will bring up the question of granularity in tomorrow's standup.

I've updated the description to include details of the GOV.UK content API, which we could use in conjunction with search if we need metadata not included in the search response, i.e. anything not declared here. Anything that is rendered on the page will be exposed via that API in a JSON format.

Also included some information about the linkage between individual releases of statistical publications and their "document collections" in case we decide to catalogue the collection rather than every release.

alex-vonfeldmann commented 8 months ago

sorry, late to the game. i got a bit lost understanding what we mean by 'scraping' since that could mean any and all content, or just some. my gut reaction is that we should have 1 entry per unique publication - telling users 'this publication exists and it gets issued quarterly and it's at this address [url] and it's of statistics quality' - and ideally it should show users what datapoints/measures are being presented there - eg prison population, remand prison population, number of recalls, etc. and if users are interested in any of these they click through to the browser and take it from there.

jemnery commented 4 days ago

Question from refinement - are there any publications not listed in Justice Data?

Youth Justice stats: https://www.gov.uk/government/collections/youth-justice-annual-statistics
Prison population projections: https://www.gov.uk/government/statistics/prison-population-projections-ns
Prison education & programmes: https://www.gov.uk/government/collections/prison-education-and-accredited-programme-statistics
MAPPA: https://www.gov.uk/government/organisations/ministry-of-justice/series/multi-agency-public-protection-arrangements-mappa-annual-reports
Justice Data Lab: https://www.gov.uk/government/organisations/ministry-of-justice/series/justice-data-lab-pilot-statistics
Equality stats: https://www.gov.uk/government/collections/race-and-the-criminal-justice-system https://www.gov.uk/government/collections/women-and-the-criminal-justice-system https://www.gov.uk/government/collections/hmpps-annual-offender-equalities-report https://www.gov.uk/government/collections/hmpps-annual-staff-equalities-report
Knife crime: https://www.gov.uk/government/organisations/ministry-of-justice/series/knife-possession-sentencing-quarterly

murdo-moj commented 1 day ago

Some publications are arranged in document collections, some aren't.

To split up the files into a more digestible number, here are the collections and the documents with no collection.

Collections (51): {'Accredited programmes annual bulletin', 'Ad hoc justice statistics', 'Alcohol and drug misuse and treatment statistics', 'Antisocial behaviour', 'Civil justice statistics', 'Civil justice statistics quarterly', 'Compendium of re-offending', 'Coroners and burials statistics', 'Court statistics (quarterly)', 'Crime statistics', 'Criminal court statistics', 'Criminal justice statistics', 'Criminal justice statistics quarterly', 'Death of offenders in the community', 'Electronic Monitoring Statistics Publication', 'Ethnicity and the criminal justice system', 'Family Court Statistics Quarterly', 'Freedom of Information statistics', 'Gender Recognition Certificate statistics', 'HM Prison and Probation Service COVID-19 statistics monthly', 'HM Prison and Probation Service workforce statistics', 'HMPPS COVID-19 weekly data', 'HMPPS annual offender equalities report', 'HMPPS annual staff equalities report', 'Hate crime statistics', 'Judicial and court statistics', 'Judicial diversity statistics', 'Justice Data Lab statistics', 'Knife and offensive weapon sentencing statistics', 'Legal aid statistics', 'Legal aid statistics data files', 'Local adult reoffending', 'Mortgage and landlord possession statistics', 'Multi-agency public protection arrangements (MAPPA) annual report', 'New criminal offences statistics', 'Offender management statistics quarterly', 'Payment by results statistics', 'Prison and Probation Performance Statistics', 'Prison population statistics', 'Prisons and probation statistics', 'Probation Service workforce quarterly reports', 'Proven reoffending statistics', 'Restricted Patients Statistics, England and Wales', 'Safety in custody statistics', 'Statistics on privacy injunctions', 'Statistics on public disorder of 6-9 August 2011', 'Topical criminal justice publications', 'Tribunals statistics', 'Use of language interpreter and translation services in courts and tribunals statistics', 'Women and the criminal justice system', 'Youth Justice Statistics'}

Lone indexes (73): ['Ministry of justice', 'Tribunals statistics quarterly april to june 2024', 'Mortgage and landlord possession statistics april to june 2024', 'Hmpps annual digest april 2023 to march 2024', 'Prison education and accredited programme statistics 2023 to 2024', 'Diversity of the judiciary 2024 statistics', 'Hmpps offender equalities annual report 2022 to 2023', 'Estimates of children with a parent in prison', 'Prison population projections 2023 to 2028', 'Unpaid work management information', 'Prison education and accredited programme statistics 2022 to 2023', 'Offender accommodation outcomes update to march 2024', 'Prison education statistics 2019 2020', 'Justice in numbers summary tables', 'Offender employment outcomes update to march 2024', 'Prison population projections 2022 to 2027', 'Story of the prison population 1993 to 2020', 'Annual hm prison and probation service digest 2017 to 2018', 'Hmpps annual digest april 2020 to march 2021', 'Ministry of justice statistics policy and procedures', 'Prison education and accredited programme statistics 2021 2022', 'Prison population projections 2021 to 2026', 'Statistics of mentally disordered offenders ns', 'Licence recalls and returns to custody', 'Story of the adjudications 2011 to 2018', 'Community rehabilitation companies workforce information report quarter 3 2014 to 2015', 'Judicial and court statistics 2006', 'Prison population projections 2020 to 2026', 'Ad hoc accommodation following release from custody', 'Average time from arrest to sentence for persistent young offenders ns', 'Conviction histories statistics and data', 'Motoring offences and breath test statistics ns', 'Statistics policy and procedures 2', 'Community performance quarterly mi update to june 2016', 'Judicial diversity statistics 2019', 'Judicial selection and recommendations for appointment statistics april 2016 to march 2017', 'Review on quality in probation supervision', 'Coroners and burials statistics and data eighth summary', 'Time intervals for criminal proceedings in magistrates courts ns', 'Information rights tracker surveys', 'Statistics archive', 'Burial grounds the results of a survey of burial grounds in england and wales', 'Company winding up and bankruptcy petition statistics', 'End of custody licence releases and recalls statistics', 'Local variations in sentencing in england and wales', 'Her majestys courts service court user survey', 'Multi agency public protection arrangements annual report', 'Topics for comment', 'Ninth summary of coroners reports to prevent future deaths', 'Interim re conviction figures for the peterborough and doncaster payment by results pilots', 'Exceptional case funding in legal aid statistics', 'Statistical notice criminal legal aid statistics oct 2012 to sep 2013', 'Electronic monitoring statistics publication december 2023', 'Ad hoc statistical release covering exceptional case funding in legal aid statistics', 'Intention to publish further breakdowns of reoffences by type of reoffence', 'Future publication legal aid statistics 2013 to 2014', 'Future publication exceptional case funding statistics april to june 2014', 'Future publication legal aid statistics quarterly from april to june 2014', 'Future publication legal aid statistics quarterly july to september 2014 incorporating barrister fee payments from public sources in 2013 to 2014', 'Ad hoc community payback on rapid deployment projects', 'Judicial diversity statistics 2016', 'Multi agency public protection arrangements mappa annual 2023 to 2024', 'Changes to moj statistics case progression', 'Judicial diversity statistics 2018', 'Judicial selection and recommendations for appointment statistics april 2017 to march 2018', 'Judicial diversity statistics 2017 2', 'Crown court sentencing survey annual publication january to december 2014', 'Judicial selection and recommendations for appointment statistics april 2015 to march 2016', 'Mmpr data collection april to september 2014', 'Intention to publish re conviction results for pbr pilots', 'Crown court sentencing survey annual publication january to december 2013', 'Employment rates following release from custody ad hoc', 'Judicial diversity statistics 2015']

murdo-moj commented 1 day ago

This is my parameter slug:

url = "https://www.gov.uk/api/search.json"
params = {
    "filter_organisations": "ministry-of-justice",
    "filter_content_store_document_type": [
        "national_statistics",
        "official_statistics",
        # These options don't add anything useful
        # "statistical_data_set",
        # "statistics"
    ],
    "fields": "document_collections",
    # Maximum results per API call is 1500
    "count": 1500,
    "start": 0
}
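
For reference, a short sketch of how a parameter dictionary like the one above turns into a query string. The list value has to become a repeated key, which `urllib.parse.urlencode` does when `doseq=True` is set (as far as I know, `requests.get(url, params=params)` encodes list values the same way). The variable `query_url` is an illustrative name.

```python
from urllib.parse import urlencode

url = "https://www.gov.uk/api/search.json"
params = {
    "filter_organisations": "ministry-of-justice",
    "filter_content_store_document_type": [
        "national_statistics",
        "official_statistics",
    ],
    "fields": "document_collections",
    "count": 1500,
    "start": 0,
}

# doseq=True emits one key=value pair per list element instead of
# encoding the Python list's string representation.
query_url = f"{url}?{urlencode(params, doseq=True)}"
print(query_url)
```
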
murdo-moj commented 1 day ago

Note CJS stats are in this list, which we already ingest from the CJS dashboard.

murdo-moj commented 10 hours ago

There's a lot of good data we do not have - we should ingest all of these publications to DataHub. Contacts and Domains for these items have imperfect solutions.

Caveats

Domains

Contacts / Custodians

How the search API works

Search API fields

Implementation

| find-moj-data field | API field |
| --- | --- |
| Title | title |
| Description | description |
| External link | link |
| Data last updated | public_timestamp |
| Metadata last updated | Source from DataHub |
| Provider | data.gov.uk |
| Update frequency | (This only applies to collections, where it is in the title) |
| Contact | |
| Domain | |

search_api = "https://www.gov.uk/api/search.json"
params = {
    "filter_organisations": "ministry-of-justice",
    "filter_content_store_document_type": [
        "national_statistics",
        "official_statistics"
    ],
    "fields": [
        "description",
        "document_collections",
        "link",
        "public_timestamp",
        "title",
        "first_published_at"
    ],
    # Maximum results per API call is 1500
    "count": 1500,
    "start": 0
}
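
A minimal sketch of applying the field mapping above to a single search result. It assumes `link` comes back as a site-relative path that needs the `https://www.gov.uk` prefix to be usable as an external link; the function name, output keys, and sample record are hypothetical, and Contact, Domain, and update frequency are left out because the thread has no settled source for them.

```python
GOV_UK = "https://www.gov.uk"


def to_catalogue_entry(result: dict) -> dict:
    """Map one search API result onto the find-moj-data fields."""
    link = result.get("link", "")
    # Assumption: relative links need the GOV.UK host prepended.
    external_link = GOV_UK + link if link.startswith("/") else link
    return {
        "title": result.get("title"),
        "description": result.get("description"),
        "external_link": external_link,
        "data_last_updated": result.get("public_timestamp"),
        "provider": "data.gov.uk",
    }


# Hand-written sample record for illustration only.
sample_result = {
    "title": "Example publication",
    "description": "An illustrative description.",
    "link": "/government/statistics/example-publication",
    "public_timestamp": "2024-01-01T09:30:00Z",
}

entry = to_catalogue_entry(sample_result)
print(entry["external_link"])
```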