Open seanprivett opened 8 months ago
Root page for statistics: https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics
Very thorough :+1: A couple of observations:
Is this the behaviour we want? Or do we want OMSQ to be one catalogue entry, with all of its quarterly releases listed under it as datasets? I think to do that we need to scrape this page.
Will bring up the question of granularity in tomorrow's standup.
I've updated the description to include details of the GOV.UK content API, which we could use in conjunction with search if we need metadata not included in the search response, i.e. anything not declared here. Anything that is rendered on the page will be exposed via that API in JSON format.
Also included some information about the linkage between individual releases of statistical publications and their "document collections" in case we decide to catalogue the collection rather than every release.
sorry, late to the game. i got a bit lost understanding what we mean by 'scraping', since that could mean any and all content, or just some. my gut reaction is that we should have 1 entry per unique publication, telling users 'this publication exists, it gets issued quarterly, it's at this address [url] and it's of statistics quality'. ideally it should also show users what datapoints/measures are presented there, e.g. prison population, remand prison population, number of recalls, etc. if users are interested in any of these they click through in the browser and take it from there.
Question from refinement - are there any publications not listed in Justice Data?
Youth Justice stats: https://www.gov.uk/government/collections/youth-justice-annual-statistics
Prison population projections: https://www.gov.uk/government/statistics/prison-population-projections-ns
Prison education & programmes: https://www.gov.uk/government/collections/prison-education-and-accredited-programme-statistics
MAPPA: https://www.gov.uk/government/organisations/ministry-of-justice/series/multi-agency-public-protection-arrangements-mappa-annual-reports
Justice Data Lab: https://www.gov.uk/government/organisations/ministry-of-justice/series/justice-data-lab-pilot-statistics
Equality stats:
https://www.gov.uk/government/collections/race-and-the-criminal-justice-system
https://www.gov.uk/government/collections/women-and-the-criminal-justice-system
https://www.gov.uk/government/collections/hmpps-annual-offender-equalities-report
https://www.gov.uk/government/collections/hmpps-annual-staff-equalities-report
Knife crime: https://www.gov.uk/government/organisations/ministry-of-justice/series/knife-possession-sentencing-quarterly
Some publications are arranged in document collections, some aren't.
To break the results down into a more digestible number, here are the collections and the documents with no collection.
Collections (51): {'Accredited programmes annual bulletin', 'Ad hoc justice statistics', 'Alcohol and drug misuse and treatment statistics', 'Antisocial behaviour', 'Civil justice statistics', 'Civil justice statistics quarterly', 'Compendium of re-offending', 'Coroners and burials statistics', 'Court statistics (quarterly)', 'Crime statistics', 'Criminal court statistics', 'Criminal justice statistics', 'Criminal justice statistics quarterly', 'Death of offenders in the community', 'Electronic Monitoring Statistics Publication', 'Ethnicity and the criminal justice system', 'Family Court Statistics Quarterly', 'Freedom of Information statistics', 'Gender Recognition Certificate statistics', 'HM Prison and Probation Service COVID-19 statistics monthly', 'HM Prison and Probation Service workforce statistics', 'HMPPS COVID-19 weekly data', 'HMPPS annual offender equalities report', 'HMPPS annual staff equalities report', 'Hate crime statistics', 'Judicial and court statistics', 'Judicial diversity statistics', 'Justice Data Lab statistics', 'Knife and offensive weapon sentencing statistics', 'Legal aid statistics', 'Legal aid statistics data files', 'Local adult reoffending', 'Mortgage and landlord possession statistics', 'Multi-agency public protection arrangements (MAPPA) annual report', 'New criminal offences statistics', 'Offender management statistics quarterly', 'Payment by results statistics', 'Prison and Probation Performance Statistics', 'Prison population statistics', 'Prisons and probation statistics', 'Probation Service workforce quarterly reports', 'Proven reoffending statistics', 'Restricted Patients Statistics, England and Wales', 'Safety in custody statistics', 'Statistics on privacy injunctions', 'Statistics on public disorder of 6-9 August 2011', 'Topical criminal justice publications', 'Tribunals statistics', 'Use of language interpreter and translation services in courts and tribunals statistics', 'Women and the criminal justice system', 'Youth Justice Statistics'}
Lone indexes (73): ['Ministry of justice', 'Tribunals statistics quarterly april to june 2024', 'Mortgage and landlord possession statistics april to june 2024', 'Hmpps annual digest april 2023 to march 2024', 'Prison education and accredited programme statistics 2023 to 2024', 'Diversity of the judiciary 2024 statistics', 'Hmpps offender equalities annual report 2022 to 2023', 'Estimates of children with a parent in prison', 'Prison population projections 2023 to 2028', 'Unpaid work management information', 'Prison education and accredited programme statistics 2022 to 2023', 'Offender accommodation outcomes update to march 2024', 'Prison education statistics 2019 2020', 'Justice in numbers summary tables', 'Offender employment outcomes update to march 2024', 'Prison population projections 2022 to 2027', 'Story of the prison population 1993 to 2020', 'Annual hm prison and probation service digest 2017 to 2018', 'Hmpps annual digest april 2020 to march 2021', 'Ministry of justice statistics policy and procedures', 'Prison education and accredited programme statistics 2021 2022', 'Prison population projections 2021 to 2026', 'Statistics of mentally disordered offenders ns', 'Licence recalls and returns to custody', 'Story of the adjudications 2011 to 2018', 'Community rehabilitation companies workforce information report quarter 3 2014 to 2015', 'Judicial and court statistics 2006', 'Prison population projections 2020 to 2026', 'Ad hoc accommodation following release from custody', 'Average time from arrest to sentence for persistent young offenders ns', 'Conviction histories statistics and data', 'Motoring offences and breath test statistics ns', 'Statistics policy and procedures 2', 'Community performance quarterly mi update to june 2016', 'Judicial diversity statistics 2019', 'Judicial selection and recommendations for appointment statistics april 2016 to march 2017', 'Review on quality in probation supervision', 'Coroners and burials statistics and data eighth summary', 'Time intervals for criminal proceedings in magistrates courts ns', 'Information rights tracker surveys', 'Statistics archive', 'Burial grounds the results of a survey of burial grounds in england and wales', 'Company winding up and bankruptcy petition statistics', 'End of custody licence releases and recalls statistics', 'Local variations in sentencing in england and wales', 'Her majestys courts service court user survey', 'Multi agency public protection arrangements annual report', 'Topics for comment', 'Ninth summary of coroners reports to prevent future deaths', 'Interim re conviction figures for the peterborough and doncaster payment by results pilots', 'Exceptional case funding in legal aid statistics', 'Statistical notice criminal legal aid statistics oct 2012 to sep 2013', 'Electronic monitoring statistics publication december 2023', 'Ad hoc statistical release covering exceptional case funding in legal aid statistics', 'Intention to publish further breakdowns of reoffences by type of reoffence', 'Future publication legal aid statistics 2013 to 2014', 'Future publication exceptional case funding statistics april to june 2014', 'Future publication legal aid statistics quarterly from april to june 2014', 'Future publication legal aid statistics quarterly july to september 2014 incorporating barrister fee payments from public sources in 2013 to 2014', 'Ad hoc community payback on rapid deployment projects', 'Judicial diversity statistics 2016', 'Multi agency public protection arrangements mappa annual 2023 to 2024', 'Changes to moj statistics case progression', 'Judicial diversity statistics 2018', 'Judicial selection and recommendations for appointment statistics april 2017 to march 2018', 'Judicial diversity statistics 2017 2', 'Crown court sentencing survey annual publication january to december 2014', 'Judicial selection and recommendations for appointment statistics april 2015 to march 2016', 'Mmpr data collection april to september 2014', 'Intention to publish re conviction results for pbr pilots', 'Crown court sentencing survey annual publication january to december 2013', 'Employment rates following release from custody ad hoc', 'Judicial diversity statistics 2015']
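For anyone wanting to reproduce the split above, here is a minimal sketch of one way it could be done. It is a guess rather than the code that produced these lists: it assumes each search result carries a `link` and that each entry in `document_collections` is a dict with a `title` key. The function name `split_by_collection` is made up for illustration.

```python
from typing import Any


def slug_to_label(link: str) -> str:
    """Turn '/government/statistics/some-slug' into 'Some slug'."""
    slug = link.rstrip("/").rsplit("/", 1)[-1]
    return slug.replace("-", " ").capitalize()


def split_by_collection(results: list[dict[str, Any]]) -> tuple[set[str], list[str]]:
    """Return (collection titles, labels of documents with no collection)."""
    collections: set[str] = set()
    lone: list[str] = []
    for result in results:
        # Assumption: document_collections is a list of dicts with a "title" key.
        parents = result.get("document_collections") or []
        if parents:
            collections.update(parent["title"] for parent in parents)
        else:
            # Assumption: every result has a site-relative "link".
            lone.append(slug_to_label(result.get("link", "")))
    return collections, lone
```

The lone labels above look like URL slugs with hyphens replaced by spaces, which is why the sketch derives them from `link` rather than from `title`.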
This is my parameter slug:
url = "https://www.gov.uk/api/search.json"
params = {
    "filter_organisations": "ministry-of-justice",
    "filter_content_store_document_type": [
        "national_statistics",
        "official_statistics",
        # These options don't add anything useful
        # "statistical_data_set",
        # "statistics"
    ],
    "fields": "document_collections",
    # Maximum results per API call is 1500
    "count": 1500,
    "start": 0
}
Note CJS stats are in this list, which we already ingest from the CJS dashboard.
There's a lot of good data we do not have - we should ingest all of these publications to DataHub. Contacts and Domains for these items have imperfect solutions.
How the search API works Search API fields
DatasetContainerSubTypes.FOLDER
find-moj-data field | API field |
---|---|
Title | title |
Description | description |
External link | link |
Data last updated | public_timestamp |
Metadata last updated | Source from DataHub |
Provider | data.gov.uk |
Update frequency | (This only applies to collections, where it is in the title) |
Contact | |
Domain | |
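A rough sketch of how the mapping in the table could look in code. The function names, the output keys and the frequency keyword list are all hypothetical, and it assumes search results carry site-relative links. Contact, Domain, Provider and "Metadata last updated" are left out because the table gives no API field for them.

```python
import re
from typing import Any, Optional

GOV_UK = "https://www.gov.uk"

# Hypothetical keyword list for spotting a frequency in a collection title.
FREQUENCY_PATTERN = re.compile(r"\b(weekly|monthly|quarterly|annual)\b", re.IGNORECASE)


def update_frequency(collection_title: str) -> Optional[str]:
    """Return the frequency word found in a collection title, if any."""
    match = FREQUENCY_PATTERN.search(collection_title)
    return match.group(1).lower() if match else None


def to_catalogue_fields(result: dict[str, Any]) -> dict[str, Optional[str]]:
    """Map one search API result onto the find-moj-data fields in the table."""
    link = result.get("link", "")
    return {
        "title": result.get("title"),
        "description": result.get("description"),
        # Assumption: links are site-relative, so prefix the GOV.UK host.
        "external_link": GOV_UK + link if link.startswith("/") else link,
        "data_last_updated": result.get("public_timestamp"),
    }
```

Titles with no frequency word, such as "Tribunals statistics", simply get `None` for update frequency.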
search_api = "https://www.gov.uk/api/search.json"
params = {
    "filter_organisations": "ministry-of-justice",
    "filter_content_store_document_type": [
        "national_statistics",
        "official_statistics"
    ],
    "fields": [
        "description",
        "document_collections",
        "link",
        "public_timestamp",
        "title",
        "first_published_at"
    ],
    # Maximum results per API call is 1500
    "count": 1500,
    "start": 0
}
Value
To evaluate the need for cataloguing public data, we want to try scraping MOJ's data publications from GOV.UK. In the process we will gain experience of working with custom ingestion sources in DataHub.
Hypothesis
Proposal
How to create the custom ingestion source
There is an existing repo for ingesting from MOJ APIs https://github.com/ministryofjustice/datahub-custom-api-source
We can either repurpose this or terraform a separate repo in https://github.com/ministryofjustice/data-platform/blob/main/terraform/github/data-catalogue.tf (Note: each ingestion source should be a separate Python package.)
Follow https://datahubproject.io/docs/how/add-custom-ingestion-source/ to create the ingestion source. See https://github.com/ministryofjustice/datahub-custom-api-source/pull/1 for an example
Metadata to include
{ condition: EQUAL, field: "typeNames", values:"publication"}
How to scrape the metadata
The publications we want to pull in are those listed on https://www.gov.uk/government/organisations/ministry-of-justice/about/statistics
There are two views here. One is the statistics finder, with the organisations filter preselected to ministry-of-justice and the document type preselected to "Published statistics".
The statistics finder gets its data from the GOV.UK search API and renders it using GOV.UK finder frontend. We can go directly to the search API to get the same metadata. (There is also an RSS feed, but I don't think this will be good enough as it only includes the most recent publications.)
The following URL gets everything returned by the finder:
https://www.gov.uk/api/search.json?filter_organisations=ministry-of-justice&filter_content_store_document_type=national_statistics&filter_content_store_document_type=official_statistics&filter_content_store_document_type=statistical_data_set&filter_content_store_document_type=statistics&fields=document_collections
(1030 publications)
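As a small illustration, the URL above can be rebuilt from a params dict using only the standard library. `build_search_url` is a made-up helper name. `doseq=True` makes `urlencode` repeat the key once per list item, which is the form the URL above uses for the document type filter.

```python
from urllib.parse import urlencode

SEARCH_API = "https://www.gov.uk/api/search.json"


def build_search_url(params: dict) -> str:
    """Encode params as a query string, repeating keys for list values."""
    return f"{SEARCH_API}?{urlencode(params, doseq=True)}"


finder_params = {
    "filter_organisations": "ministry-of-justice",
    "filter_content_store_document_type": [
        "national_statistics",
        "official_statistics",
        "statistical_data_set",
        "statistics",
    ],
    "fields": "document_collections",
}

finder_url = build_search_url(finder_params)
```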
The search API is documented at https://www.api.gov.uk/gds/gov-uk-search/#gov-uk-search
Use the pagination options documented here: https://docs.publishing.service.gov.uk/repos/search-api/using-the-search-api.html#pagination
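A minimal pagination sketch, assuming the response has a `results` list and a `total` count and that paging is driven by `start` and `count`. The helper names and the default page size are illustrative. The fetch function is injectable so the loop can be exercised without hitting GOV.UK.

```python
import json
from typing import Any, Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_API = "https://www.gov.uk/api/search.json"


def fetch_json(url: str) -> dict[str, Any]:
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_results(
    params: dict[str, Any],
    fetch: Callable[[str], dict[str, Any]] = fetch_json,
    page_size: int = 500,
) -> Iterator[dict[str, Any]]:
    """Yield every search result, requesting one page at a time."""
    start = 0
    while True:
        page_params = {**params, "count": page_size, "start": start}
        page = fetch(f"{SEARCH_API}?{urlencode(page_params, doseq=True)}")
        results = page.get("results", [])
        yield from results
        start += len(results)
        # Stop on an empty page or once we have seen everything reported in "total".
        if not results or start >= page.get("total", 0):
            break
```

With roughly a thousand publications a single call with a large `count` would also work, but paging keeps the source safe if the list grows.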
For each result, the document_collections field links the publication to any document collection it belongs to. If we need to, each link can be looked up in the GOV.UK content API to get more granular information.
Example content API representation for /government/statistics/offender-management-statistics-quarterly-july-to-september-2023: https://www.gov.uk/api/content/government/statistics/offender-management-statistics-quarterly-july-to-september-2023
Note: the logic for mapping "published statistics" to a list of document types can be found here: https://github.com/alphagov/finder-frontend/blob/86632663013338ab86cbf66a39088e4adc6c852d/app/models/filters.rb#L9
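A sketch of the content API lookup, under stated assumptions. The URL pattern follows the example above. The `links.document_collections` structure with a `base_path` per entry is an assumption about the response shape and should be checked against a real response. Both function names are hypothetical.

```python
from typing import Any

CONTENT_API = "https://www.gov.uk/api/content"


def content_api_url(base_path: str) -> str:
    """Build the content API URL for a GOV.UK page path."""
    return CONTENT_API + "/" + base_path.lstrip("/")


def parent_collections(content_item: dict[str, Any]) -> list[str]:
    """Return the base paths of the document collections a release belongs to."""
    # Assumption: collections are listed under links.document_collections.
    linked = content_item.get("links", {}).get("document_collections", [])
    return [item["base_path"] for item in linked if "base_path" in item]
```

If we decide to catalogue collections rather than every release, `parent_collections` gives the key to group releases by.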
Definition of done
To be discussed