programminghistorian / jekyll

Jekyll-based static site for The Programming Historian
http://programminghistorian.org

Review all the lessons which use the Old Bailey Online website #3169

Open charlottejmc opened 8 months ago

charlottejmc commented 8 months ago

I am opening a space to review the lessons which use the Old Bailey Online's website, in light of the recent changes to the Old Bailey's API.

Although the old version of the website will still be accessible until August 2024 at https://www.dhi.ac.uk/oldbaileyonline, we want to update lessons which are affected by this change, so they remain usable in the future.

I'll then open a single issue for each of the lessons which do need to be updated, and link to them below.

I have counted 10 lessons which refer to the Old Bailey more or less extensively:

  1. https://programminghistorian.org/en/lessons/from-html-to-list-of-words-2
    • EN, ES, FR, PT

Changes needed: MINOR. In the ‘Python List’ section, we need to change the URL inside a code block from http://www.oldbaileyonline.org/print.jsp?div=t17800628-33 to https://www.oldbaileyonline.org/record/t17800628-33. (I’m not sure what the print component adds to the first URL, or whether it is needed in the updated one too.) We may then need to update the list of words shown in the output below, but only if it has changed with the new URL.

  2. https://programminghistorian.org/en/lessons/preserving-your-research-data
    • EN, ES, FR, PT

Changes needed: MINOR. Where the lesson says ‘and the Old Bailey uses this format’, the format needs to be updated to reflect the current one, and the example URL should be changed as well. The current example URL, http://www.oldbaileyonline.org/browse.jsp?ref=OA16780417, doesn’t show any results on the obsolete Old Bailey site, although using name instead of ref does work (https://www.dhi.ac.uk/oldbaileyonline/browse.jsp?name=OA16780417). The corresponding URL on the new website is https://www.oldbaileyonline.org/record/OA16780417.

  3. https://programminghistorian.org/en/lessons/keywords-in-context-using-n-grams
    • EN, ES, PT

Changes needed: NONE. Although this lesson refers to the Old Bailey, it uses a file which is already available in the lesson's assets directory, so I think it can remain as is.

  4. https://programminghistorian.org/en/lessons/normalizing-data
    • EN, ES, PT

Changes needed: MINOR. The URL http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33 appears twice and needs to be changed to https://www.oldbaileyonline.org/record/t17800628-33. (Perhaps this URL would also need the &div= component? I don’t know how to recreate this in the new format.)

  5. https://programminghistorian.org/en/lessons/from-html-to-list-of-words-1
    • EN, ES, FR, PT

Changes needed: MINOR. The URL http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33 needs to be changed to https://www.oldbaileyonline.org/record/t17800628-33. (Again, perhaps it needs the &div= component?)

  6. https://programminghistorian.org/en/lessons/r-basics-with-tabular-data
    • EN, ES, FR, PT

Changes needed: NONE. This lesson teaches how to create matrices with data from the Old Bailey, but never refers directly to the site.

  7. https://programminghistorian.org/en/lessons/viewing-html-files
    • EN, ES, FR, PT

Changes needed: NONE. This lesson only shows a screenshot of the Old Bailey website and its HTML code. Although we could update the images to show the site's current look and HTML code, it’s not really necessary for the lesson.

  8. https://programminghistorian.org/en/lessons/working-with-web-pages
    • EN, ES, FR, PT

Changes needed: MAJOR. Many URLs need to be updated:
    • http://oldbaileyonline.org/static/Project.jsp -> unsure.
    • https://www.oldbaileyonline.org/search.jsp?form=searchHomePage&_divs_fulltext=arsenic&kwparse=and&_persNames_surname=&_persNames_given=&_persNames_alias=&_offences_offenceCategory_offenceSubcategory=&_verdicts_verdictCategory_verdictSubcategory=&_punishments_punishmentCategory_punishmentSubcategory=&_divs_div0Type_div1Type=&fromMonth=&fromYear=&toMonth=&toYear=&ref=&submit.x=0&submit.y=0 -> unsure. We can probably recreate it by using the Advanced Search functionality in the new website with the same parameters, though.
    • http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33 -> https://www.oldbaileyonline.org/record/t17800628-33 (Bowsey trial).

We must check that the 'little bit of HTML markup' is still correct.

Also, after ‘By studying the URL we can learn a few things’, these ‘few things’ have to be reviewed to ensure they are still correct.

[On a different note, this lesson uses Komodo Edit, which we've encountered issues with in other lessons.]

  9. https://programminghistorian.org/en/lessons/downloading-multiple-records-using-query-strings
    • EN, ES, PT

Changes needed: MAJOR

See Issue #3134

  10. https://programminghistorian.org/en/lessons/naive-bayesian
    • EN

Changes needed: MAJOR. Changes are needed from 'Downloading trials' onwards: http://www.oldbaileyonline.org/obapi/ob?term0=fromdate_18300114&term1=todate_18391216&count=10&start=211&return=zip -> unsure. Careful changes will be needed to the script which allows you to download more than 10 entries at once, and to the accompanying description. Where it says ‘a file that looks like this:’ (wget1830s.txt), I expect the file will look different now because of the changed URLs. The snippet after ‘Here’s a snippet from one trial:’ might need a slight update too, as the XML markup found on the current website for https://www.oldbaileyonline.org/record/t18300114-2 is ever so slightly different. However, I think the lesson may still work as intended. We will find out by checking whether the command:

cd ../../baileycode/
python save-trialtxts-by-category.py

still runs the script as desired. If so, then no further changes are needed after this step.
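
The record URL changes listed above appear to follow one pattern: the identifier carried in the old `div`, `id`, `name` or `ref` query parameter becomes the last path segment of the new `/record/` URL. Here is a minimal Python sketch of that mapping. It assumes the pattern holds for every record page, which has only been checked against the examples in this list, and the name `to_new_record_url` is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

NEW_RECORD_BASE = "https://www.oldbaileyonline.org/record/"

# Query parameters that carry the record identifier in the old URLs quoted above.
ID_PARAMS = ("div", "id", "name", "ref")


def to_new_record_url(old_url):
    """Return the new-style record URL for an old browse.jsp/print.jsp URL.

    Returns None when no record identifier can be found in the query string.
    """
    query = parse_qs(urlparse(old_url).query)
    for param in ID_PARAMS:
        values = query.get(param)
        if values:
            return NEW_RECORD_BASE + values[0]
    return None


print(to_new_record_url("http://www.oldbaileyonline.org/print.jsp?div=t17800628-33"))
print(to_new_record_url("http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33"))
print(to_new_record_url("http://www.oldbaileyonline.org/browse.jsp?ref=OA16780417"))
```

This only rewrites the address. It says nothing about whether the page behind the new URL can still be downloaded and parsed the way the lessons describe.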

sharonhoward commented 7 months ago

I'm really sorry I've taken so long to catch up with this. I haven't looked at specific lessons yet, but some general observations:

Unfortunately I think many changes required may be more substantial than they seem on the surface, because the site is now completely dynamic using React.js, so pages are no longer HTML that can be downloaded with, say, curl or wget. So for any lessons that involve downloading/scraping pages from the site it's not just a matter of updating URLs. Instead anything like that needs more specialised tools - I don't have any experience with this kind of thing but I gather that the main options are Selenium (which can be driven from Python) or Puppeteer (which is a Node.js library rather than a Python one).

The main change to the API, I think, is that all results are returned in JSON, whereas trials used to be XML and only certain summary/stats information was in JSON. But the XML is still returned within the JSON (in fact it looks like it returns three different formats - XML, plain text and HTML) and that should be identical to the original API's XML.
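
To illustrate the shape described here, this Python sketch pulls the embedded XML out of a JSON payload and parses it. The payload is a made-up stand-in, not a real API response: the key names `xml`, `text` and `html`, the placeholder id, and the tiny XML snippet are all assumptions for illustration and would need checking against an actual response.

```python
import json
import xml.etree.ElementTree as ET

# Made-up stand-in for an API response. The key names and the content are
# hypothetical; a real response may be structured differently.
sample_response = json.dumps({
    "xml": '<div1 type="trialAccount" id="t00000000-0"><p>Example trial text.</p></div1>',
    "text": "Example trial text.",
    "html": "<p>Example trial text.</p>",
})


def extract_trial_xml(response_body, xml_key="xml"):
    """Decode the JSON body and parse the XML string stored under xml_key."""
    record = json.loads(response_body)
    return ET.fromstring(record[xml_key])


trial = extract_trial_xml(sample_response)
print(trial.tag, trial.get("id"))
print("".join(trial.itertext()))
```

If the embedded XML really is identical to the original API's XML, a lesson's existing XML-handling code might only need a small unwrapping step like this in front of it.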

The "print.jsp" URLs were used a lot in lessons because they provided a very plain unstyled version of the pages, ideal for programmatic uses. As far as I know, there is no equivalent in the new site.

I haven't yet really looked at how search URLs have changed. I should note that the search engine is completely different now (ElasticSearch instead of MySQL) which as I understand it is much more flexible. The URLs have clearly been simplified and shortened a lot, eg https://www.oldbaileyonline.org/search/crime?offence=kill&verdict=guilty#results - instead of being split up into offenceCategory, offenceSubCategory, etc. This looks as though it may result in some labels for specific offence/verdict categories being changed - I'll need to compare to the originals to make a list.

sharonhoward commented 7 months ago

Hmm yes, new URL for a search for "killing Other": .../search/crime?offence=killOther#results

relevant bit of the old URL: &_offences_offenceCategory_offenceSubcategory=kill_other

It seems quite likely that a lot of URLs will have that sort of change (and probably the API too). Ugh.

anisa-hawes commented 7 months ago

Hello @sharonhoward,

No need to apologise -- I really appreciate your reply to my email, and I am grateful for the time you have taken to review this Issue.

It all sounds very complicated. The need for additional, specialist tools (Selenium or Puppeteer) as well as a deeper understanding of search URLs may also change the 'difficulty' / learning-level of these lessons.

I'll reply in our existing email thread, and we can continue to think about this together.

With many thanks, Anisa

sharonhoward commented 7 months ago

I'll be emailing you @anisa-hawes but I just wanted to record some key changes to search (and API query) URLs as I understand them. Previously for offences/verdicts/sentences in the URL it was always necessary to explicitly spell out category_subcategory. That's no longer necessary except in a few specific contexts.

As an example - the offence "fraud" (subcategory of deception). The relevant bit of a search URL previously looked like this:

&_offences_offenceCategory_offenceSubcategory=deception_fraud

which is replaced by the much shorter and simpler

offence=fraud

Now the top level category is only needed if

a) searching for the whole category, eg: offence=deception

b) searching for subcategories Other or NoDetail

offence=deceptionOther (which previously looked like deception_other)

offence=deceptionNoDetail (new)

"NoDetail" essentially means the same thing as Other but reflects some inconsistencies in the XML - it happens where the offence was tagged without a subcategory (which shouldn't really have happened). I think that previously there was no way to search for these separately at all.

I'm making a list of all the new category and subcategory pairs and how they map on to the original versions.
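
As a rough illustration of the rules above, here is a Python sketch that converts an old `category_subcategory` value into the new `offence=` value and builds a search URL in the new style. It is based only on the examples in this thread (fraud, deceptionOther, killOther), and it assumes that every other subcategory label carries over unchanged, which may not be true until the full mapping list exists. The function names are made up for illustration.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.oldbaileyonline.org/search/crime"


def new_offence_value(old_value):
    """Convert an old 'category_subcategory' value to the new, shorter form.

    Assumes labels are otherwise unchanged; only the examples quoted in
    this thread have been checked.
    """
    if "_" not in old_value:
        # A whole-category search keeps the category name, e.g. 'deception'.
        return old_value
    category, subcategory = old_value.split("_", 1)
    if subcategory == "other":
        # 'Other' still needs the top-level category: deception_other -> deceptionOther.
        return category + "Other"
    # Any other subcategory now stands alone: deception_fraud -> fraud.
    return subcategory


def build_search_url(**params):
    """Build a new-style search URL from keyword arguments such as offence or verdict."""
    return SEARCH_BASE + "?" + urlencode(params) + "#results"


print(new_offence_value("deception_fraud"))
print(new_offence_value("deception_other"))
print(build_search_url(offence=new_offence_value("kill_other")))
print(build_search_url(offence="kill", verdict="guilty"))
```

The new NoDetail subcategories have no old equivalent, so nothing in the old URLs maps to them.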

adamcrymble commented 6 months ago

I wrote all but 1 of these lessons, so I can advise on all of the original learning objectives.

Can I ask why the lessons haven't been flagged to readers as not working in the meantime? I thought we had a workflow to alert people when they couldn't rely on a lesson's contents.

charlottejmc commented 6 months ago

@adamcrymble, thank you for reminding us!

What do you think of this warning message (inspired by the Twitter API warning message)?

The Old Bailey Online’s website has recently been updated. Unfortunately, the various [changes](https://www.oldbaileyonline.org/about/whats-new) that were made mean that many (if not all) elements of this lesson will not work anymore. We are working on adapting the lesson to the new site, but we have no immediate solutions at the moment. [April 2024]

If you and @hawc2 are happy with this, I can work on coordinating translations into ES, FR and PT, and adding them to the lessons listed above.

adamcrymble commented 6 months ago

Thanks @charlottejmc. I think this is really up to @hawc2 as he's the editor. I'd probably say 'examples of this lesson will not work as intended, however the skills described remain relevant and may be adapted to a different example site'.

hawc2 commented 6 months ago

@charlottejmc @adamcrymble this sounds good to me. Here's a version with Adam's suggestion included, and some further edits I made. Feel free to rework further as you see fit:

The Old Bailey Online’s website has recently been updated. Unfortunately, due to the various changes, many (if not all) elements of the example website used in this lesson will not work as described. The methodologies taught by this lesson remain relevant, however, and may be adapted by readers to a different example site. We are working on adapting the lesson to the new Old Bailey Online website, but we have no clear timeline on when the lesson will be updated. [April 2024]

charlottejmc commented 6 months ago

Thank you @adamcrymble and @hawc2 for your input – it's very much appreciated! I'll work on coordinating translations in our other languages.