nationalarchives / ds-caselaw-ingester

Parse judgements from the Transformation Engine and load them into MarkLogic as part of the National Archives Find Case Law service
MIT License

Ligatures aren't normalised in PDF or HTML #145

Open edent opened 8 months ago

edent commented 8 months ago

The PDF of [2024] UKFTT 31 (TC) contains a number of instances of the "fl" ligature (U+FB02).

This is seen repeatedly in the phrase "potato flour":

(Attachment: Screencast from 17-01-24 08:49:19.webm)

I do not have access to the original DOCX, although I note the ligature is also present in the PDF judgement on the official Tribunals website.

The ligature is also present in the HTML version but not in the XML version.

I suggest that the text undergoes Unicode Normalisation before a PDF is created.

(Apologies if this isn't the correct repo. Feel free to move it somewhere more suitable.)

dragon-dxw commented 8 months ago

It looks to me as if it occurs in both the XML and the HTML -- the first instance is fine, but the second instance, in paragraph 6, is a ligature in the XML as well.
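For anyone wanting to check which instances are affected, here is a rough sketch of a scanner for the Latin ligatures in the Alphabetic Presentation Forms block (U+FB00 to U+FB06). The function name `find_ligatures` and the sample string are made up for illustration; nothing here comes from the ingester's code.

```python
import unicodedata


def find_ligatures(text):
    """Return (offset, character, Unicode name) for each Latin ligature found.

    Only U+FB00..U+FB06 are checked; this range is an assumption about
    which characters matter for this issue.
    """
    hits = []
    for offset, ch in enumerate(text):
        if "\ufb00" <= ch <= "\ufb06":
            hits.append((offset, ch, unicodedata.name(ch)))
    return hits


# Hypothetical sample: the first "flour" uses plain letters, the second uses U+FB02.
sample = "potato flour and potato \ufb02our"
for offset, ch, name in find_ligatures(sample):
    print(offset, f"U+{ord(ch):04X}", name)
```

Running something like this over the extracted text of the XML, HTML and PDF would show whether they disagree about a given paragraph.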

We shouldn't attempt to fix this with a straightforward normalisation.

image (from section 1.2 of https://unicode.org/reports/tr15/)

None of the normalisation forms handles both the "fi" ligature and superscripts correctly -- the canonical forms (NFC, NFD) do not get rid of the ligature, and the compatibility forms (NFKC, NFKD) do not preserve the superscript.

As that report itself warns: "Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text."
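The trade-off described above can be reproduced with Python's standard `unicodedata` module. This is only an illustration: the sample string is invented, and it uses the "fl" ligature from this issue together with a superscript two.

```python
import unicodedata

# Invented sample: U+FB02 (fl ligature) followed by "our x" and U+00B2 (superscript two).
sample = "\ufb02our x\u00b2"

results = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in results.items():
    # Print code points so the difference is visible even if the font hides it.
    print(form, [f"U+{ord(ch):04X}" for ch in text])
```

The canonical forms leave both U+FB02 and U+00B2 untouched, while the compatibility forms expand the ligature to "fl" but also turn the superscript into a plain "2".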

We'd need a carefully nuanced converter to make good decisions about the most popular legacy characters, and I'm wary that some ligatures like ɶ might have specific concrete meanings which would be lost by expansion. (https://caselaw.nationalarchives.gov.uk/ewca/civ/2015/541 talks about pronunciation and uses IPA, but doesn't talk about this sound.)
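As a very rough sketch of what a more targeted converter might look like, the snippet below expands only an explicit allow-list of ligatures and leaves everything else alone. The table contents and the name `expand_ligatures` are assumptions for illustration, not a proposal for what the final list should be; deciding which characters belong in it is the hard part.

```python
# Hypothetical allow-list: only the f-ligatures U+FB00..U+FB04.
# Characters such as superscripts or the IPA letter U+0276 are deliberately absent.
LATIN_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

_TRANSLATION_TABLE = str.maketrans(LATIN_LIGATURES)


def expand_ligatures(text):
    """Replace allow-listed ligatures with their component letters."""
    return text.translate(_TRANSLATION_TABLE)


# Invented sample: ligature, superscript two, and IPA U+0276.
before = "potato \ufb02our, x\u00b2, \u0276"
after = expand_ligatures(before)
print(after)
```

With this approach the superscript and the IPA character pass through unchanged, at the cost of having to maintain the list by hand.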

I don't think we'll fix this one quickly.

Thank you very much for the issue, though -- it's very good to have this in mind, particularly when considering search.