Ligatures aren't normalised in PDF or HTML

It looks to me as if it occurs in both the XML and the HTML -- the first instance is fine, but the second instance, in paragraph 6, is a ligature in the XML as well.

We shouldn't attempt to fix this with a straightforward Unicode normalisation.

(from section 1.2 of https://unicode.org/reports/tr15/)

None of the normalisation forms handles both the fi ligature and superscripts correctly -- the canonical ones (NFC, NFD) do not get rid of the ligature, and the compatibility ones (NFKC, NFKD) do not preserve the superscript.
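For anyone who wants to see this for themselves, here is a minimal sketch using Python's standard `unicodedata` module. The sample strings and the `normalise_all` helper are my own illustration, not anything taken from the ingester.

```python
import unicodedata

# Illustrative samples only (not taken from the judgment in question).
samples = {
    "fi ligature": "\ufb01",   # U+FB01 LATIN SMALL LIGATURE FI
    "superscript": "2\u2075",  # "2" followed by U+2075 SUPERSCRIPT FIVE
}

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalise_all(text):
    """Return the text under each of the four Unicode normalisation forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


for label, text in samples.items():
    for form, result in normalise_all(text).items():
        print(f"{label:12} {form:5} {text!r} -> {result!r}")
```

The canonical forms leave both samples untouched, while the compatibility forms turn the ligature into "fi" but also flatten the superscript into a plain "25".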

The same report warns: "Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text."

We'd need a carefully nuanced converter to make good decisions about the most popular legacy characters, and I'm wary that some ligatures like ɶ might have specific concrete meanings which would be lost by expansion. (https://caselaw.nationalarchives.gov.uk/ewca/civ/2015/541 talks about pronunciation and uses IPA, but doesn't talk about this sound.)
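To make that concrete, a very rough sketch of what the conservative end of such a converter might look like is below. The allow-list `LIGATURE_EXPANSIONS` and the function `expand_typographic_ligatures` are hypothetical names of mine, and which characters belong on the list is exactly the decision we'd need to make carefully; I have only included the purely typographic f-ligatures here as an assumption.

```python
# Hypothetical allow-list: only ligatures we are confident are purely
# typographic. Anything not listed (superscripts, IPA letters such as
# U+0276, and so on) is deliberately left alone.
LIGATURE_EXPANSIONS = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
}

_TRANSLATION_TABLE = str.maketrans(LIGATURE_EXPANSIONS)


def expand_typographic_ligatures(text: str) -> str:
    """Expand only the allow-listed ligatures; leave all other text as-is."""
    return text.translate(_TRANSLATION_TABLE)
```

With this, `expand_typographic_ligatures("de\ufb01ned")` gives "defined", while a superscript or an IPA character passes through unchanged. It doesn't address where in the pipeline this would run, or whether the original text should be kept alongside the expanded form for search.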

I don't think we'll fix this one quickly.

Thank you very much for the issue, though -- it's very good to have this in mind, particularly when considering search.

nationalarchives / ds-caselaw-ingester

Ligatures aren't normalised in PDF or HTML #145