sarahmonster opened this issue 4 years ago
If we're set on trying to do this algorithmically, the first step would be to start collecting some larger subsets of data that we can experiment with. I do have a definite fear that any algorithmic approach will end up overzealous, though.
Whilst it's good to keep an eye on this, I don't think it's an easy one to solve. In a sense, it's a worldwide library problem: we all work to standards, but interpret them differently! The joy(!) of aggregating data from so many varied sources is that the data will naturally not match cleanly like this. This is one of the reasons for trying to keep to quite a simple set of core fields which are hard to misinterpret and should translate well between countries / languages / material types: e.g. title and creator.
Having said that... I think one of the (many) benefits this site could bring is that, as the aggregation of data grows, it becomes a richer source that can be used for different purposes, such as cleaning or tidying up data.
So Sarah and I actually discussed this a while back. I had suggested that it might be quite nice if we supported crowdsourced data correction. We can't easily correct data algorithmically, but we can probably identify data we don't understand. If we had a separate page where users could work through that data and correct it in such a way that (provided enough people agree it's the correct interpretation) it got written back to the database, that would be... an interesting project in itself. :smile:
Ideally we'd also want the corrections to be sent back to the source library for inclusion in their catalogues, so that the next time we get a data feed the corrected data isn't overwritten.
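To make the write-back idea a bit more concrete, here's a minimal sketch of how consensus-based acceptance could work. Everything in it is hypothetical (the `FieldCorrections` class, the threshold of three matching suggestions, the record IDs); it's only meant to illustrate the flow of suggestion → agreement → write-back, not a real design.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Assumption: three matching suggestions count as "enough people agree".
AGREEMENT_THRESHOLD = 3


@dataclass
class FieldCorrections:
    """Crowdsourced suggestions for one field of one catalogue record."""
    record_id: str
    field_name: str  # e.g. "creator" or "publication"
    suggestions: Counter = field(default_factory=Counter)

    def suggest(self, value: str) -> None:
        """Record one user's proposed replacement value."""
        self.suggestions[value.strip()] += 1

    def consensus(self) -> Optional[str]:
        """Return the agreed value once any suggestion reaches the threshold."""
        if not self.suggestions:
            return None
        value, votes = self.suggestions.most_common(1)[0]
        return value if votes >= AGREEMENT_THRESHOLD else None


# Usage: only write back (and queue a report to the source library)
# once consensus is reached.
corrections = FieldCorrections(record_id="b12345", field_name="creator")
for proposal in ["Woolf, Virginia", "V. Woolf", "Woolf, Virginia", "Woolf, Virginia"]:
    corrections.suggest(proposal)

agreed = corrections.consensus()
if agreed is not None:
    print(f"write back {agreed!r} for {corrections.field_name} on {corrections.record_id}")
```

The nice side effect of a threshold like this is that it doubles as a record of which interpretations people disagreed on, which is useful information to send back to the source library too.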
I've noticed that the data we have is a bit unpredictable in terms of formatting and structure. A few examples from the searches I keep running:
Note here that the darker text indicates the author, and the lighter text indicates the publication information. Because of the slightly messy data, it's not always apparent that this is the structure here.
A few things I notice:
These are all relatively minor issues, but they do make the metadata harder to parse. Are there ways we could normalise this data? Given that we don't have full control over the data as it comes in, I'm thinking we'd need to build something of a "translation" layer on top of the existing data, so we can parse it a bit prior to outputting it. The trick, of course, is to ensure that we aren't over-correcting the problem and are only improving the readability of the text; some of these issues might be easier to fix than others.
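As a rough sketch of what that translation layer could look like: a set of display-time clean-up functions that never touch the stored data. The specific rules below (collapsing whitespace, dropping dangling punctuation, unwrapping bracketed names) are just guesses at the kinds of issues above, not a definitive list, and none of these function names exist in the codebase.

```python
import re

# Hypothetical "translation" layer: tidy raw metadata for display,
# without modifying what the source library actually sent us.

def normalise_creator(raw: str) -> str:
    """Tidy a creator string: collapse whitespace, drop trailing separators, unwrap brackets."""
    cleaned = re.sub(r"\s+", " ", raw.strip())       # collapse repeated whitespace
    cleaned = cleaned.strip(" ,;/")                  # drop trailing separators
    cleaned = re.sub(r"^\[(.*)\]$", r"\1", cleaned)  # unwrap "[Author Name]"
    return cleaned


def normalise_publication(raw: str) -> str:
    """Tidy publication info: remove dangling punctuation and stray spaces before it."""
    cleaned = re.sub(r"\s+", " ", raw.strip())
    cleaned = re.sub(r"[,:;]\s*$", "", cleaned)      # drop a dangling "," or ":"
    cleaned = re.sub(r"\s+([,:;])", r"\1", cleaned)  # no space before punctuation
    return cleaned


# The layer sits between the stored record and the template that renders it,
# so the original data is preserved and the rules can be tuned safely.
def display_fields(record: dict) -> dict:
    return {
        "creator": normalise_creator(record.get("creator", "")),
        "publication": normalise_publication(record.get("publication", "")),
    }
```

Keeping the rules this small and reversible would hopefully address the over-correction worry: we only change presentation, and we can always fall back to showing the raw field.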