sarahmonster opened this issue 4 years ago
If we're set on trying to do this algorithmically, the first step would be to start collecting some larger subsets of data that we can experiment with. I do have a definite fear that any algorithmic approach will end up overzealous, though.
Whilst it's good to keep an eye on this, I don't think it's an easy one to solve. In a sense, it's a worldwide library problem: we all work to standards, but interpret them differently! The joy(!) of aggregating data from so many varied sources is that the data will naturally not match cleanly like this. This is one of the reasons for trying to keep to quite a simple set of core fields which are hard to misinterpret and should translate well between countries / languages / material types: e.g. title and creator.
Having said that... I think one of the (many) benefits this site could bring is that, as the aggregation of data grows, it becomes a richer source that can be used for different purposes, such as cleaning or tidying up data.
So Sarah and I actually discussed this a while back. I had suggested that it might be quite nice if we supported crowdsourced data correction. We can't easily correct data algorithmically, but we can probably identify data we don't understand. If we had a separate page where users could work through that data and correct it in such a way that (provided enough people agree it's the correct interpretation) it got written back to the database, that would be... an interesting project in itself. :smile:
Ideally we'd also want the corrections to be sent back to the source library for inclusion in their catalogues, so that the next time we get a data feed the corrected data isn't overwritten.
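To make the write-back idea a bit more concrete, here's a minimal sketch of how consensus-based acceptance could work. Everything in it is hypothetical (the `FieldCorrections` class, the threshold of three matching suggestions, the record IDs); it's only meant to illustrate the flow of suggestion → agreement → write-back, not a real design.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Assumption: three matching suggestions count as "enough people agree".
AGREEMENT_THRESHOLD = 3


@dataclass
class FieldCorrections:
    """Crowdsourced suggestions for one field of one catalogue record."""
    record_id: str
    field_name: str  # e.g. "creator" or "publication"
    suggestions: Counter = field(default_factory=Counter)

    def suggest(self, value: str) -> None:
        """Record one user's proposed replacement value."""
        self.suggestions[value.strip()] += 1

    def consensus(self) -> Optional[str]:
        """Return the agreed value once any suggestion reaches the threshold."""
        if not self.suggestions:
            return None
        value, votes = self.suggestions.most_common(1)[0]
        return value if votes >= AGREEMENT_THRESHOLD else None


# Usage: only write back (and queue a report to the source library)
# once consensus is reached.
corrections = FieldCorrections(record_id="b12345", field_name="creator")
for proposal in ["Woolf, Virginia", "V. Woolf", "Woolf, Virginia", "Woolf, Virginia"]:
    corrections.suggest(proposal)

agreed = corrections.consensus()
if agreed is not None:
    print(f"write back {agreed!r} for {corrections.field_name} on {corrections.record_id}")
```

The nice side effect of a threshold like this is that it doubles as a record of which interpretations people disagreed on, which is useful information to send back to the source library too.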
I've noticed that the data we have is a bit unpredictable in terms of formatting and structure. A few examples from the searches I keep running:
Note here that the darker text indicates the author, and the lighter text indicates the publication information. Because of the slightly messy data, it's not always apparent that this is the structure here.
A few things I notice:
These are all relatively minor issues, but they do make the metadata harder to parse. Are there ways we could normalise this data? Given that we don't have full control over the data as it comes in, I'm thinking we'd need to build something of a "translation" layer on top of the existing data, so we can parse it a bit prior to outputting it. The trick, of course, is to ensure that we aren't over-correcting the problem and are only improving the readability of the text; some of these issues might be easier to fix than others.
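As a rough sketch of what that translation layer could look like: a set of display-time clean-up functions that never touch the stored data. The specific rules below (collapsing whitespace, dropping dangling punctuation, unwrapping bracketed names) are just guesses at the kinds of issues above, not a definitive list, and none of these function names exist in the codebase.

```python
import re

# Hypothetical "translation" layer: tidy raw metadata for display,
# without modifying what the source library actually sent us.

def normalise_creator(raw: str) -> str:
    """Tidy a creator string: collapse whitespace, drop trailing separators, unwrap brackets."""
    cleaned = re.sub(r"\s+", " ", raw.strip())       # collapse repeated whitespace
    cleaned = cleaned.strip(" ,;/")                  # drop trailing separators
    cleaned = re.sub(r"^\[(.*)\]$", r"\1", cleaned)  # unwrap "[Author Name]"
    return cleaned


def normalise_publication(raw: str) -> str:
    """Tidy publication info: remove dangling punctuation and stray spaces before it."""
    cleaned = re.sub(r"\s+", " ", raw.strip())
    cleaned = re.sub(r"[,:;]\s*$", "", cleaned)      # drop a dangling "," or ":"
    cleaned = re.sub(r"\s+([,:;])", r"\1", cleaned)  # no space before punctuation
    return cleaned


# The layer sits between the stored record and the template that renders it,
# so the original data is preserved and the rules can be tuned safely.
def display_fields(record: dict) -> dict:
    return {
        "creator": normalise_creator(record.get("creator", "")),
        "publication": normalise_publication(record.get("publication", "")),
    }
```

Keeping the rules this small and reversible would hopefully address the over-correction worry: we only change presentation, and we can always fall back to showing the raw field.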