pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

Detecting harmful words in subject indexing #293

Open pkiraly opened 1 year ago

pkiraly commented 1 year ago

This suggestion comes from KBR:

More and more institutions are talking about the ‘harmful words’ in their catalogues, and about how, when, and where to clean them up.

In fact there are two cases:

  • Harmful words in titles of publications, for example a book with the title ‘the dancing negro’: here we will not change the title, but maybe add a disclaimer (for what it’s worth).
  • Harmful words in subject indexing (6XX): for example, we recently changed the subject term ‘negro art’ to ‘ethnic art’ (or something like that).

For the latter, it would be useful if we could ‘upload’ a CSV with harmful words (negro, gypsy, roma, Indians, Eskimo, ‘zwarte piet’ (Dutch), etc.) and have the tool analyse the whole catalogue and return a CSV of the IDNs of the records where those harmful words appear (in the title or in subject indexing (6XX)), maybe together with the harmful words detected. Alternatively, it could return a list of the harmful words detected (like the list of errors we now get from MARC21 validation), each with a CSV of the IDNs. We could then use that list to correct our records.

It is maybe more complicated than that, because some words are not harmful in context A but are harmful in context B.
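The workflow described above (a CSV of terms in, a CSV of record identifiers and matched terms out) could be sketched roughly as follows. This is only an illustration and not part of QA catalogue: the record structure (a dict with an `id` and a `fields` map from tag to values), the function names, and the default tag prefixes (`245` for titles, `6` for 6XX subject fields) are all assumptions made for the example, and the terms used are neutral placeholders. It matches whole words case-insensitively and does not address the context problem mentioned above.

```python
import csv
import io
import re


def load_terms(csv_text):
    """Read one term per row from CSV text; blank rows are skipped."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in reader if row and row[0].strip()]


def find_terms(records, terms, tag_prefixes=("245", "6")):
    """Return (record id, tag, term) for every term found in a checked field.

    `records` is a hypothetical in-memory structure, not a QA catalogue API:
    [{"id": "...", "fields": {"245": ["..."], "650": ["..."]}}, ...]
    """
    # One case-insensitive pattern per term; the lookarounds make sure the
    # term is not part of a longer word.
    patterns = {
        term: re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in terms
    }
    hits = []
    for record in records:
        for tag, values in record["fields"].items():
            # Only look at the selected fields (titles and subject headings).
            if not tag.startswith(tuple(tag_prefixes)):
                continue
            for value in values:
                for term, pattern in patterns.items():
                    if pattern.search(value):
                        hits.append((record["id"], tag, term))
    return hits


def hits_to_csv(hits):
    """Serialise the hits as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "tag", "term"])
    writer.writerows(hits)
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder terms and records, for illustration only.
    terms = load_terms("oldword\nold phrase\n")
    records = [
        {"id": "001", "fields": {"245": ["A study of Oldword customs"],
                                 "650": ["Old phrase art"]}},
        {"id": "002", "fields": {"245": ["Oldwords revisited"],
                                 "500": ["oldword in a note"]}},
    ]
    print(hits_to_csv(find_terms(records, terms)), end="")
```

Reporting the tag alongside the identifier makes it possible to treat title hits (where a disclaimer may be added) differently from subject-indexing hits (where the term may be replaced).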

nichtich commented 1 year ago

This sounds like a specific use case of the more generic issue full text search in selected fields. It requires:

pkiraly commented 1 year ago

Some components are already available:

So we have two options.

Right now fielded term search is not possible in the web interface (only fielded phrase search is). Supporting it requires not just a user interface change, but also a change to how we create the index.