pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

Add solr field with number of times a field is used in a record #342

Open nichtich opened 1 year ago

nichtich commented 1 year ago

There seems to be no way to query the Solr index for fields alone (except for fields that don't have subfields). To query:

This should be doable with an additional index field holding the number of times a record field is used in the record.

pkiraly commented 1 year ago

For the name and subject authority PICA/MARC fields there are Solr fields that concatenate all non-administrative subfields. They are called *_full_ss. To find records that include a given field, you can query this Solr field (e.g. for Allgemeine Systematik für Bibliotheken: 045B_full_ss:*). But evidently this is not what you would like to get.

Staying with this example, we can create a Solr field whose name is the name of the PICA/MARC field plus a prefix/suffix (say count or instances), and whose value is an integer. The queries would look like these:

Is it OK for you? Would you like to see another prefix/suffix or no suffix?

nichtich commented 1 year ago

I prefer short names such as 045B_i but 045B_count_i:* is ok as well.

pkiraly commented 1 year ago

Thanks! I'd add count or similar because from 045B_i my first association is that it contains a value of the field (e.g. a year, or page number) transformed into an integer.
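
For illustration, queries against such an integer count field could be built as in the sketch below. It assumes the `<field>_count_i` naming discussed above and uses standard Solr range syntax; the helper names (`count_query`, `select_url`), the base URL, and the core name `qa_catalogue` are hypothetical and not taken from the project.

```python
from urllib.parse import urlencode


def count_query(field_id, min_count=1, max_count=None):
    """Build a Solr range query on the (assumed) <field>_count_i field."""
    upper = "*" if max_count is None else str(max_count)
    return f"{field_id}_count_i:[{min_count} TO {upper}]"


def select_url(base_url, query, rows=10):
    """Build a /select URL; base_url is a hypothetical Solr core address."""
    return f"{base_url}/select?" + urlencode({"q": query, "rows": rows})


# Records in which field 045B occurs at least twice
repeated = count_query("045B", min_count=2)

# Records in which field 045B occurs exactly once
exactly_once = count_query("045B", min_count=1, max_count=1)

# Hypothetical local Solr core; adjust to the actual installation
url = select_url("http://localhost:8983/solr/qa_catalogue", repeated)
print(repeated)
print(exactly_once)
print(url)
```

A range query is used rather than `045B_count_i:*` so that the same helper covers "present at all", "repeated", and "exactly n times".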

pkiraly commented 1 year ago

It is testable. You should add the --indexFieldCounts flag, otherwise the index will not contain the counts.

It uses the field's id + _count_i as the Solr name. The id connects the same fields having different occurrences, and separates fields that share a tag but are different fields. @ and - have been transformed to _. Please suggest alternatives if you dislike this approach. Here is the result (a Solr document):

{
"id": "010531483",
...,
"001__count_i":1,
"001A_count_i":1,
"001B_count_i":1,
"001U_count_i":1,
"002__count_i":1,
"002C_count_i":1,
"002D_count_i":1,
"002E_count_i":1,
"003__count_i":1,
"003O_count_i":1,
"003S_count_i":1,
"003T_count_i":1,
"004A_count_i":1,
"006G_count_i":1,
"006U_count_i":1,
"007G_count_i":1,
"009__count_i":1,
"010__count_i":1,
"011__count_i":1,
"017G_count_i":1,
"017L_count_i":11,
"019__count_i":1,
"021A_count_i":1,
"028A_count_i":1,
"032__count_i":1,
"033A_count_i":1,
"034D_count_i":1,
"034M_count_i":1,
"044K_00_09_count_i":2,
"045E_count_i":1,
"045R_count_i":1,
"046K_count_i":1,
"046X_count_i":1,
"047A_count_i":1,
...
}
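
The naming rule described above can be sketched roughly as follows. This is not the project's implementation, only a minimal illustration that assumes a record is available as a flat list of field ids; the function names `solr_count_name` and `field_counts` and the sample record are hypothetical. Only the two replacements named in the comment (@ and - to _) are applied.

```python
from collections import Counter


def solr_count_name(field_id):
    """Derive the Solr field name: replace '@' and '-' with '_', append '_count_i'."""
    safe = field_id.replace("@", "_").replace("-", "_")
    return f"{safe}_count_i"


def field_counts(field_ids):
    """Count how often each field id occurs in one record (hypothetical input shape)."""
    counts = Counter(field_ids)
    return {solr_count_name(fid): n for fid, n in counts.items()}


# Made-up record: field 017L occurs three times, the others once
record_fields = ["001@", "001A", "017L", "017L", "017L"]
doc = field_counts(record_fields)
print(doc)
# {'001__count_i': 1, '001A_count_i': 1, '017L_count_i': 3}
```

The double underscore in `001__count_i` comes from the `@` in `001@` being replaced before the suffix is appended, which matches names such as `001__count_i` in the document above.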