pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

Filter out some fields #69

Open pkiraly opened 4 years ago

pkiraly commented 4 years ago

Radek Světlík (Education and Research Library in Pilsen, Czech Republic) wrote: "I would like to ask whether you can recommend how to deliberately omit some fields from validation."

The solution would be a new parameter, --ignore-elements, which would accept a list of tags and subfields separated by semicolons, such as

--ignore-elements "100$a;650;651;700$a;700$2"
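
A minimal sketch of how such a value might be parsed, written in Python for brevity (QA catalogue itself is written in Java). The function name parse_ignore_elements and the (tag, subfield) tuple representation are assumptions made for illustration, not part of the tool:

```python
def parse_ignore_elements(value):
    """Split a value such as '100$a;650' into (tag, subfield) pairs.

    A subfield of None means that the whole field is to be ignored.
    """
    elements = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries, e.g. a trailing semicolon
        tag, _, subfield = item.partition("$")
        elements.append((tag, subfield or None))
    return elements


ignored = parse_ignore_elements("100$a;650;651;700$a;700$2")
```

Keeping the subfield optional lets one parameter cover both whole fields (650) and single subfields (700$a).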

nichtich commented 1 year ago

We now have the --ignorableFields parameter, does this solve this issue?

pkiraly commented 1 year ago

--ignorableFields ignores the whole field, but what we need here is to ignore only subfields. There is a related request from Gent: "Can you disregard field 852 (ind1=4) from the subfield check, as you’ve done for the undefined field check?" which could fit the following pattern: ignore an element if a given condition holds.

So the best would be to rename/improve ignorableFields with two new features:

nichtich commented 1 year ago

We have the same requirement, but it's a can of worms: conditions especially can get quite complex. On the other hand, you already have a language to specify data elements and rules, and this could be reused, e.g. --ignore-elements cleanup.yaml with cleanup.yaml being like this:

format: MARC
fields:
- name: custom-fields # optional name
  path: 900 # element to remove
- path: 040$a # element to remove
  rules: # optional rules
  - id: 040$a.pattern # optional id
    pattern: ^BE-KBR00 # only remove if value matches this pattern

By the way I'd like to also get this as standalone application to filter a file.

pkiraly commented 1 year ago

yes, good idea.

nichtich commented 1 year ago

The syntax implemented in PicaFilter.java has not been documented yet, and it could be extended to a common syntax that is also used to formulate queries. The same syntax could also be used in Catmandu and pica-rs. Here is an excerpt of the documentation of pica-rs (which goes beyond this) that I contributed earlier:

The basic building blocks of filter expressions are field expressions, which consist of a field tag (e.g. 003@), an optional occurrence (e.g. /03), and a subfield filter.

A simple field tag consists of a level number (0, 1, or 2) followed by two digits and a character (A to Z or @). The dot (.) can be used as a wildcard for any character, and square brackets can be used for alternative characters (e.g. 04[45]. matches all fields starting with 044 or 045 that have no occurrence).
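
Because the dot and square brackets mean the same thing in regular expressions, tag patterns of this kind can be checked with very little code. The following Python sketch illustrates the idea; it is not how pica-rs or PicaFilter.java implement it, and the names ALLOWED and tag_matches are made up for this example. Occurrences are left out here.

```python
import re

# A tag pattern has four positions, each a digit, a letter A-Z, "@",
# the wildcard "." or a bracketed set of alternatives. Such a pattern
# is already a valid regular expression.
ALLOWED = re.compile(r"(?:[0-9A-Z@.]|\[[0-9A-Z@]+\]){4}")


def tag_matches(pattern, tag):
    """Check whether a field tag matches a tag pattern."""
    if not ALLOWED.fullmatch(pattern):
        raise ValueError(f"not a valid tag pattern: {pattern}")
    return re.fullmatch(pattern, tag) is not None


tags = ["044A", "045B", "046C", "003@"]
matching = [t for t in tags if tag_matches("04[45].", t)]
```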

Occurrence /00 and no occurrence are equivalent; /* matches all occurrences (including zero), and /01-10 matches any occurrence between /01 and /10. Exception: if the field tag starts with 2, no occurrence is read as /* instead of /00.
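
The occurrence rules can be summarised in a small function. This is a Python sketch under assumptions: the occurrence expression is passed without its leading slash, the field's own occurrence is given as an integer (0 if it has none), and the name occurrence_matches is hypothetical.

```python
def occurrence_matches(tag, spec, occurrence):
    """Check a field's occurrence against an occurrence expression.

    tag        -- field tag, e.g. "045Q"
    spec       -- expression without the slash: None, "00", "*" or "01-10"
    occurrence -- the field's occurrence as an int (0 if it has none)
    """
    if spec is None:
        # No occurrence given: level-2 fields match any occurrence,
        # all other fields only match occurrence zero.
        spec = "*" if tag.startswith("2") else "00"
    if spec == "*":
        return True
    if "-" in spec:
        low, high = spec.split("-")
        return int(low) <= occurrence <= int(high)
    return occurrence == int(spec)
```

For example, occurrence_matches("045Q", None, 3) is false, whereas the same call with a level-2 tag such as "209A" is true.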

A simple subfield filter consists of the subfield code (a single alphanumeric character, e.g. 0), a comparison operator (equal ==, not equal !=, starts with prefix =^, ends with suffix =$, regex =~/!~, in and not in), and a value enclosed in single quotes. These simple subfield expressions can be grouped in parentheses and combined with boolean connectives (e.g. (0 == 'abc' || 0 == 'def')).
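
As an illustration of the comparison operators, the following Python sketch evaluates an already parsed subfield filter against one field. It makes two assumptions that the excerpt does not state: a field is a list of (code, value) pairs, and a filter is satisfied if any subfield with the given code passes the comparison. The in and not in operators are omitted, and the names OPERATORS and subfield_matches are hypothetical.

```python
import re

# One small function per comparison operator.
OPERATORS = {
    "==": lambda value, arg: value == arg,
    "!=": lambda value, arg: value != arg,
    "=^": lambda value, arg: value.startswith(arg),
    "=$": lambda value, arg: value.endswith(arg),
    "=~": lambda value, arg: re.search(arg, value) is not None,
    "!~": lambda value, arg: re.search(arg, value) is None,
}


def subfield_matches(subfields, code, operator, arg):
    """True if any subfield with the given code satisfies the comparison."""
    compare = OPERATORS[operator]
    return any(compare(value, arg) for c, value in subfields if c == code)


field = [("0", "abc"), ("a", "Goethe")]
# Equivalent of the grouped filter (0 == 'abc' || 0 == 'def')
result = (subfield_matches(field, "0", "==", "abc")
          or subfield_matches(field, "0", "==", "def"))
```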

A special existence operator can be used to check if a given field (012A/00?) or a subfield (002@$0? or 002@.0?) exists. To test for the number of times a field or subfield exists in a record or field respectively, use the cardinality operator # with a comparison operator (e.g. #010@ > 1).
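
Existence and cardinality checks amount to simple counting. A Python sketch, assuming a record is a list of (tag, subfields) pairs and ignoring occurrences; the helper names are hypothetical:

```python
def field_exists(record, tag):
    """Counterpart of an existence check such as 012A/00?"""
    return any(t == tag for t, _ in record)


def subfield_exists(record, tag, code):
    """Counterpart of an existence check such as 002@$0?"""
    return any(t == tag and any(c == code for c, _ in subfields)
               for t, subfields in record)


def field_count(record, tag):
    """Counterpart of the cardinality operator, e.g. #010@"""
    return sum(1 for t, _ in record if t == tag)


record = [
    ("002@", [("0", "Aau")]),
    ("010@", [("a", "ger")]),
    ("010@", [("a", "eng")]),
]
more_than_one_language = field_count(record, "010@") > 1  # #010@ > 1
```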

Field expressions can be combined into complex expressions with the boolean connectives AND (&&) and OR (||). Boolean expressions can be grouped with parentheses. AND has higher precedence than OR, so A || B && C is equivalent to A || (B && C). Expressions are evaluated lazily from left to right, so given A || B, if A is true then B will not be evaluated.
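
Python's and/or operators follow the same precedence and short-circuit rules, so they can serve to illustrate the behaviour described above. In this sketch, check is a hypothetical stand-in for evaluating one field expression:

```python
# Precedence: AND binds tighter than OR.
A, B, C = True, False, False
default_grouping = A or B and C      # read as A or (B and C)
explicit_grouping = A or (B and C)
other_grouping = (A or B) and C      # a different expression

calls = []


def check(name, result):
    """Stand-in for evaluating one field expression; records each call."""
    calls.append(name)
    return result


# Lazy evaluation: A is true, so B is never evaluated.
lazy = check("A", True) or check("B", False)
evaluated = list(calls)
```

With these values the default and explicit groupings are both true while the other grouping is false, and only A is recorded as evaluated.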