Open pkiraly opened 4 years ago
We now have the `--ignorableFields` parameter; does this solve this issue?
`--ignorableFields` ignores the whole field, but what we need here is to ignore only subfields.
There is a relevant request from Gent: "Can you disregard field 852 (ind1=4) from the subfield check, as you’ve done for the undefined field check?" which could fit the following pattern: ignoring
So the best would be to rename/improve ignorableFields with two new features:
We have the same requirement, but it's a can of worms: conditions in particular can get quite complex. On the other hand, you already have a language to specify data elements and rules, so this could be reused, e.g. `--ignore-elements cleanup.yaml`, with `cleanup.yaml` looking like this:
    format: MARC
    fields:
      - name: custom-fields       # optional name
        path: 900                 # element to remove
      - path: 040$a               # element to remove
        rules:                    # optional rules
          - id: 040$a.pattern     # optional id
            pattern: ^BE-KBR00    # only remove if value matches this pattern
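
A minimal sketch of how such a configuration might be applied. It assumes a record is modelled as a plain list of `(tag, subfield code, value)` triples and that the YAML above has already been loaded into a dictionary. The names `parse_path`, `is_ignored` and `remove_ignored` are hypothetical and not part of QA catalogue, and `pattern` is treated here as a regular expression searched within the subfield value.

```python
import re

# Hypothetical in-memory form of cleanup.yaml after parsing.
CONFIG = {
    "format": "MARC",
    "fields": [
        {"name": "custom-fields", "path": "900"},
        {"path": "040$a",
         "rules": [{"id": "040$a.pattern", "pattern": "^BE-KBR00"}]},
    ],
}


def parse_path(path):
    """Split '040$a' into ('040', 'a'); a bare tag gives ('900', None)."""
    # str() because a YAML loader would read an unquoted 900 as an integer.
    tag, _, code = str(path).partition("$")
    return tag, (code or None)


def is_ignored(element, config):
    """Return True if the (tag, code, value) element matches any entry."""
    tag, code, value = element
    for entry in config["fields"]:
        entry_tag, entry_code = parse_path(entry["path"])
        if tag != entry_tag:
            continue
        if entry_code is not None and code != entry_code:
            continue
        rules = entry.get("rules", [])
        # Without rules the element is always ignored; with rules,
        # every pattern has to match the value (an assumption).
        if all(re.search(rule["pattern"], value) for rule in rules):
            return True
    return False


def remove_ignored(record, config):
    """Return a copy of the record without the ignored elements."""
    return [element for element in record if not is_ignored(element, config)]


record = [
    ("040", "a", "BE-KBR00"),
    ("040", "a", "DLC"),
    ("245", "a", "A title"),
    ("900", "x", "local note"),
]
print(remove_ignored(record, CONFIG))
# [('040', 'a', 'DLC'), ('245', 'a', 'A title')]
```

Keeping the decision in a single predicate (`is_ignored`) would let the same logic serve both the validator and a standalone filter.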
By the way, I'd also like to get this as a standalone application to filter a file.
yes, good idea.
The syntax implemented in `PicaFilter.java` has not been documented yet, and it could be extended to a common syntax that is also used to formulate queries. The same syntax could also be used in Catmandu and pica-rs. Here is an excerpt of the documentation of pica-rs (which goes beyond this) that I contributed earlier:
The basic building blocks of filter expressions are field expressions, which consist of a field tag (e.g. `003@`), an optional occurrence (e.g. `/03`), and a subfield filter.

A simple field tag consists of a level number (`0`, `1`, or `2`) followed by two digits and a character (`A` to `Z` and `@`). The dot (`.`) can be used as a wildcard for any character, and square brackets can be used for alternative characters (e.g. `04[45].` matches all fields starting with `044` or `045` but no occurrence).

Occurrence `/00` and no occurrence are equivalent, `/*` matches all occurrences (including zero), and `/01-10` matches any occurrence between `/01` and `/10`. Exception: if the field tag starts with `2`, no occurrence is read as `/*` instead of `/00`.

A simple subfield filter consists of the subfield code (a single alphanumeric character, e.g. `0`), a comparison operator (equal `==`, not equal `!=`, starts with prefix `=^`, ends with suffix `=$`, regex `=~`/`!~`, `in` and `not in`), and a value enclosed in single quotes. These simple subfield expressions can be grouped in parentheses and combined with boolean connectives (e.g. `(0 == 'abc' || 0 == 'def')`).

A special existence operator can be used to check if a given field (`012A/00?`) or a subfield (`002@$0?` or `002@.0?`) exists. To test for the number of times a field or subfield exists in a record or field respectively, use the cardinality operator `#` with a comparison operator (e.g. `#010@ > 1`).

Field expressions can be combined into complex expressions by the boolean connectives AND (`&&`) and OR (`||`). Boolean expressions can be grouped with parentheses. Precedence of AND is higher than OR, so `A || B && C` is equivalent to `A || (B && C)`. Expressions are evaluated lazily from left to right, so given `A || B`, if `A` is true then `B` will not be evaluated.
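
The tag and occurrence rules quoted above can be sketched in a few lines. This is an illustration only, not the pica-rs or `PicaFilter.java` implementation: the function names are hypothetical, the range `/01-10` is assumed to be inclusive, and the level-2 exception is decided from the tag as written in the expression.

```python
import re


def tag_matches(pattern, tag):
    """'.' and '[..]' in a tag pattern already behave like regex syntax."""
    return re.fullmatch(pattern, tag) is not None


def occurrence_matches(spec, occurrence, pattern):
    """spec is None, '*', 'NN' or 'NN-MM'; occurrence is None or 'NN'."""
    if spec is None:
        # Exception from the text: tags starting with 2 default to '/*'.
        spec = "*" if pattern.startswith("2") else "00"
    if spec == "*":
        return True
    value = int(occurrence or "00")  # no occurrence is the same as /00
    low, _, high = spec.partition("-")
    return int(low) <= value <= int(high or low)


def field_matches(expr, tag, occurrence=None):
    """Match expressions such as '003@', '04[45].' or '045Q/01-10'."""
    pattern, _, spec = expr.partition("/")
    return tag_matches(pattern, tag) and occurrence_matches(
        spec or None, occurrence, pattern)


print(field_matches("04[45].", "044A"))           # True
print(field_matches("04[45].", "044A", "01"))     # False
print(field_matches("045Q/01-10", "045Q", "05"))  # True
print(field_matches("209A", "209A", "03"))        # True
```

Subfield filters, the existence operator and the cardinality operator are left out of this sketch.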
Radek Světlík (Education and Research Library in Pilsen, Czech Republic) wrote: "I would like to ask whether you can recommend how to deliberately omit some fields from validation".
The solution would be a new parameter called `--ignore-elements`, which would accept a list of tags and subfields separated by a colon, such as