QA catalogue is a set of software packages for bibliographic record quality assessment. It reads MARC or PICA files (in different formats), analyses several quality dimensions, and saves the results into CSV files. These CSV files can be used in different contexts; we provide a lightweight, web-based user interface for them. Some of the functionalities are also available as a web service, so the validation can be built into a cataloguing/quality assessment workflow.
Screenshot from the web UI of the QA catalogue
See INSTALL.md
for dependencies.
wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.6.0/metadata-qa-marc-0.6.0-release.zip
unzip metadata-qa-marc-0.6.0-release.zip
cd metadata-qa-marc-0.6.0/
Either use the script qa-catalogue
or create configuration files:
cp setdir.sh.template setdir.sh
Change the input and output base directories in `setdir.sh`. Local directories `input/` and `output/` will be used by default. Files of each catalogue are in a subdirectory of these base directories:
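For illustration, with two hypothetical catalogues named `loc` and `gent`, the layout could be created like this (the catalogue names are examples, not requirements):

```shell
# One subdirectory per catalogue under each base directory
mkdir -p input/loc input/gent output/loc output/gent
find input output -type d | sort
```
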
More detailed instructions on how to use qa-catalogue with Docker can be found in the wiki.
A Docker image bundling qa-catalogue with all of its dependencies and the web interface qa-catalogue-web is made available:
continuously via GitHub as ghcr.io/pkiraly/qa-catalogue
and for releases via Docker Hub as pkiraly/metadata-qa-marc
To download, configure and start an image in a new container the file docker-compose.yml is needed in the current directory. It can be configured with the following environment variables:
* `IMAGE`: which Docker image to download and run. By default the latest image from Docker Hub is used (`pkiraly/metadata-qa-marc`). Alternatives include:
  * `IMAGE=ghcr.io/pkiraly/qa-catalogue:main` for the most recent image from GitHub Packages
  * `IMAGE=metadata-qa-marc` if you have built the Docker image locally
* `CONTAINER`: the name of the Docker container. Default: `metadata-qa-marc`.
* `INPUT`: base directory to put your bibliographic record files in subdirectories of. Set to `./input` by default, so record files are expected to be in `input/$NAME`.
* `OUTPUT`: base directory to put the results of qa-catalogue in subdirectories of. Set to `./output` by default, so files are put in `output/$NAME`.
* `WEBCONFIG`: directory to expose the configuration of qa-catalogue-web. Set to `./web-config` by default. If you use a non-default configuration for data analysis (for instance PICA instead of MARC), you likely need to adjust the configuration of the web interface as well. This directory should contain a configuration file `configuration.cnf`.
* `WEBPORT`: port to expose the web interface on. For instance `WEBPORT=9000` will make it available at http://localhost:9000/ instead of http://localhost/.
* `SOLRPORT`: port to expose Solr on. Default: `8983`.
Environment variables can be set on the command line or put in a local file `.env`, e.g.:
WEBPORT=9000 docker compose up -d
or
docker compose --env-file config.env up -d
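For illustration, a `.env` file combining several of the variables listed above might look like this (the values are examples, not requirements):

```shell
# .env -- picked up automatically by `docker compose up -d`
IMAGE=ghcr.io/pkiraly/qa-catalogue:main
CONTAINER=metadata-qa-marc
INPUT=./input
OUTPUT=./output
WEBCONFIG=./web-config
WEBPORT=9000
SOLRPORT=8983
```
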
When the application has been started this way, run analyses with the script `./docker/qa-catalogue` the same way as the script `./qa-catalogue` is called when not using Docker (see usage for details). The following example uses parameters for the Gent University Library catalogue:
./docker/qa-catalogue \
--params "--marcVersion GENT --alephseq" \
--mask "rug01.export" \
--catalogue gent \
all
Now you can reach the web interface (qa-catalogue-web) at http://localhost:80/ (or at another port as configured with the environment variable `WEBPORT`). To further modify the appearance of the interface, create templates in your `WEBCONFIG` directory and/or create a file `configuration.cnf` in this directory to extend the UI configuration without having to restart the Docker container.
This example works under Linux. Windows users should consult the Docker on Windows wiki page. Other useful Docker commands can be found in QA catalogue's wiki.
Everything else should work the same way as in other environments, so follow the next sections.
catalogues/[abbreviation-of-your-library].sh all-analyses
catalogues/[abbreviation-of-your-library].sh all-solr
For a catalogue with around 1 million records the first command will take 5-10 minutes, the latter 1-2 hours.
Prerequisites: Java 11 (I use OpenJDK), and Maven 3
git clone https://github.com/pkiraly/metadata-qa-api.git
cd metadata-qa-api
mvn clean install
cd ..
git clone https://github.com/pkiraly/metadata-qa-marc.git
cd metadata-qa-marc
mvn clean install
The released versions of the software are available from the Maven Central repository. The stable release (currently 0.6.0) is available from all Maven repos, while the developer version (*-SNAPSHOT) is available only from the [Sonatype Maven repository](https://oss.sonatype.org/content/repositories/snapshots/de/gwdg/metadataqa/metadata-qa-marc/0.5.0/). What you need to select is the file `metadata-qa-marc-0.6.0-jar-with-dependencies.jar`.
Be aware that no automation exists for creating a current developer version as a nightly build, so there is a chance that the latest features are not available in this version. If you want to use the latest version, build it yourself.
Since the jar file doesn't contain the helper scripts, you might also consider downloading them from this GitHub repository:
wget https://raw.githubusercontent.com/pkiraly/metadata-qa-marc/master/common-script
wget https://raw.githubusercontent.com/pkiraly/metadata-qa-marc/master/validator
wget https://raw.githubusercontent.com/pkiraly/metadata-qa-marc/master/formatter
wget https://raw.githubusercontent.com/pkiraly/metadata-qa-marc/master/tt-completeness
You should adjust common-script
to point to the jar file you just downloaded.
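As a sketch of that adjustment, assuming the jar location is set in a line starting with `JAR=` inside `common-script` (inspect the actual script first; here we edit a stand-in file so the example is safely runnable):

```shell
# Stand-in for the real common-script; the real file has more content
printf 'JAR=target/metadata-qa-marc-0.7.0-jar-with-dependencies.jar\n' > common-script.demo
# Point the JAR variable at the downloaded jar
sed -i 's|^JAR=.*|JAR=./metadata-qa-marc-0.6.0-jar-with-dependencies.jar|' common-script.demo
grep '^JAR=' common-script.demo
```
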
The tool comes with some bash helper scripts to run all these with default values. The generic scripts are located in the root directory, and library-specific configuration scripts exist in the `catalogues` directory. You can find predefined scripts for several library catalogues (if you want to run one, first you have to configure it). All these scripts mainly contain configuration and then call the central `common-script`, which contains the functions.
To run an analysis, call either
catalogues/[your script] [command(s)]
or
./qa-catalogue --params="[options]" [command(s)]
The following commands are supported:

* `validate` -- runs validation
* `completeness` -- runs completeness analysis
* `classifications` -- runs classification analysis
* `authorities` -- runs authorities analysis
* `tt-completeness` -- runs Thompson-Traill completeness analysis
* `shelf-ready-completeness` -- runs shelf-ready completeness analysis
* `serial-score` -- calculates the serial scores
* `format` -- runs formatting records
* `functional-analysis` -- runs functional analysis
* `pareto` -- runs pareto analysis
* `marc-history` -- generates cataloguing history chart
* `prepare-solr` -- prepares the Solr index (you should already have Solr running and the index created)
* `index` -- runs indexing with Solr
* `sqlite` -- imports tables to SQLite
* `export-schema-files` -- exports schema files
* `all-analyses` -- runs all default analysis tasks
* `all-solr` -- runs all indexing tasks
* `all` -- runs all tasks
* `config` -- shows the configuration of the selected catalogue

You can find information about these functionalities below in this document.
create the configuration file (setdir.sh)
cp setdir.sh.template setdir.sh
edit the configuration file. Two lines are important here:
BASE_INPUT_DIR=your/path
BASE_OUTPUT_DIR=your/path
* `BASE_INPUT_DIR` is the parent directory where your MARC records exist
* `BASE_OUTPUT_DIR` is where the analysis results will be stored

Here is an example file for analysing the Library of Congress' MARC records:
#!/usr/bin/env bash
. ./setdir.sh
NAME=loc
MARC_DIR=${BASE_INPUT_DIR}/loc/marc
MASK=*.mrc
. ./common-script
Three variables are important here:

* `NAME` is a name for the output directory. The analysis results will land under the `$BASE_OUTPUT_DIR/$NAME` directory.
* `MARC_DIR` is the location of the MARC files. All the files should be in the same directory.
* `MASK` is a file mask, such as `*.mrc`, `*.marc` or `*.dat.gz`. Files ending with `.gz` are uncompressed automatically.

You can add any other parameters this document mentions in the descriptions of the individual commands, wrapped in the `TYPE_PARAMS` variable; e.g. in the Deutsche Nationalbibliothek's config file, one can find this:
TYPE_PARAMS="--marcVersion DNB --marcxml"
This line sets the DNB's MARC version (to cover fields defined within DNB's MARC version), and XML as input format.
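Put together, a hypothetical catalogues/dnb.sh would look much like the loc example above, with `TYPE_PARAMS` added (the file mask and directory names here are illustrative, not taken from the real repository):

```shell
#!/usr/bin/env bash
# Hypothetical catalogues/dnb.sh -- configuration only; common-script does the work
. ./setdir.sh
NAME=dnb
MARC_DIR=${BASE_INPUT_DIR}/dnb/marc
MASK=*.mrc.xml.gz
TYPE_PARAMS="--marcVersion DNB --marcxml"
. ./common-script
```
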
The following table summarizes the configuration variables. The script `qa-catalogue` can be used to set variables and execute analyses without a library-specific configuration file:

| variable | qa-catalogue | description | default |
|---|---|---|---|
| ANALYSES | `-a`/`--analyses` | which tasks to run with `all-analyses` | validate, validate_sqlite, completeness, completeness_sqlite, classifications, authorities, tt_completeness, shelf_ready_completeness, serial_score, functional_analysis, pareto, marc_history |
| CATALOGUE | `-c`/`--catalogue` | display name of the catalogue | $NAME |
| NAME | `-n`/`--name` | name of the catalogue | qa-catalogue |
| BASE_INPUT_DIR | `-d`/`--input` | parent directory of input file directories | ./input |
| INPUT_DIR | `-d`/`--input-dir` | subdirectory of input directory to read files from | |
| BASE_OUTPUT_DIR | `-o`/`--output` | parent output directory | ./output |
| MASK | `-m`/`--mask` | a file mask selecting which input files to process, e.g. `*.mrc` | `*` |
| TYPE_PARAMS | `-p`/`--params` | parameters to pass to individual tasks (see below) | |
| SCHEMA | `-s`/`--schema` | record schema | MARC21 |
| UPDATE | `-u`/`--update` | optional date of input files | |
| VERSION | `-v`/`--version` | optional version number/date of the catalogue to compare changes | |
| WEB_CONFIG | `-w`/`--web-config` | update the specified configuration file of qa-catalogue-web | |
| | `-f`/`--env-file` | configuration file to load environment variables from (default: `.env`) | |
We will use the same jar file in every command, so we save its path into a variable.
export JAR=target/metadata-qa-marc-0.7.0-jar-with-dependencies.jar
Most of the analyses use the following general parameters:

* `--schemaType <type>`: metadata schema type. The supported types are:
  * `MARC21`
  * `PICA`
  * `UNIMARC` (assessment of UNIMARC records is not yet supported; this parameter value is only reserved for future use)
* `-m <version>`, `--marcVersion <version>`: specifies a MARC version. Currently, the supported versions are:
  * `MARC21`, Library of Congress MARC21
  * `DNB`, the Deutsche Nationalbibliothek's MARC version
  * `OCLC`, the OCLCMARC
  * `GENT`, fields available in the catalogue of Gent University (Belgium)
  * `SZTE`, fields available in the catalogue of Szegedi Tudományegyetem (Hungary)
  * `FENNICA`, fields available in the Fennica catalogue of the Finnish National Library
  * `NKCR`, fields available at the National Library of the Czech Republic
  * `BL`, fields available at the British Library
  * `MARC21NO`, fields available in the MARC21 profile for Norwegian public libraries
  * `UVA`, fields available at the University of Amsterdam Library
  * `B3KAT`, fields available in B3Kat, the union catalogue of the Bibliotheksverbund Bayern (BVB) and the Kooperativer Bibliotheksverbund Berlin-Brandenburg (KOBV)
  * `KBR`, fields available at KBR, the national library of Belgium
  * `ZB`, fields available at the Zentralbibliothek Zürich
  * `OGYK`, fields available at the Országgyűlési Könyvtár, Budapest
* `-n`, `--nolog`: do not display log messages
* `-i [record ID]`, `--id [record ID]`: validates only a single record having the specified identifier (the content of 001)
* `-l [number]`, `--limit [number]`: validates only the given number of records
* `-o [number]`, `--offset [number]`: starts validation at the given Nth record
* `-z [list of tags]`, `--ignorableFields [list of tags]`: do NOT validate the selected fields. The list should contain the tags separated by commas (`,`), e.g. `--ignorableFields A02,AQN`
* `-v [selector]`, `--ignorableRecords [selector]`: do NOT validate the records which match the condition denoted by the selector. The selector is a test MARCspec string, e.g. `--ignorableRecords STA$a=SUPPRESSED`. It ignores records which have an `STA` field with an `a` subfield with the value `SUPPRESSED`.
* `-d [record type]`, `--defaultRecordType [record type]`: the default record type to be used if the record's type is undetectable. The record type is calculated from the combination of Leader/06 (type of record) and Leader/07 (bibliographic level); however, sometimes the combination doesn't fit the standard. In this case the tool will use the given record type. Possible values of the record type argument:
* `-q`, `--fixAlephseq`: sometimes ALEPH export contains '^' characters instead of spaces in control fields (006, 007, 008). This flag replaces them with spaces before the validation. It might occur in any input format.
* `-a`, `--fixAlma`: sometimes Alma export contains '#' characters instead of spaces in control fields (006, 007, 008). This flag replaces them with spaces before the validation. It might occur in any input format.
* `-b`, `--fixKbr`: KBR's export contains '#' characters instead of spaces in control fields (006, 007, 008). This flag replaces them with spaces before the validation. It might occur in any input format.
* `-f <format>`, `--marcFormat <format>`: the input format. Possible values are:
  * `ISO`: binary (ISO 2709)
  * `XML`: MARCXML (shortcuts: `-x`, `--marcxml`)
  * `ALEPHSEQ`: Alephseq (shortcuts: `-p`, `--alephseq`)
  * `LINE_SEPARATED`: line separated binary MARC where each line contains one record (shortcuts: `-y`, `--lineSeparated`)
  * `MARC_LINE`: MARC Line is a line-separated format, i.e. a text file where each line is a distinct field, the same way as MARC records are usually displayed in the MARC21 standard documentation.
  * `MARCMAKER`: MARCMaker format
  * `PICA_PLAIN`: PICA plain (https://format.gbv.de/pica/plain), a serialization format that contains each field in a distinct row.
  * `PICA_NORMALIZED`: normalized PICA (https://format.gbv.de/pica/normalized), a serialization format where each line is a separate record (terminated by bytecode `0A`). Fields are terminated by bytecode `1E`, and subfields are introduced by bytecode `1F`.
* `-t <directory>`, `--outputDir <directory>`: specifies the output directory where the files will be created
* `-r`, `--trimId`: remove spaces from the end of record IDs in the output files (some library systems add padding spaces around field value 001 in exported files)
* `-g <encoding>`, `--defaultEncoding <encoding>`: specify a default encoding of the records. Possible values:
  * `ISO-8859-1` or `ISO8859_1` or `ISO_8859_1`
  * `UTF8` or `UTF-8`
  * `MARC-8` or `MARC8`
* `-s <datasource>`, `--dataSource <datasource>`: specify the type of data source. Possible values:
  * `FILE`: reading from file
  * `STREAM`: reading from a Java data stream. It is not usable from the command line, only if you use the tool via its API.
* `-c <configuration>`, `--allowableRecords <configuration>`: if set, criteria which allow analysis of records. If a record does not meet the criteria, it will be excluded. An individual criterium should be formed as a MarcSpec (for MARC21 records) or PicaFilter (for PICA records). Multiple criteria might be concatenated with logical operators: `&&` for AND, `||` for OR and `!` for NOT. One can use parentheses to group logical expressions. An example: `'002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)'`. Since the criteria might form a complex phrase containing spaces, which is problematic to pass between multiple scripts, one can apply Base64 encoding. In this case add a `base64:` prefix to the parameter, such as `base64:"$(echo '002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)' | base64 -w 0)"`.
* `-1 <type>`, `--alephseqLineType <type>`: the Alephseq line type. The type could be:
  * `WITH_L`: the record's Alephseq lines contain an `L` string (e.g. `000000002 008   L 780804s1977^^^^enk||||||b||||001^0|eng||`)
  * `WITHOUT_L`: the record's Alephseq lines do not contain an `L` string (e.g. `000000002 008   780804s1977^^^^enk||||||b||||001^0|eng||`)
* `-2 <path>`, `--picaIdField <path>`: the record identifier subfield of PICA records. Default is `003@$0`.
* `-u <char>`, `--picaSubfieldSeparator <char>`: the PICA subfield separator. Default is `$`.
* `-j <file>`, `--picaSchemaFile <file>`: an Avram schema file, which describes the structure of PICA records
* `-k <path>`, `--picaRecordType <path>`: the PICA subfield which stores the record type information. Default is `002@$0`.
* `-e <path>`, `--groupBy <path>`: group the results by the value of this data element (e.g. the ILN of libraries holding the item). An example: `--groupBy 001@$0`, where `001@$0` is the subfield containing the comma separated list of library ILN codes.
* `-3 <file>`, `--groupListFile <file>`: the file which contains a list of ILN codes

The last argument of the commands is a list of files. It might contain any wildcard the operating system supports ('*', '?', etc.).
It validates each record against the MARC21 standard, including the locally defined fields which are selected by the MARC version parameter.
The issues are classified into the following categories: record, control field, data field, indicator, subfield and their subtypes.
There is some uncertainty in issue detection. Almost all library catalogues have fields which are not part of the MARC standard, nor of their documentation about locally defined fields (these documents are rarely publicly available, and even when they are, they sometimes do not cover all fields). So when the tool meets an undefined field, it is impossible to decide whether it is valid or invalid in that particular context. In some places the tool reflects this uncertainty and provides two calculations: one which treats these fields as errors, and another which treats them as valid fields.
The tool detects the following issues:
| machine name | explanation |
|---|---|
| **record level issues** | |
| undetectableType | the document type is not detectable |
| invalidLinkage | the linkage in field 880 is invalid |
| ambiguousLinkage | the linkage in field 880 is ambiguous |
| **control field position issues** | |
| obsoleteControlPosition | the code in the position is obsolete (it was valid in a previous version of MARC, but it is not valid now) |
| controlValueContainsInvalidCode | the code in the position is invalid |
| invalidValue | the position value is invalid |
| **data field issues** | |
| missingSubfield | missing reference subfield (880$6) |
| nonrepeatableField | repetition of a non-repeatable field |
| undefinedField | the field is not defined in the specified MARC version(s) |
| **indicator issues** | |
| obsoleteIndicator | the indicator value is obsolete (it was valid in a previous version of MARC, but not in the current version) |
| nonEmptyIndicator | an indicator that should be empty is non-empty |
| invalidValue | the indicator value is invalid |
| **subfield issues** | |
| undefinedSubfield | the subfield is undefined in the specified MARC version(s) |
| invalidLength | the length of the value is invalid |
| invalidReference | the reference to the classification vocabulary is invalid |
| patternMismatch | the content does not match the patterns specified by the standard |
| nonrepeatableSubfield | repetition of a non-repeatable subfield |
| invalidISBN | invalid ISBN value |
| invalidISSN | invalid ISSN value |
| unparsableContent | the value of the subfield is not well-formed according to its specification |
| nullCode | null subfield code |
| invalidValue | invalid subfield value |
Usage:
java -cp $JAR de.gwdg.metadataqa.marc.cli.Validator [options] <file>
or with a bash script
./validator [options] <file>
or
catalogues/<catalogue>.sh validate
or
./qa-catalogue --params="[options]" validate
options:

* `-S`, `--summary`: creates a summary report instead of record level reports
* `-H`, `--details`: provides record level details of the issues
* `-G <file>`, `--summaryFileName <file>`: the name of the summary report the program produces. The file provides a summary of issues, such as the number of instances and the number of records having the particular issue.
* `-F <file>`, `--detailsFileName <file>`: the name of the report the program produces. Default is `validation-report.txt`. If you use "stdout", it won't create a file, but puts the results into the standard output.
* `-R <format>`, `--format <format>`: format specification of the output. Possible values:
  * `text` (default)
  * `tab-separated` or `tsv`
  * `comma-separated` or `csv`
* `-W`, `--emptyLargeCollectors`: the output files are created during the process and not only at the end of it. This helps with memory management if the input is large and has lots of errors; on the other hand the output file will be segmented, which should be handled after the process.
* `-T`, `--collectAllErrors`: collect all errors (useful only when validating a small number of records). Default is turned off.
* `-I <types>`, `--ignorableIssueTypes <types>`: comma separated list of issue types not to collect. The valid values are (for details see the issue types table):
  * `undetectableType`: undetectable type
  * `invalidLinkage`: invalid linkage
  * `ambiguousLinkage`: ambiguous linkage
  * `obsoleteControlPosition`: obsolete code
  * `controlValueContainsInvalidCode`: invalid code
  * `invalidValue`: invalid value
  * `missingSubfield`: missing reference subfield (880$6)
  * `nonrepeatableField`: repetition of non-repeatable field
  * `undefinedField`: undefined field
  * `obsoleteIndicator`: obsolete value
  * `nonEmptyIndicator`: non-empty indicator
  * `invalidValue`: invalid value
  * `undefinedSubfield`: undefined subfield
  * `invalidLength`: invalid length
  * `invalidReference`: invalid classification reference
  * `patternMismatch`: content does not match any patterns
  * `nonrepeatableSubfield`: repetition of non-repeatable subfield
  * `invalidISBN`: invalid ISBN
  * `invalidISSN`: invalid ISSN
  * `unparsableContent`: content is not well-formatted
  * `nullCode`: null subfield code
  * `invalidValue`: invalid value

Outputs:
* `count.csv`: the count of bibliographic records in the source dataset
total
1192536
* `issue-by-category.csv`: the counts of issues by category. Columns:
  * `id`: the identifier of the error category
  * `category`: the name of the category
  * `instances`: the number of instances of errors within the category (one record might have multiple instances of the same error)
  * `records`: the number of records having at least one of the errors within the category

id,category,instances,records
2,control field,994241,313960
3,data field,12,12
4,indicator,5990,5041
5,subfield,571,555
* `issue-by-type.csv`: the count of issues by type (subcategories).

id,categoryId,category,type,instances,records
5,2,control field,"invalid code",951,541
6,2,control field,"invalid value",993290,313733
8,3,data field,"repetition of non-repeatable field",12,12
10,4,indicator,"obsolete value",1,1
11,4,indicator,"non-empty indicator",33,32
12,4,indicator,"invalid value",5956,5018
13,5,subfield,"undefined subfield",48,48
14,5,subfield,"invalid length",2,2
15,5,subfield,"invalid classification reference",2,2
16,5,subfield,"content does not match any patterns",286,275
17,5,subfield,"repetition of non-repeatable subfield",123,120
18,5,subfield,"invalid ISBN",5,3
19,5,subfield,"invalid ISSN",105,105
* `issue-summary.csv`: details of individual issues including basic statistics

id,MarcPath,categoryId,typeId,type,message,url,instances,records
53,008/33-34 (008map33),2,5,invalid code,'b' in 'b ',https://www.loc.gov/marc/bibliographic/bd008p.html,1,1
70,008/00-05 (008all00),2,5,invalid code,Invalid content: '2023 '. Text '2023 ' could not be parsed at index 4,https://www.loc.gov/marc/bibliographic/bd008a.html,1,1
28,008/22-23 (008map22),2,6,invalid value,| ,https://www.loc.gov/marc/bibliographic/bd008p.html,12,12
19,008/31 (008book31),2,6,invalid value, ,https://www.loc.gov/marc/bibliographic/bd008b.html,1,1
17,008/29 (008book29),2,6,invalid value, ,https://www.loc.gov/marc/bibliographic/bd008b.html,1,1
* `issue-details.csv`: list of issues by record identifier. It has two columns: the record identifier, and a complex string which contains the number of occurrences of each individual issue, concatenated by semicolons.

recordId,errors
99117335059205508,1:2;2:1;3:1
99117335059305508,1:1
99117335059405508,2:2
99117335059505508,3:1
1:2;2:1;3:1 means that 3 different types of issues occurred in the record: the first issue, which has issue ID 1, occurred twice; issue ID 2 occurred once; and issue ID 3 occurred once. The issue IDs can be resolved from the issue-summary.csv file's first column.
* `issue-details-normalized.csv`: the normalized version of the previous file

id,errorId,instances
99117335059205508,1,2
99117335059205508,2,1
99117335059205508,3,1
99117335059305508,1,1
99117335059405508,2,2
99117335059505508,3,1
* `issue-total.csv`: the number of issue-free records and the number of records having issues

type,instances,records
0,0,251
1,1711,848
2,413,275
where types are
0: records without errors
1: records with any kinds of errors
2: records with errors excluding invalid field errors
* `issue-collector.csv`: a non-normalized file of record IDs per issue. This is the "inverse" of issue-details.csv: it tells you in which records a particular issue occurred.
errorId,recordIds
1,99117329355705508;99117328948305508;99117334968905508;99117335067705508;99117335176005508;...
* `validation.params.json`: the list of the actual parameters used during the run of the validation

An example with parameters used for analysing a PICA dataset. When the input is a complex expression, it is displayed here in a parsed format. It also contains some metadata such as the versions of the MQAF API and QA catalogue.
{
"args":["/path/to/input.dat"],
"marcVersion":"MARC21",
"marcFormat":"PICA_NORMALIZED",
"dataSource":"FILE",
"limit":-1,
"offset":-1,
"id":null,
"defaultRecordType":"BOOKS",
"alephseq":false,
"marcxml":false,
"lineSeparated":false,
"trimId":true,
"outputDir":"/path/to/_output/k10plus_pica",
"recordIgnorator":{
"criteria":[],
"booleanCriteria":null,
"empty":true
},
"recordFilter":{
"criteria":[],
"booleanCriteria":{
"op":"AND",
"children":[
{
"op":null,
"children":[],
"value":{
"path":{
"path":"002@.0",
"tag":"002@",
"xtag":null,
"occurrence":null,
"subfields":{"type":"SINGLE","input":"0","codes":["0"]},
"subfieldCodes":["0"]
},
"operator":"NOT_MATCH",
"value":"^L"
}
},
{"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"]},"operator":"NOT_MATCH","value":"^..[iktN]"}},
{"op":"OR","children":[{"op":null,"children":[],"value":{"path":{"path":"002@.0","tag":"002@","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"0","codes":["0"]},"subfieldCodes":["0"]},"operator":"NOT_MATCH","value":"^.v"}},{"op":null,"children":[],"value":{"path":{"path":"021A.a","tag":"021A","xtag":null,"occurrence":null,"subfields":{"type":"SINGLE","input":"a","codes":["a"]},"subfieldCodes":["a"]},"operator":"EXIST","value":null}}],"value":null}
],
"value":null
},
"empty":false
},
"ignorableFields":{
"fields":["001@","001E","001L","001U","001U","001X","001X","002V","003C","003G","003Z","008G","017N","020F","027D","031B","037I","039V","042@","046G","046T","101@","101E","101U","102D","201E","201U","202D"],
"empty":false
},
"stream":null,
"defaultEncoding":null,
"alephseqLineType":null,
"picaIdField":"003@$0",
"picaSubfieldSeparator":"$",
"picaSchemaFile":null,
"picaRecordTypeField":"002@$0",
"schemaType":"PICA",
"groupBy":null,
"detailsFileName":"issue-details.csv",
"summaryFileName":"issue-summary.csv",
"format":"COMMA_SEPARATED",
"ignorableIssueTypes":["FIELD_UNDEFINED"],
"pica":true,
"replacementInControlFields":null,
"marc21":false,
"mqaf.version":"0.9.2",
"qa-catalogue.version":"0.7.0-SNAPSHOT"
}
* `id-groupid.csv`: pairs of record identifiers and group identifiers

id,groupId
010000011,0
010000011,77
010000011,2035
010000011,70
010000011,20
Currently, validation detects the following errors:

* Leader specific errors (e.g. `Leader/19 (leader19) has an invalid value: '4'`)
* Control field specific errors (e.g. `006/01-05 (tag006book01) contains an invalid code: 'n' in ' n '`, `006/13 (tag006book13) has an invalid value: ' '`, `007/01 (tag007microform01) has an invalid value: ' '`, `008/18-22 (tag008book18) contains an invalid code: 'u' in 'u '`, `008/06 (tag008all06) has an invalid value: ' '`)
* Data field specific errors (e.g. `Unhandled tag: 265`, `110 has invalid subfield: s`, `110$ind1 has invalid code: '2'`, `110$ind2 should be empty, it has '0'`, `072$a is not repeatable, however there are 2 instances`, `046$a has an invalid value: 'fb-----'`)
* Errors of specific fields (e.g. `045$a error in '2209668': length is not 4 char`, `880 refers to field 590, which is not defined`)

An example:
Error in ' 00000034 ':
110$ind1 has invalid code: '2'
Error in ' 00000056 ':
110$ind1 has invalid code: '2'
Error in ' 00000057 ':
082$ind1 has invalid code: ' '
Error in ' 00000086 ':
110$ind1 has invalid code: '2'
Error in ' 00000119 ':
700$ind1 has invalid code: '2'
Error in ' 00000234 ':
082$ind1 has invalid code: ' '
Errors in ' 00000294 ':
050$ind2 has invalid code: ' '
260$ind1 has invalid code: '0'
710$ind2 has invalid code: '0'
710$ind2 has invalid code: '0'
710$ind2 has invalid code: '0'
740$ind2 has invalid code: '1'
Error in ' 00000322 ':
110$ind1 has invalid code: '2'
Error in ' 00000328 ':
082$ind1 has invalid code: ' '
Error in ' 00000374 ':
082$ind1 has invalid code: ' '
Error in ' 00000395 ':
082$ind1 has invalid code: ' '
Error in ' 00000514 ':
082$ind1 has invalid code: ' '
Errors in ' 00000547 ':
100$ind2 should be empty, it has '0'
260$ind1 has invalid code: '0'
Errors in ' 00000571 ':
050$ind2 has invalid code: ' '
100$ind2 should be empty, it has '0'
260$ind1 has invalid code: '0'
...
Usage:
catalogues/<catalogue>.sh validate-sqlite
or
./qa-catalogue --params="[options]" validate-sqlite
or
./common-script [options] validate-sqlite
[options] are the same as for validation
If the data is not grouped by libraries (no `--groupBy <path>` parameter), it creates the following SQLite3 database structure and imports some of the CSV files into it:
`issue_summary` table for the `issue-summary.csv`:
It represents a particular type of error
id INTEGER, -- identifier of the error
MarcPath TEXT, -- the location of the error in the bibliographic record
categoryId INTEGER, -- the identifier of the category of the error
typeId INTEGER, -- the identifier of the type of the error
type TEXT, -- the description of the type
message TEXT, -- extra contextual information
url TEXT, -- the url of the definition of the data element
instances INTEGER, -- the number of instances this error occurred
records INTEGER -- the number of records this error occurred in
`issue_details` table for the `issue-details.csv`:
Each row represents how many instances of an error occur in a particular bibliographic record
id TEXT, -- the record identifier
errorId INTEGER, -- the error identifier (-> issue_summary.id)
instances INTEGER -- the number of instances of an error in the record
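As a sketch of querying the `issue_summary` table described above, we can build a tiny throwaway database with the same schema and ask for the most frequent issue (the file name `demo.sqlite` and the sample rows are illustrative; the real SQLite file produced by qa-catalogue lives in the output directory):

```shell
# Create a demo database with the issue_summary schema and query it
sqlite3 demo.sqlite <<'SQL'
CREATE TABLE issue_summary (
  id INTEGER, MarcPath TEXT, categoryId INTEGER, typeId INTEGER,
  type TEXT, message TEXT, url TEXT, instances INTEGER, records INTEGER
);
INSERT INTO issue_summary VALUES (6, '008/00-05', 2, 6, 'invalid value', '', '', 993290, 313733);
INSERT INTO issue_summary VALUES (18, '020$a', 5, 18, 'invalid ISBN', '', '', 5, 3);
-- the most frequent issue by number of affected records:
SELECT MarcPath, type, records FROM issue_summary ORDER BY records DESC LIMIT 1;
SQL
```
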
If the dataset is a union catalogue and the record contains a subfield for the libraries holding the item (there is a `--groupBy <path>` parameter), it creates the following SQLite3 database structure and imports some of the CSV files into it:
`issue_summary` table for the `issue-summary.csv` (it is similar to the other issue_summary table, but it has an extra `groupId` column):
groupId INTEGER,
id INTEGER,
MarcPath TEXT,
categoryId INTEGER,
typeId INTEGER,
type TEXT,
message TEXT,
url TEXT,
instances INTEGER,
records INTEGER
`issue_details` table (same as the other `issue_details` table):
id TEXT,
errorId INTEGER,
instances INTEGER
`id_groupid` table for `id-groupid.csv`:
id TEXT,
groupId INTEGER
`issue_group_types` table contains statistics for the error types per group:
groupId INTEGER,
typeId INTEGER,
records INTEGER,
instances INTEGER
`issue_group_categories` table contains statistics for the error categories per group:
groupId INTEGER,
categoryId INTEGER,
records INTEGER,
instances INTEGER
`issue_group_paths` table contains statistics for the error types per path per group:
groupId INTEGER,
typeId INTEGER,
path TEXT,
records INTEGER,
instances INTEGER
For union catalogues it also creates an extra Solr index with the suffix `_validation`. It contains one Solr document for each bibliographic record with three fields: the record identifier, the list of group identifiers, and the list of error identifiers (if any). This Solr index is needed to populate the `issue_group_types`, `issue_group_categories` and `issue_group_paths` tables. This index will be ingested into the main Solr index.
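These tables can be queried with any SQLite client. A minimal Python sketch against an in-memory copy of the two issue tables; the sample rows and the error message are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE issue_summary (
  groupId INTEGER, id INTEGER, MarcPath TEXT, categoryId INTEGER,
  typeId INTEGER, type TEXT, message TEXT, url TEXT,
  instances INTEGER, records INTEGER
);
CREATE TABLE issue_details (
  id TEXT,          -- the record identifier
  errorId INTEGER,  -- -> issue_summary.id
  instances INTEGER -- instances of the error in the record
);
""")
# Invented sample data: one error type observed in two records
conn.execute("INSERT INTO issue_summary VALUES (0, 1, '245$a', 2, 9, "
             "'undefined subfield', 'example message', NULL, 3, 2)")
conn.executemany("INSERT INTO issue_details VALUES (?, ?, ?)",
                 [("rec1", 1, 2), ("rec2", 1, 1)])

# Join the per-record details with the error descriptions
rows = conn.execute("""
  SELECT d.id, s.MarcPath, s.type, d.instances
  FROM issue_details d
  JOIN issue_summary s ON d.errorId = s.id
  ORDER BY d.instances DESC
""").fetchall()
print(rows)
```

The same join works against the `qa_catalogue.sqlite` file produced by the sqlite step described below.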
java -cp $JAR de.gwdg.metadataqa.marc.cli.Formatter [options] <file>
or with a bash script
./formatter [options] <file>
options:

- `-f`, `--format`: the name of the format (at the time of writing there is none)
- `-c <number>`, `-countNr <number>`: count number of the record (e.g. 1 means the first record)
- `-s [path=query]`, `-search [path=query]`: print records matching the query. The query part is the content of the element. The path should be one of the following types:
  - a control field tag (e.g. `001`, `002`, `003`)
  - a control field position (e.g. `Leader/0`, `008/1-2`)
  - a data field subfield or indicator (e.g. `655$2`, `655$ind1`)
  - a named control field position (e.g. `tag006book01`)
- `-l <selector>`, `--selector <selector>`: one or more MarcSpec or PICA Filter selectors, separated by the ';' (semicolon) character
- `-w`, `--withId`: the generated CSV should contain the record ID as its first field (default is turned off)
- `-p <separator>`, `--separator <separator>`: separator between the parts (default: TAB)
- `-e <file>`, `--fileName <file>`: the name of the report the program produces (default: `extracted.csv`)

The output of displaying a single MARC record is something like this one:
LEADER 01697pam a2200433 c 4500
001 1023012219
003 DE-101
005 20160912065830.0
007 tu
008 120604s2012 gw ||||| |||| 00||||ger
015 $a14,B04$z12,N24$2dnb
016 7 $2DE-101$a1023012219
020 $a9783860124352$cPp. : EUR 19.50 (DE), EUR 20.10 (AT)$9978-3-86012-435-2
024 3 $a9783860124352
035 $a(DE-599)DNB1023012219
035 $a(OCoLC)864553265
035 $a(OCoLC)864553328
040 $a1145$bger$cDE-101$d1140
041 $ager
044 $cXA-DE-SN
082 04$81\u$a622.0943216$qDE-101$222/ger
083 7 $a620$a660$qDE-101$222sdnb
084 $a620$a660$qDE-101$2sdnb
085 $81\u$b622
085 $81\u$z2$s43216
090 $ab
110 1 $0(DE-588)4665669-8$0http://d-nb.info/gnd/4665669-8$0(DE-101)963486896$aHalsbrücke$4aut
245 00$aHalsbrücke$bzur Geschichte von Gemeinde, Bergbau und Hütten$chrsg. von der Gemeinde Halsbrücke anlässlich des Jubliäums "400 Jahre Hüttenstandort Halsbrücke". [Hrsg.: Ulrich Thiel]
264 1$a[Freiberg]$b[Techn. Univ. Bergakad.]$c2012
300 $a151 S.$bIll., Kt.$c31 cm, 1000 g
653 $a(Produktform)Hardback
653 $aGemeinde Halsbrücke
653 $aHüttengeschichte
653 $aFreiberger Bergbau
653 $a(VLB-WN)1943: Hardcover, Softcover / Sachbücher/Geschichte/Regionalgeschichte, Ländergeschichte
700 1 $0(DE-588)1113208554$0http://d-nb.info/gnd/1113208554$0(DE-101)1113208554$aThiel, Ulrich$d1955-$4edt$eHrsg.
850 $aDE-101a$aDE-101b
856 42$mB:DE-101$qapplication/pdf$uhttp://d-nb.info/1023012219/04$3Inhaltsverzeichnis
925 r $arb
An example for extracting values:
./formatter --selector "008~7-10;008~0-5" \
--defaultRecordType BOOKS \
--separator "," \
--outputDir ${OUTPUT_DIR} \
--fileName marc-history.csv \
${MARC_DIR}/*.mrc
It will put the output into ${OUTPUT_DIR}/marc-history.csv.
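The extracted CSV can be post-processed with a few lines of scripting. A sketch that counts records per publication year, assuming the two-column layout produced by the command above ('Date 1' from 008/7-10, then 'date entered' from 008/0-5) with no header row; the sample lines are invented:

```python
import csv
import io

# Invented sample of the formatter output: Date 1 (008/7-10),
# then "date entered on file" (008/0-5), comma-separated.
sample = "2012,120604\n1998,980522\n2012,121101\n"

per_year = {}
for date1, entered in csv.reader(io.StringIO(sample)):
    per_year[date1] = per_year.get(date1, 0) + 1

print(per_year)  # number of records per publication year
```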
Calculates basic statistics about the data elements available in the catalogue.
Usage:
java -cp $JAR de.gwdg.metadataqa.marc.cli.Completeness [options] <file>
or with a bash script
./completeness [options] <file>
or
catalogues/<catalogue>.sh completeness
or
./qa-catalogue --params="[options]" completeness
options:

- `-R <format>`, `--format <format>`: format specification of the output. Possible values are: `tab-separated` or `tsv`, `comma-separated` or `csv`, `text` or `txt`, `json`
- `-V`, `--advanced`: advanced mode (not yet implemented)
- `-P`, `--onlyPackages`: only packages (not yet implemented)

Output files:
- `marc-elements.csv`: the list of MARC elements (field$subfield) and their occurrences in two ways:
  - `documenttype`: the document types found in the dataset. There is an extra document type, `all`, representing all records
  - `path`: the notation of the data element
  - `packageid` and `package`: each path belongs to one package, such as `Control Fields`, and each package has an internal identifier
  - `tag`: the label of the tag
  - `subfield`: the label of the subfield
  - `number-of-record`: how many records the element is available in
  - `number-of-instances`: how many instances there are in total (some records might contain more than one instance, while others don't have any)
  - `min`, `max`, `mean`, `stddev`: the minimum, maximum, mean and standard deviation of the number of instances per record (as floating point numbers)
  - `histogram`: the histogram of the instances (`1=1; 2=1` means: a single instance is available in one record, and two instances are available in one record)

documenttype | path | packageid | package | tag | subfield | number-of-record | number-of-instances | min | max | mean | stddev | histogram |
---|---|---|---|---|---|---|---|---|---|---|---|---|
all | leader23 | 0 | Control Fields | Leader | Undefined | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | leader22 | 0 | Control Fields | Leader | Length of the implementation-defined portion | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | leader21 | 0 | Control Fields | Leader | Length of the starting-character-position portion | 1099 | 1099 | 1 | 1 | 1.0 | 0.0 | 1=1099 |
all | 110$a | 2 | Main Entry | Main Entry - Corporate Name | Corporate name or jurisdiction name as entry element | 4 | 4 | 1 | 1 | 1.0 | 0.0 | 1=4 |
all | 340$b | 5 | Physical Description | Physical Medium | Dimensions | 2 | 3 | 1 | 2 | 1.5 | 0.3535533905932738 | 1=1; 2=1 |
all | 363$a | 5 | Physical Description | Normalized Date and Sequential Designation | First level of enumeration | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
all | 340$a | 5 | Physical Description | Physical Medium | Material base and configuration | 2 | 3 | 1 | 2 | 1.5 | 0.3535533905932738 | 1=1; 2=1 |
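Most of the statistical columns can be reproduced from the per-record instance counts. A sketch for the `340$b` row above (one record with a single instance, one with two); the `stddev` column is omitted here since the exact estimator is not documented in this text:

```python
from collections import Counter
from statistics import mean

# Instance counts of 340$b per record (from the sample row above)
instances_per_record = [1, 2]

number_of_records = len(instances_per_record)
number_of_instances = sum(instances_per_record)
histogram = Counter(instances_per_record)
histogram_str = "; ".join(f"{k}={v}" for k, v in sorted(histogram.items()))

print(number_of_records, number_of_instances,
      min(instances_per_record), max(instances_per_record),
      mean(instances_per_record), histogram_str)
# -> 2 3 1 2 1.5 1=1; 2=1
```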
- `packages.csv`: the completeness of packages.
  - `documenttype`: the document type of the record
  - `packageid`: the identifier of the package
  - `name`: name of the package
  - `label`: label of the package
  - `iscoretag`: whether the package belongs to the Library of Congress MARC standard
  - `count`: the number of records having at least one data element from this package

documenttype | packageid | name | label | iscoretag | count |
---|---|---|---|---|---|
all | 1 | 01X-09X | Numbers and Code | true | 1099 |
all | 2 | 1XX | Main Entry | true | 816 |
all | 6 | 4XX | Series Statement | true | 358 |
all | 5 | 3XX | Physical Description | true | 715 |
all | 8 | 6XX | Subject Access | true | 514 |
all | 4 | 25X-28X | Edition, Imprint | true | 1096 |
all | 7 | 5XX | Note | true | 354 |
all | 0 | 00X | Control Fields | true | 1099 |
all | 99 | unknown | unknown origin | false | 778 |
- `libraries.csv`: lists the content of 852$a (useful only if the catalogue is an aggregated catalogue)
  - `library`: the code of a library
  - `count`: the number of records having a particular library code

library | count |
---|---|
"00Mf" | 713 |
"British Library" | 525 |
"Inserted article about the fires from the Courant after the title page." | 1 |
"National Library of Scotland" | 310 |
"StEdNL" | 1 |
"UkOxU" | 33 |
- `libraries003.csv`: lists the content of 003 (useful only if the catalogue is an aggregated catalogue)
  - `library`: the code of a library
  - `count`: the number of records having a particular library code

library | count |
---|---|
"103861" | 1 |
"BA-SaUP" | 143 |
"BoCbLA" | 25 |
"CStRLIN" | 110 |
"DLC" | 3 |
- `completeness.params.json`: the list of the actual parameters used in the analysis

An example with parameters used for analysing a MARC dataset. When the input is a complex expression, it is displayed here in a parsed format. It also contains some metadata such as the versions of the MQAF API and QA catalogue.
{
"args":["/path/to/input.xml.gz"],
"marcVersion":"MARC21",
"marcFormat":"XML",
"dataSource":"FILE",
"limit":-1,
"offset":-1,
"id":null,
"defaultRecordType":"BOOKS",
"alephseq":false,
"marcxml":true,
"lineSeparated":false,
"trimId":false,
"outputDir":"/path/to/_output/",
"recordIgnorator":{
"conditions":null,
"empty":true
},
"recordFilter":{
"conditions":null,
"empty":true
},
"ignorableFields":{
"fields":null,
"empty":true
},
"stream":null,
"defaultEncoding":null,
"alephseqLineType":null,
"picaIdField":"003@$0",
"picaSubfieldSeparator":"$",
"picaSchemaFile":null,
"picaRecordTypeField":"002@$0",
"schemaType":"MARC21",
"groupBy":null,
"groupListFile":null,
"format":"COMMA_SEPARATED",
"advanced":false,
"onlyPackages":false,
"replacementInControlFields":"#",
"marc21":true,
"pica":false,
"mqaf.version":"0.9.2",
"qa-catalogue.version":"0.7.0"
}
For union catalogues the `marc-elements.csv` and `packages.csv` have a special version:

- `completeness-grouped-marc-elements.csv`: the same as `marc-elements.csv` but with an extra element, `groupId`
  - `groupId`: the library identifier available in the data element specified by the `--groupBy` parameter. `0` has a special meaning: all libraries

groupId | documenttype | path | packageid | package | tag | subfield | number-of-record | number-of-instances | min | max | mean | stddev | histogram |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
350 | all | 044K$9 | 50 | PICA+ bibliographic description | "Schlagwortfolgen (GBV, SWB, K10plus)" | PPN | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
350 | all | 044K$7 | 50 | PICA+ bibliographic description | "Schlagwortfolgen (GBV, SWB, K10plus)" | Vorläufiger Link | 1 | 1 | 1 | 1 | 1.0 | 0.0 | 1=1 |
- `completeness-grouped-packages.csv`: the same as `packages.csv` but with an extra element, `group`
  - `group`: the library identifier available in the data element specified by the `--groupBy` parameter. `0` has a special meaning: all libraries

group | documenttype | packageid | name | label | iscoretag | count |
---|---|---|---|---|---|---|
0 | Druckschriften (einschließlich Bildbänden) | 50 | 0... | PICA+ bibliographic description | false | 987 |
0 | Druckschriften (einschließlich Bildbänden) | 99 | unknown | unknown origin | false | 3 |
0 | Medienkombination | 50 | 0... | PICA+ bibliographic description | false | 1 |
0 | Mikroform | 50 | 0... | PICA+ bibliographic description | false | 11 |
0 | Tonträger, Videodatenträger, Bildliche Darstellungen | 50 | 0... | PICA+ bibliographic description | false | 1 |
0 | all | 50 | 0... | PICA+ bibliographic description | false | 1000 |
0 | all | 99 | unknown | unknown origin | false | 3 |
100 | Druckschriften (einschließlich Bildbänden) | 50 | 0... | PICA+ bibliographic description | false | 20 |
100 | Medienkombination | 50 | 0... | PICA+ bibliographic description | false | 1 |
- `completeness-groups.csv`: available for union catalogues, containing the groups
  - `id`: the group identifier
  - `group`: the name of the library
  - `count`: the number of records from the particular library

id | group | count |
---|---|---|
0 | all | 1000 |
100 | Otto-von-Guericke-Universität, Universitätsbibliothek Magdeburg [DE-Ma9] | 21 |
1003 | Kreisarchäologie Rotenburg [DE-MUS-125322...] | 1 |
101 | Otto-von-Guericke-Universität, Universitätsbibliothek, Medizinische Zentralbibliothek (MZB), Magdeburg [DE-Ma14...] | 6 |
1012 | Mariengymnasium Jever [DE-Je1] | 19 |
- `id-groupid.csv`: the very same file that validation creates. Completeness creates it if it is not yet available.

The `completeness-sqlite` step (which is launched by the `completeness` step, but could be launched independently as well) imports the `marc-elements.csv` or `completeness-grouped-marc-elements.csv` file into the `marc_elements` table. For catalogues without the `--groupBy` parameter the `groupId` column will be filled with `0`.
groupId INTEGER,
documenttype TEXT,
path TEXT,
packageid INTEGER,
package TEXT,
tag TEXT,
subfield TEXT,
number-of-record INTEGER,
number-of-instances INTEGER,
min INTEGER,
max INTEGER,
mean REAL,
stddev REAL,
histogram TEXT
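Note that the hyphenated column names have to be quoted in SQL queries. A minimal Python sketch querying an in-memory copy of the `marc_elements` table; the two sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names containing hyphens must be quoted in SQL
conn.execute("""CREATE TABLE marc_elements (
  groupId INTEGER, documenttype TEXT, path TEXT, packageid INTEGER,
  package TEXT, tag TEXT, subfield TEXT,
  "number-of-record" INTEGER, "number-of-instances" INTEGER,
  min INTEGER, max INTEGER, mean REAL, stddev REAL, histogram TEXT)""")
conn.executemany(
    "INSERT INTO marc_elements VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
    [  # invented sample rows
        (0, "all", "245$a", 3, "Title", "Title Statement", "Title",
         1000, 1000, 1, 1, 1.0, 0.0, "1=1000"),
        (0, "all", "650$a", 8, "Subject Access", "Subject Added Entry",
         "Topical term", 514, 1800, 1, 9, 3.5, 1.2, "1=100; 2=200"),
    ])

# The most frequent data elements across the whole catalogue
top = conn.execute("""
  SELECT path, "number-of-instances" FROM marc_elements
  WHERE groupId = 0 AND documenttype = 'all'
  ORDER BY "number-of-instances" DESC LIMIT 5""").fetchall()
print(top)
```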
Kelly Thompson and Stacie Traill published their approach to calculate the quality of ebook records coming from different data sources. This is an implementation of the scoring algorithm described in their article: Leveraging Python to improve ebook metadata selection, ingest, and management. In Code4Lib Journal, Issue 38, 2017-10-18. http://journal.code4lib.org/articles/12828
java -cp $JAR de.gwdg.metadataqa.marc.cli.ThompsonTraillCompleteness [options] <file>
or with a bash script
./tt-completeness [options] <file>
or
catalogues/[catalogue].sh tt-completeness
or
./qa-catalogue --params="[options]" tt-completeness
options:

- `-F <file>`, `--fileName <file>`: the name of the report the program produces. Default is `tt-completeness.csv`.

It produces a CSV file like this:
id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM, \
LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish, \
RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
"010027734",0,0,3,0,1,2,0,1,2,0,0,0,0,0,0,0,1,0,0,0,10
This analysis is the implementation of the following paper:
Emma Booth (2020) Quality of Shelf-Ready Metadata. Analysis of survey responses and recommendations for suppliers Pontefract (UK): National Acquisitions Group, 2020. p 31. https://nag.org.uk/wp-content/uploads/2020/06/NAG-Quality-of-Shelf-Ready-Metadata-Survey-Analysis-and-Recommendations_FINAL_June2020.pdf
The main purpose of the report is to highlight which fields of printed and electronic book records are important when the records come from different suppliers. 50 libraries participated in the survey, and each selected the fields that are important to them. The report listed the fields which got the highest scores.
The current calculation is based on this list of essential fields. If all the specified data elements are available in the record, it gets the full score; if only some of them are, it gets a proportional score. E.g. under 250 (edition statement) there are two subfields. If both are available, the record gets a score of 44; if only one of them, half of that (22); and if none, 0. For 1XX, 6XX, 7XX and 8XX the record gets the full score if at least one of those fields (with subfield $a) is available. The total score is the average of these scores. The theoretical maximum score would be 28.44, which could be achieved if all the data elements are available in the record.
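A sketch of the proportional scoring rule described above. The weight 44 and the two subfields of 250 come from the text; the full field list and weights are defined in the QA catalogue source code:

```python
def proportional_score(weight, present, expected):
    """A field group gets weight * (available data elements / expected ones)."""
    return weight * present / expected

# 250 (edition statement): weight 44, two expected subfields
print(proportional_score(44, 2, 2))  # both subfields present -> 44.0
print(proportional_score(44, 1, 2))  # one present            -> 22.0
print(proportional_score(44, 0, 2))  # none present           ->  0.0
```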
java -cp $JAR de.gwdg.metadataqa.marc.cli.ShelfReadyCompleteness [options] <file>
with a bash script
./shelf-ready-completeness [options] <file>
or
catalogues/[catalogue].sh shelf-ready-completeness
or
./qa-catalogue --params="[options]" shelf-ready-completeness
options:

- `-F <file>`, `--fileName <file>`: the report file name (default is `shelf-ready-completeness.csv`)

These scores are calculated for each continuing resource (type of record (LDR/6) is language material ('a') and bibliographic level (LDR/7) is serial component part ('b'), integrating resource ('i') or serial ('s')).
The calculation is based on a slightly modified version of the method published by Jamie Carlstone in the following paper:
Jamie Carlstone (2017) Scoring the Quality of E-Serials MARC Records Using Java, Serials Review, 43:3-4, pp. 271-277, DOI: 10.1080/00987913.2017.1350525 URL: https://www.tandfonline.com/doi/full/10.1080/00987913.2017.1350525
java -cp $JAR de.gwdg.metadataqa.marc.cli.SerialScore [options] <file>
with a bash script
./serial-score [options] <file>
or
catalogues/[catalogue].sh serial-score
or
./qa-catalogue --params="[options]" serial-score
options:

- `-F <file>`, `--fileName <file>`: the report file name. Default is `shelf-ready-completeness.csv`.

The Functional Requirements for Bibliographic Records (FRBR) document's main part defines the primary and secondary entities which became famous as the FRBR models. Years later Tom Delsey created a mapping between the 12 functions and the individual MARC elements.
Tom Delsey (2002) Functional analysis of the MARC 21 bibliographic and holdings formats. Tech. report. Library of Congress, 2002. Prepared for the Network Development and MARC Standards Office Library of Congress. Second Revision: September 17, 2003. https://www.loc.gov/marc/marc-functional-analysis/original_source/analysis.pdf.
This analysis shows how these functions are supported by the records. Low support means that only a small portion of the fields supporting a function are available in the records; strong support, on the contrary, means that lots of fields are available. The analysis calculates the support of the 12 functions for each record, and returns summary statistics.
It is an experimental feature: it turned out that the mapping covers about 2000 elements (fields, subfields, indicators etc.), while an average record contains at most a few hundred elements, so even the best records have only about 10-15% of the elements supporting a given function. Therefore the tool doesn't show exact numbers, and the scale is not 0-100 but 0-[best score], which is different for every catalogue.
The 12 functions are grouped as:

- Discovery functions
- Usage functions
- Management functions
java -cp $JAR de.gwdg.metadataqa.marc.cli.FunctionalAnalysis [options] <file>
with a bash script
./functional-analysis [options] <file>
or
catalogues/<catalogue>.sh functional-analysis
or
./qa-catalogue --params="[options]" functional-analysis
options:

Output files:

- `functional-analysis.csv`: the list of the 12 functions with their average count (number of supporting fields) and average score (percentage of all supporting fields available in the record)
- `functional-analysis-mapping.csv`: the mapping of functions and data elements
- `functional-analysis-histogram.csv`: the histogram of scores and counts of records for each function (e.g. there are x records which have score j for function a)

It analyses the coverage of subject indexing/classification in the catalogue. It checks specific fields which might have subject indexing information, and provides details about how and which subject indexing schemes have been applied.
java -cp $JAR de.gwdg.metadataqa.marc.cli.ClassificationAnalysis [options] <file>
Rscript scripts/classifications/classifications-type.R <output directory>
with a bash script
./classifications [options] <file>
Rscript scripts/classifications/classifications-type.R <output directory>
or
catalogues/[catalogue].sh classifications
or
./qa-catalogue --params="[options]" classifications
options:

- `-w`, `--emptyLargeCollectors`: empty large collectors periodically. It is a memory optimization parameter; turn it on if you run into a memory problem.

The output is a set of files:
- `classifications-by-records.csv`: general overview of how many records have any subject indexing
- `classifications-by-schema.csv`: which subject indexing schemas are available in the catalogue (such as DDC, UDC, MeSH etc.) and where they are referred
- `classifications-histogram.csv`: a frequency distribution of the number of subjects available in records (x records have 0 subjects, y records have 1 subject, z records have 2 subjects etc.)
- `classifications-frequency-examples.csv`: examples for particular distributions (one record ID which has 0 subjects, one which has 1 subject, etc.)
- `classifications-by-schema-subfields.csv`: the distribution of subfields of those fields which contain subject indexing information. It shows what other contextual information is available behind the subject term (such as the version of the subject indexing scheme)
- `classifications-collocations.csv`: how many records have a particular set of subject indexing schemes
- `classifications-by-type.csv`: returns the subject indexing schemes and their types in order of the number of records. The types are TERM_LIST (subtypes: DICTIONARY, GLOSSARY, SYNONYM_RING), METADATA_LIKE_MODEL (NAME_AUTHORITY_LIST, GAZETTEER), CLASSIFICATION (SUBJECT_HEADING, CATEGORIZATION, TAXONOMY, CLASSIFICATION_SCHEME), RELATIONSHIP_MODEL (THESAURUS, SEMANTIC_NETWORK, ONTOLOGY).

It analyses the coverage of authority names (persons, organisations, events, uniform titles) in the catalogue. It checks specific fields which might have authority names, and provides details about how and which schemes have been applied.
java -cp $JAR de.gwdg.metadataqa.marc.cli.AuthorityAnalysis [options] <file>
with a bash script
./authorities [options] <file>
or
catalogues/<catalogue>.sh authorities
or
./qa-catalogue --params="[options]" authorities
options:

- `-w`, `--emptyLargeCollectors`: empty large collectors periodically. It is a memory optimization parameter; turn it on if you run into a memory problem.

The output is a set of files:

- `authorities-by-records.csv`: general overview of how many records have any authority names
- `authorities-by-schema.csv`: which authority name schemas are available in the catalogue (such as ISNI, Gemeinsame Normdatei etc.) and where they are referred
- `authorities-histogram.csv`: a frequency distribution of the number of authority names available in records (x records have 0 authority names, y records have 1 authority name, z records have 2 authority names etc.)
- `authorities-frequency-examples.csv`: examples for particular distributions (one record ID which has 0 authority names, one which has 1 authority name, etc.)
- `authorities-by-schema-subfields.csv`: the distribution of subfields of those fields which contain authority name information. It shows what other contextual information is available behind the authority names (such as the version of the authority name scheme)

This analysis reveals the relative importance of some fields. The Pareto distribution is a kind of power law distribution, and the Pareto rule (or 80-20 rule) states that 80% of outcomes are due to 20% of causes. In a catalogue, the outcome is the total number of occurrences of data elements, and the causes are the individual data elements. Some data elements occur much more frequently than others. This analysis highlights the distribution of the data elements: whether or not it is similar to the Pareto distribution.
It produces charts for each document type and one for the whole catalogue showing the field frequency patterns. Each chart shows a line which is a function of field frequency: on the X-axis the subfields are ordered by frequency (how many times a given subfield occurs in the whole catalogue), from the most frequent top 1% to the least frequent 1%. The Y-axis represents the cumulative occurrence (from 0% to 100%).
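The shape of the curve can be sketched as a cumulative sum over the subfield frequencies sorted in descending order; the frequency values here are invented:

```python
# Invented subfield occurrence counts, sorted from most to least frequent
freqs = sorted([50, 30, 15, 5], reverse=True)
total = sum(freqs)

# Accumulate and normalise to 0-100% (the Y-axis of the Pareto chart)
cumulative = []
running = 0
for f in freqs:
    running += f
    cumulative.append(100 * running / total)

print(cumulative)  # -> [50.0, 80.0, 95.0, 100.0]
```

A strongly Pareto-like catalogue reaches a high cumulative percentage after only a small fraction of the subfields.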
Before running it you should first run the completeness calculation.
With a bash script
catalogues/[catalogue].sh pareto
or
./qa-catalogue --params="[options]" pareto
options:
This analysis is based on Benjamin Schmidt's blog post A brief visual history of MARC cataloging at the Library of Congress. (Tuesday, May 16, 2017).
It produces a chart where the Y-axis is based on the "date entered on file" data element (008/00-05), which indicates the date the MARC record was created, and the X-axis is based on the "Date 1" element (008/07-10).
Usage:
catalogues/[catalogue].sh marc-history
or
./qa-catalogue --params="[options]" marc-history
options:
This is just a helper function which imports the results of validation into an SQLite3 database.
The prerequisite of this step is to run validation first, since it uses the files produced there. If you run validation with the `catalogues/<catalogue>.sh` or `./qa-catalogue` scripts, this importing step is already covered there.
Usage:
catalogues/[catalogue].sh sqlite
or
./qa-catalogue --params="[options]" sqlite
options:
Output:

- `qa_catalogue.sqlite`: the SQLite3 database with 3 tables: `issue_details`, `issue_groups`, and `issue_summary`.

Run indexer:
java -cp $JAR de.gwdg.metadataqa.marc.cli.MarcToSolr [options] [file]
With script:
catalogues/[catalogue].sh all-solr
or
./qa-catalogue --params="[options]" all-solr
options:

- `-S <URL>`, `--solrUrl <URL>`: the URL of the Solr server including the core (e.g. http://localhost:8983/solr/loc)
- `-A`, `--doCommit`: send commits to Solr regularly (not needed if you set up Solr as described below)
- `-T <type>`, `--solrFieldType <type>`: a Solr field type, one of the predefined values. See examples below.
  - `marc-tags` - the field names are MARC codes
  - `human-readable` - the field names are Self Descriptive MARC codes
  - `mixed` - the field names are a mix of the above (e.g. `245a_Title_mainTitle`)
- `-C`, `--indexWithTokenizedField`: index data elements as tokenized fields as well (each bibliographical data element will be indexed twice: once as a phrase (fields suffixed with `_ss`), and once as a bag of words (fields suffixed with `_txt`)). [This parameter is available from v0.8.0]
- `-D <int>`, `--commitAt <int>`: commit the index after this number of records [This parameter is available from v0.8.0]
- `-E`, `--indexFieldCounts`: index the count of field instances [This parameter is available from v0.8.0]
- `-F`, `--fieldPrefix <arg>`: field prefix

The `./index` script (which is used by the `catalogues/[catalogue].sh` and `./qa-catalogue` scripts) has additional parameters:
- `-Z <core>`, `--core <core>`: the index name (core). If not set, it will be extracted from the `solrUrl` parameter
- `-Y <path>`, `--file-path <path>`: file path
- `-X <mask>`, `--file-mask <mask>`: file mask
- `-W`, `--purge`: purge the index and exit
- `-V`, `--status`: show the status of the index(es) and exit
- `-U`, `--no-delete`: do not delete documents in the index before starting indexing (by default the script clears the index)

QA catalogue builds a Solr index which contains a) a set of fixed Solr fields that are the same for all bibliographic input, and b) Solr fields that depend on the field names of the metadata schema (MARC, PICA, UNIMARC etc.) - these fields are mapped from the metadata schema to dynamic Solr fields by an algorithm.
- `id`: the record ID. This comes from the identifier of the bibliographic record, so 001 for MARC21
- `record_sni`: the JSON representation of the bibliographic record
- `groupId_is`: the list of group IDs. The content comes from the data element specified by the `--groupBy` parameter, split by commas (',')
- `errorId_is`: the list of error IDs that come from the result of the validation

The mapped fields are Solr fields that depend on the field names of the metadata schema. The final Solr field follows the pattern:
);
```
Code is a simple object; it has two properties: code and label.
example:
```Java
public class Tag024 extends DataFieldDefinition {
...
ind1 = new Indicator("Type of standard number or code")
.setCodes(...)
.putVersionSpecificCodes(
MarcVersion.SZTE,
Arrays.asList(
new Code(" ", "Not specified")
)
)
...
}
```
2. Defining version specific subfields:
```Java
DataFieldDefinition::putVersionSpecificSubfields(MarcVersion, List)
```
SubfieldDefinition contains the definition of a subfield. You can construct it
with three String parameters: a code, a label and a cardinality code which
denotes whether the subfield is repeatable ("R") or not ("NR").
example:
```Java
public class Tag024 extends DataFieldDefinition {
...
putVersionSpecificSubfields(
MarcVersion.DNB,
Arrays.asList(
new SubfieldDefinition("9", "Standardnummer (mit Bindestrichen)", "NR")
)
);
}
```
3. Marking indicator codes as obsolete:
```Java
Indicator::setHistoricalCodes(List)
```
The list should be pairs of code and description.
```Java
public class Tag082 extends DataFieldDefinition {
...
ind1 = new Indicator("Type of edition")
.setCodes(...)
.setHistoricalCodes(
" ", "No edition information recorded (BK, MU, VM, SE) [OBSOLETE]",
"2", "Abridged NST version (BK, MU, VM, SE) [OBSOLETE]"
)
...
}
```
4. Marking subfields as obsolete:
```Java
DataFieldDefinition::setHistoricalSubfields(List);
```
The list should be pairs of code and description.
```Java
public class Tag020 extends DataFieldDefinition {
...
setHistoricalSubfields(
"b", "Binding information (BK, MP, MU) [OBSOLETE]"
);
}
```
If you create a new package for a new MARC version, you should register it in several places:
a. add a case into `src/main/java/de/gwdg/metadataqa/marc/Utils.java`:
```Java
case "zbtags": version = MarcVersion.ZB; break;
```
b. add an item into enumeration at `src/main/java/de/gwdg/metadataqa/marc/definition/tags/TagCategory.java`:
```Java
ZB(23, "zbtags", "ZB", "Locally defined tags of the Zentralbibliothek Zürich", false),
```
c. modify the expected number of data elements at `src/test/java/de/gwdg/metadataqa/marc/utils/DataElementsStaticticsTest.java`:
```Java
assertEquals( 215, statistics.get(DataElementType.localFields));
```
d. ... and modify `src/test/java/de/gwdg/metadataqa/marc/utils/MarcTagListerTest.java`:
```Java
assertEquals( 2, (int) versionCounter2.get(MarcVersion.ZB));
assertEquals( 2, (int) versionCounter.get("zbtags"));
```
### Appendix III: Institutions which reportedly use this tool
* [Universiteitsbibliotheek Gent](https://lib.ugent.be/), Gent, Belgium
* [Biblioteksentralen](https://www.bibsent.no/), Oslo, Norway
* [Deutsche Digitale Bibliothek](https://www.deutsche-digitale-bibliothek.de/), Frankfurt am Main/Berlin, Germany
* [British Library](https://www.bl.uk/), London/Boston Spa, United Kingdom
* [Országgyűlési Könyvtár](https://www.ogyk.hu/en), Budapest, Hungary
* [Studijní a vědecká knihovna Plzeňského kraje](https://svkpk.cz/), Plzeň, Czech Republic
* [Royal Library of Belgium (KBR)](https://kbr.be/), Brussels, Belgium
* [Gemeinsamer Bibliotheksverbund (GBV)](https://www.gbv.de/informationen/Verbund/), Göttingen, Germany
* [Binghamton University Libraries](https://www.binghamton.edu/libraries/), Binghamton, NY, USA
* [Zentralbibliothek Zürich](https://www.zb.uzh.ch/de), Zürich, Switzerland
If you use this tool as well, please contact me: pkiraly (at) gwdg (dot) de. I
would really like to hear about your use cases and ideas.
### Appendix IV: Supporters and Sponsors
* [Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)](https://gwdg.de): Hardware, time for
research
* [Gemeinsamer Bibliotheksverbund (GBV)](https://www.gbv.de/informationen/Verbund/): contracting for feature development
* [Royal Library of Belgium (KBR)](https://kbr.be/): contracting for feature development
* [JetBrains s.r.o.](https://www.jetbrains.com/idea/): [IntelliJ IDEA](https://www.jetbrains.com/idea/)
development tool community licence
### Appendix V: Special build process
"deployment" build (when deploying artifacts to Maven Central)
```
mvn clean deploy -Pdeploy
```
### Appendix VI: Build Docker image
Build and test
```bash
# create the Java library
mvn clean install
# create the docker base image
docker compose -f docker/build.yml build app
```
The `docker compose build` command has multiple `--build-arg` arguments to override defaults:
- `QA_CATALOGUE_VERSION`: the QA catalogue version (default: `0.7.0`, current development version is `0.8.0-SNAPSHOT`)
- `QA_CATALOGUE_WEB_VERSION`: it might be a released version such as `0.7.0`, or `main` (default) to use the
main branch, or `develop` to use the develop branch.
- `SOLR_VERSION`: the Apache Solr version you would like to use (default: `8.11.1`)
- `SOLR_INSTALL_SOURCE`: if its value is `remote`, Docker will download it from http://archive.apache.org/.
  If its value is a local path pointing to a previously downloaded package (named `solr-${SOLR_VERSION}.zip`
  up to version 8.x.x or `solr-${SOLR_VERSION}.tgz` from version 9.x.x), the process will copy it from the
  host into the image. Depending on the internet connection the download might take a long time, so using a
  previously downloaded package speeds up the build process.
  (Note: it is not possible to specify files outside the current directory, nor to use symbolic links, but
  you can create hard links - see an example below.)
Using the current developer version:
```bash
docker compose -f docker/build.yml build app \
--build-arg QA_CATALOGUE_VERSION=0.8.0-SNAPSHOT \
--build-arg QA_CATALOGUE_WEB_VERSION=develop \
--build-arg SOLR_VERSION=8.11.3
```
Using a downloaded Solr package:
```bash
# create a temporary hard link
mkdir download
ln ~/Downloads/solr/solr-8.11.3.zip download/solr-8.11.3.zip
# run docker
docker compose -f docker/build.yml build app \
--build-arg QA_CATALOGUE_VERSION=0.8.0-SNAPSHOT \
--build-arg QA_CATALOGUE_WEB_VERSION=develop \
--build-arg SOLR_VERSION=8.11.3 \
--build-arg SOLR_INSTALL_SOURCE=download/solr-8.11.3.zip
# delete the temporary link
rm -rf download
```
Then start the container with environment variable `IMAGE` set to
`metadata-qa-marc` and run analyses [as described above](#with-docker).
For maintainers only:
Upload to Docker Hub:
```bash
docker tag metadata-qa-marc:latest pkiraly/metadata-qa-marc:latest
docker login
docker push pkiraly/metadata-qa-marc:latest
```
Cleaning before and after:
```bash
# stop running container
docker stop $(docker ps --filter name=metadata-qa-marc -q)
# remove container
docker rm $(docker ps -a --filter name=metadata-qa-marc -q)
# remove image
docker rmi $(docker images metadata-qa-marc -q)
# clear build cache
docker builder prune -a -f
```
Feedback is welcome!
[![Build Status](https://travis-ci.org/pkiraly/metadata-qa-marc.svg?branch=main)](https://travis-ci.org/pkiraly/metadata-qa-marc)
[![Coverage Status](https://coveralls.io/repos/github/pkiraly/metadata-qa-marc/badge.svg?branch=main)](https://coveralls.io/github/pkiraly/metadata-qa-marc?branch=main)
[![codecov](https://codecov.io/gh/pkiraly/metadata-qa-marc/branch/main/graph/badge.svg?token=dBPIoFd0bz)](https://codecov.io/gh/pkiraly/metadata-qa-marc)
[![javadoc](https://javadoc.io/badge2/de.gwdg.metadataqa/metadata-qa-marc/javadoc.svg)](https://javadoc.io/doc/de.gwdg.metadataqa/metadata-qa-marc)
[![Maven Central](https://img.shields.io/maven-central/v/de.gwdg.metadataqa/metadata-qa-marc.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:%22de.gwdg.metadataqa%22%20AND%20a:%22metadata-qa-marc%22)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6394934.svg)](https://doi.org/10.5281/zenodo.6394934)
[![SonarCloud](https://sonarcloud.io/images/project_badges/sonarcloud-orange.svg)](https://sonarcloud.io/summary/overall?id=pkiraly_metadata-qa-marc)