inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

BibIndex: incremental indexing leak when tags contain indicator wildcard #2696

Open tiborsimko opened 9 years ago

tiborsimko commented 9 years ago

There seems to be an incremental indexing leak when a record is cloned (and hence uploaded via bibupload -r) and some of the tags to index are defined with a wildcard in the indicator positions (e.g. 245%a).

(When a record is inserted, things seem to work, due to different treatment of affected_fields. But see also #2693.)

Here is how to reproduce the problem.

Let us recreate a fresh demo site and index any pending OAI repository jobs:

$ invenio-recreate-demo-site --yes-i-know
$ sudo -u www-data /opt/invenio/bin/bibindex -u admin -v9
$ sudo -u www-data /opt/invenio/bin/bibindex 10
$ sudo -u www-data /opt/invenio/bin/bibindex -u admin -v9
$ sudo -u www-data /opt/invenio/bin/bibindex 11

Let us simulate cloning of a record with default setup first:

$ echo "INSERT INTO bibrec VALUES (1234, NOW(), NOW(), 'marc')" | /opt/invenio/bin/dbexec
$ echo "SELECT MAX(id) FROM bibrec" | /opt/invenio/bin/dbexec
MAX(id)
1234
$ wget -O /tmp/z.xml http://pcuds06.cern.ch/record/34/export/xm
$ grep -v '"005"' /tmp/z.xml | sed 's,>34<,>1234<,g' > /tmp/zz.xml
$ colordiff -wu /tmp/z.xml /tmp/zz.xml
--- /tmp/z.xml  2015-01-22 12:50:50.659423897 +0100
+++ /tmp/zz.xml 2015-01-22 12:51:05.143424540 +0100
@@ -1,8 +1,7 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <collection xmlns="http://www.loc.gov/MARC21/slim">
 <record>
-  <controlfield tag="001">34</controlfield>
-  <controlfield tag="005">20150122124436.0</controlfield>
+  <controlfield tag="001">1234</controlfield>
   <datafield tag="020" ind1=" " ind2=" ">
     <subfield code="a">0387940758</subfield>
   </datafield>
$ sudo -u www-data /opt/invenio/bin/bibupload -r /tmp/zz.xml
$ sudo -u www-data /opt/invenio/bin/bibupload 12
$ echo "SELECT affected_fields FROM hstRECORD WHERE id_bibrec=1234" | /opt/invenio/bin/dbexec
affected_fields
005__%,020__%,041__%,080__%,245__%,260__%,270__%,300__%,909C0%,909C1%,909CS%,980__%

and let us see if incremental indexing works:

$ sudo -u www-data /opt/invenio/bin/bibindex -u admin -v 9
$ sudo -u www-data /opt/invenio/bin/bibindex 13
$ grep -o idxPHRASE[0-9][0-9] /opt/invenio/var/log/bibsched/0/bibsched_task_13.log | sort -u
idxPHRASE02
idxPHRASE07
idxPHRASE08
idxPHRASE10
idxPHRASE19
idxPHRASE26
$ echo "SELECT id,name FROM idxINDEX WHERE id IN (2,7,8,10,19,26)" | /opt/invenio/bin/dbexec
id      name
2       collection
7       reportnumber
8       title
10      year
19      exacttitle
26      miscellaneous

Looks good: not all field indexes were updated, only the ones that the record itself actually contained, which corresponds to the list of affected_fields above.
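
Each entry in affected_fields above looks like a six-character MARC tag pattern: three tag characters, two indicators (with `_` standing for a blank indicator) and a subfield position, which here is always the `%` wildcard. A minimal Python sketch for taking such a row apart, assuming that layout holds for every entry; `AffectedField` and `parse_affected_fields` are hypothetical names and not part of Invenio:

```python
from typing import List, NamedTuple


class AffectedField(NamedTuple):
    tag: str       # e.g. "245"
    ind1: str      # "_" means a blank first indicator
    ind2: str      # "_" means a blank second indicator
    subfield: str  # "%" means any subfield code


def parse_affected_fields(value: str) -> List[AffectedField]:
    """Split a comma-separated affected_fields value into its parts.

    Assumption: every entry is exactly six characters long.
    """
    fields = []
    for item in value.split(","):
        item = item.strip()
        if len(item) != 6:
            raise ValueError(f"unexpected affected_fields entry: {item!r}")
        fields.append(AffectedField(item[:3], item[3], item[4], item[5]))
    return fields


row = ("005__%,020__%,041__%,080__%,245__%,260__%,"
       "270__%,300__%,909C0%,909C1%,909CS%,980__%")
for field in parse_affected_fields(row):
    print(field)
```

Written this way, the indicator positions are explicit, which is the part of the tag that the wildcard case below changes.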

However, let's see what happens when one MARC tag is defined via wildcard:

$ echo "SELECT id,name,value FROM tag WHERE value like '24%'" | /opt/invenio/bin/dbexec
id      name    value
3       main title      245__%
4       additional title        246__%
40      24x     24%
141     title   245__a
142     main abstract   245__a
171     240x    240%
172     242x    242%
173     243x    243%
174     244x    244%
175     247x    247%
$ echo "UPDATE tag SET value='245%a' WHERE value='245__%'" | /opt/invenio/bin/dbexec

and let's simulate cloning another record again and see if incremental indexing works:

$ echo "INSERT INTO bibrec VALUES (1235, NOW(), NOW(), 'marc')" | /opt/invenio/bin/dbexec
$ echo "SELECT MAX(id) FROM bibrec" | /opt/invenio/bin/dbexec
MAX(id)
1235
$ wget -O /tmp/z.xml http://pcuds06.cern.ch/record/34/export/xm
$ grep -v '"005"' /tmp/z.xml | sed 's,>34<,>1235<,g' > /tmp/zz.xml
$ colordiff -uw /tmp/z.xml /tmp/zz.xml
--- /tmp/z.xml  2015-01-22 12:55:51.047437225 +0100
+++ /tmp/zz.xml 2015-01-22 12:55:55.375437417 +0100
@@ -1,8 +1,7 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <collection xmlns="http://www.loc.gov/MARC21/slim">
 <record>
-  <controlfield tag="001">34</controlfield>
-  <controlfield tag="005">20150122124436.0</controlfield>
+  <controlfield tag="001">1235</controlfield>
   <datafield tag="020" ind1=" " ind2=" ">
     <subfield code="a">0387940758</subfield>
   </datafield>
$ sudo -u www-data /opt/invenio/bin/bibupload -r /tmp/zz.xml
$ sudo -u www-data /opt/invenio/bin/bibupload 14
$ echo "SELECT affected_fields FROM hstRECORD WHERE id_bibrec=1235" | /opt/invenio/bin/dbexec
affected_fields
005__%,020__%,041__%,080__%,245__%,260__%,270__%,300__%,909C0%,909C1%,909CS%,980__%
$ sudo -u www-data /opt/invenio/bin/bibindex -u admin -v 9
$ sudo -u www-data /opt/invenio/bin/bibindex 15
$ grep -o idxPHRASE[0-9][0-9] /opt/invenio/var/log/bibsched/0/bibsched_task_15.log | sort -u
idxPHRASE02
idxPHRASE07
idxPHRASE10
idxPHRASE26

Oops! Only some of the indexes were updated. Notably, title (id=8) and exacttitle (id=19) were not updated now that they are defined on 245%a rather than 245__%.
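
To illustrate why 245__% (from affected_fields) and 245%a (from the tag table) ought to be treated as related even though the two strings differ, here is a minimal Python sketch of an overlap test between two wildcard patterns. It is not how BibIndex decides what to update. It assumes that a full tag is six characters long, that `%` matches any run of characters, and that `_` is a literal blank-indicator marker. The names `expand` and `patterns_overlap` are hypothetical.

```python
from itertools import product

TAG_LENGTH = 6  # assumption: 3-character tag + ind1 + ind2 + subfield code


def expand(pattern, length=TAG_LENGTH):
    """Yield fixed-length templates for a pattern.

    Each '%' is replaced by zero or more '?' placeholders so that the
    template is exactly `length` characters; '?' means "any one character".
    """
    if "%" not in pattern:
        if len(pattern) == length:
            yield pattern
        return
    head, tail = pattern.split("%", 1)
    spare = length - (len(pattern) - pattern.count("%"))
    for k in range(spare + 1):
        yield from expand(head + "?" * k + tail, length)


def patterns_overlap(p, q, length=TAG_LENGTH):
    """Return True if some concrete tag of `length` chars matches both patterns."""
    for a, b in product(expand(p, length), expand(q, length)):
        if all(x == y or "?" in (x, y) for x, y in zip(a, b)):
            return True
    return False


print(patterns_overlap("245__%", "245%a"))  # True: 245__a matches both
print(patterns_overlap("260__%", "245%a"))  # False: different tag
print("245__%" == "245%a")                  # False: literal comparison misses it
```

Under these assumptions, the concrete tag 245__a satisfies both patterns, so a check of this kind would still select the title and exacttitle indexes for record 1235.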

Kennethhole commented 8 years ago

Here are some new use cases where we have similar problems:

_Records are only partially indexed for new uploads._ A search in the specific index does not find the title, authors or imprint. A search in the global index finds the author, imprint and 035.

The indexes are structured in the following way:

Title: 245%
Author: 100%, 700%
Imprint: 264%
miscellaneous: 035%, 26%, 100a,700a (default settings)

echo "SELECT job_date,affected_fields FROM hstRECORD WHERE id_bibrec=862175" | /opt/invenio/bin/dbexec
job_date    affected_fields
2015-11-23 13:27:55 003__%,005__%,008__%,010__%,020__%,035__%,040__%,049__%,05000%,08200%,1001_%,24510%,250__%,264_1%,300__%,336__%,337__%,338__%,504__%,650_0%,7001_%
2015-11-23 13:29:05 005__%,948__%
2015-11-23 13:29:35 005__%,980__%
2015-11-23 13:30:10 005__%,998__%

Relevant metadata:

<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Cox, Michael M.</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">(OCoLC)905380069</subfield>
</datafield>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Molecular biology :</subfield>
<subfield code="b">principles and practice /</subfield>
<subfield code="c">Michael M. Cox., University of Wisconsin-Madison, Jennifer A. Doudna, University of California, Berkeley, Michael O"Donnell, The Rockefeller University.
</subfield>
</datafield>
<datafield tag="264" ind1=" " ind2="1">
<subfield code="a">New York :</subfield>
<subfield code="b">W.H. Freeman & Company, a Macmillan Education Imprint,
</subfield>
<subfield code="c">[2015]</subfield>
</datafield>
<datafield tag="700" ind1="1" ind2=" ">
<subfield code="a">Doudna, Jennifer A.</subfield>
</datafield>
<datafield tag="700" ind1="1" ind2=" ">
<subfield code="a">O"Donnell, Michael</subfield>
<subfield code="c">(Biochemist)</subfield>
</datafield>

If we re-index everything, it becomes searchable.
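
As a cross-check of what should be searchable for this record, the following Python sketch matches the concrete tags read off the metadata above against the title, author and imprint index definitions. It assumes that `%` matches any run of characters and that `_` is a literal blank indicator (unlike the single-character wildcard of SQL LIKE). `like_to_regex` and `indexes_for` are hypothetical helpers, not Invenio functions.

```python
import re

# Index definitions as listed above (title, author and imprint only).
INDEX_TAGS = {
    "title": ["245%"],
    "author": ["100%", "700%"],
    "imprint": ["264%"],
}

# Concrete tags (tag + ind1 + ind2 + subfield code) taken from the quoted
# MARCXML; blank indicators are written as "_".
RECORD_TAGS = [
    "1001_a", "035__a", "24510a", "24510b", "24510c",
    "264_1a", "264_1b", "264_1c", "7001_a", "7001_c",
]


def like_to_regex(pattern):
    """Translate a tag pattern into a compiled regex.

    '%' becomes '.*'; every other character, including '_', is literal.
    """
    parts = (".*" if ch == "%" else re.escape(ch) for ch in pattern)
    return re.compile("".join(parts))


def indexes_for(tags, index_tags):
    """Map each index name to the record tags that its patterns match."""
    hits = {}
    for name, patterns in index_tags.items():
        regexes = [like_to_regex(p) for p in patterns]
        matched = sorted(t for t in tags if any(r.fullmatch(t) for r in regexes))
        if matched:
            hits[name] = matched
    return hits


for name, tags in indexes_for(RECORD_TAGS, INDEX_TAGS).items():
    print(name, tags)
```

With these inputs all three specific indexes match at least one field of the record, which is consistent with the record becoming searchable after a full re-index.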

_Records are not indexed after modifications._ A meeting name has been updated from the first form below to the second:

1112_ $$aInternational Telecommunication Conference$$d(1947 :$$cAtlantic City, United States of America)
1112_ $$aInternational Telecommunications Conference$$d(1947 :$$cAtlantic City, United States of America)

Only the word Telecommunication was changed, to Telecommunications.

The record is not searchable in the specific meeting name index (111%), but it is found by the global search, which covers the miscellaneous index (11%).

echo "SELECT job_date,affected_fields FROM hstRECORD WHERE id_bibrec=12079" | /opt/invenio/bin/dbexec
job_date    affected_fields
2015-08-18 18:44:42 000__%,005__%,008__%,1112_%,24510%,24602%,260__%,300__%,500__%,5050_%,518__%,650_4%,7102_%,7112_%,7670_%,8528_%,902__%,980__%
2015-10-08 18:19:33 000__%,005__%,008__%,1112_%,24510%,24602%,260__%,300__%,500__%,50500%,518__%,650_4%,7102_%,7112_%,7670_%,8528_%,902__%,980__%
2015-10-14 16:01:33 005__%,8528_%
2015-10-14 16:07:24 005__%,8528_%
2015-10-14 16:09:09 005__%,8528_%
2015-10-16 10:17:00 005__%,8528_%
2015-11-18 14:56:37 005__%,1112_%