metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0
69 stars 34 forks source link

Erratic behaviour with nested entity and choose in combination with flushWith #366

Open TobiasNx opened 3 years ago

TobiasNx commented 3 years ago

Working with lobid-resources I found a bug.

https://github.com/hbz/lobid-resources/blob/ccc34fc8590340c6fb72d4d9c884746090b7fbf8/src/main/resources/alma/common/subjects.xml#L4-L60

To add a type specific to the second indicator of a marc field in 610 and 650 I have introduced a choose collector with sameEntity="true" reset="true" flushWith="650??.a" (or 610??.a). The wrapping mother entity also has flushWith conditions. Both flushWith as well al the source data refrence are using wild cards.

    <entity name="" flushWith="650??" sameEntity="true" reset="true">
      <entity name="type[]" sameEntity="true" reset="true">
        <choose sameEntity="true" reset="true" flushWith="650??.a">
          <data source="650?0.a">
            <constant value="Concept"/>
          </data>
          <data source="650?4.a">
            <constant value="Keyword"/>
          </data>
        </choose>
      </entity>
...

When I run the lobid morph to test with: mvn failsafe:integration-test -Dit.test=AlmaMarc21XmlToLobidJsonTest

Results for this specific transformation process seem to be erratic. They are inconsistent internally as well as between different test runs. But is does not always create these inconsistencies between runs. https://github.com/hbz/lobid-resources/commit/ccc34fc8590340c6fb72d4d9c884746090b7fbf8

@dr0i saw it live. When we changed the flushWith-attribute of choose to the entity level (610?? instead of 610??.a). The bug does not appear.

blackwinter commented 3 years ago

I don't think I can confirm.

First of all, AlmaMarc21XmlToLobidJsonTest.setup() sets GENERATE_TESTDATA = true, so the test doesn't actually test anything (AFAIUI).

But even after removing that line, executing mvn clean test-compile failsafe:integration-test -Dit.test=AlmaMarc21XmlToLobidJsonTest always yields the exact same comparison failure (see below). No erratic behaviour observed here.

Can you be more specific as to what should be the expected outcome?

expected:<...subject" : [ {
    "[label" : "Berlin",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Bremen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Hamburg",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Nordrhein-Westfalen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Niedersachsen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Rheinland-Pfalz",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Saarland",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {]
    "label" : "Schl...> but was:<...subject" : [ {
    "[type" : [ "Keyword" ],
    "label" : "Berlin",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Bremen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Hamburg",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Nordrhein-Westfalen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Niedersachsen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Rheinland-Pfalz",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Saarland",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],]
    "label" : "Schl...>
        at org.lobid.resources.AlmaMarc21XmlToLobidJsonTest.lambda$transformFiles$1(AlmaMarc21XmlToLobidJsonTest.java:174)
        at org.lobid.resources.AlmaMarc21XmlToLobidJsonTest.transformFiles(AlmaMarc21XmlToLobidJsonTest.java:137)
blackwinter commented 3 years ago

There's almost definitely something fishy going on... When shuffling the file list, catching the AssertionError and then just printing the failing file names, it becomes apparent that the failures are order-dependent:

(DE-605)HT003176544
(DE-605)TT050421649
(DE-605)HT015671602
(DE-605)HT017411546
(DE-605)HT017411546
(DE-605)HT003176544
(DE-605)HT003176544
(DE-605)HT020202475
(DE-605)HT015671602
(DE-605)HT017411546
(DE-605)HT017411546
(DE-605)HT020202475
(DE-605)HT003176544
(DE-605)TT050421649
(DE-605)TT050421649
(DE-605)HT017411546
(DE-605)TT050421649
(DE-605)HT020202475
(DE-605)HT017411546

While they're stable when iterating in fixed (directory) order:

(DE-605)HT003176544
(DE-605)HT017411546
(DE-605)HT003176544
(DE-605)HT017411546
(DE-605)HT003176544
(DE-605)HT017411546
(DE-605)HT003176544
(DE-605)HT017411546

It's also not limited to the <choose> elements in subjects.xml:

(DE-605)HT017411546 expected:<..." : "Münster",
    "[]publicationHistory" ...> but was:<..." : "Münster",
    "[location" : "Münster",
    "]publicationHistory" ...>

Which originates from a - seemingly - inconspicuous <data> element (via macro regex-del-punctuation-end) in titleRelatedFields.xml.

TobiasNx commented 3 years ago

Can you be more specific as to what should be the expected outcome?

It is expected that 610 and 650 get a "type" : [ "Keyword" ] or "type" : [ "Concept" ] based on the specific second indicator of that field.

the internal inconsistency is shown in this two cases:

(DE-605)HT003176544.json:

    "label" : "Schleswig-Holstein",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Hotel- und Gaststättengewerbe",
    "source" : {
      "label" : "Freie Verschlagwortung",

The 610 and the 650 even if they are the same workflow 610 in this example provides the type and 650 does not.

We only have one 610 testfile.

Unfortunately not shown in the the catched example 650 also does not work always and even can give different results in different testfiles . It then can happen that HT003176544 and HT020202475 have different results with this workflow. We saw the case that: While HT003176544: "label" : "Hotel- und Gaststättengewerbe" has no type HT020202475: "label" : "Critical theory" and "label" : "Electronic books" have or the other way around.

(DE-605)HT020202475.json

  "subject" : [ {
    "type" : [ "Keyword" ],
    "label" : "Critical theory",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Electronic books",
    "source" : {
      "label" : "Freie Verschlagwortung",

Your second case:

It's also not limited to the elements in subjects.xml:

(DE-605)HT017411546 expected:<..." : "Münster",
    "[]publicationHistory" ...> but was:<..." : "Münster",
    "[location" : "Münster",
    "]publicationHistory" ...>

Which originates from a - seemingly - inconspicuous element (via macro regex-del-punctuation-end) in titleRelatedFields.xml.

Could this due to the fact that the source data has duplicates?

  <datafield tag="264" ind1="3" ind2="1">
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Regensberg</subfield>
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Historische Kommission, Landschaftsverband Westfalen-Lippe</subfield>
    <subfield code="c">1951-2004</subfield>
    <subfield code="3">1951-2004</subfield>
  </datafield>
  <datafield tag="264" ind1="2" ind2="1">
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Regensberg</subfield>
    <subfield code="3">1951-1977</subfield>
  </datafield>

Münster appears multiple times.

dr0i commented 3 years ago

Sorry for intervening late: the tests are not working and we have an issue for that. For the rest of the problem I can't say because I haven't checked that on my own yet.

blackwinter commented 3 years ago

the tests are not working

I suspect both issues have exactly the same root cause. Your tests don't give reliable results (seems to be the location issue mentioned above), because of the inconsistency between generating the test data and testing them (as well as between different test runs).

I was able to reproduce the erratic behaviour with a reduced test case (HT020391499 and HT003176544, in this order):

Possible outcomes for HT003176544 (as of hbz/lobid-resources@ccc34fc):

  1. All 610 fields gain type.
  2. The 650 field loses type.
  3. Both 1 and 2.
  4. No changes.

Test scenarios:

  1. Generating test data while also extracting from the tarball: 20/20 outcome 1.
  2. Generating test data without extracting from the tarball: 20/20 outcome 3.
  3. Testing pristine test data: 19/30 outcome 1, 11/30 outcome 2.

Control test with only HT003176544:

  1. Scenario 1: 19/20 outcome 2, 1/20 outcome 3.
  2. Scenario 2: 18/20 outcome 3, 2/20 outcome 1.
  3. Scenario 3: 30/30 outcome 4.

Which leads to the following questions:

  1. Why does extracting from the tarball affect the outcome of generate runs? (1 vs. 3 and 2 + 3 vs. 1 + 3)
  2. Why do generate runs oscillate between different outcomes? (2 + 3 vs. 1 + 3)
  3. Why does one extra test file affect the outcome of test runs? (1 + 2 vs. 4)
  4. Why do test runs oscillate between different outcomes? (1 + 2)
blackwinter commented 3 years ago

So here's a heavily reduced test case that hopefully still catches the essence of this issue (at least w.r.t. questions 2 and 4, maybe 3). You can run it with the following Gradle invocation from the metamorph subdirectory:

 ../gradlew --rerun-tasks test --tests org.metafacture.metamorph.Issue366Test -Dissue366.verbose=true -Dissue366.iter=100

It shows that option 0 (the issue at hand) fails 50% of the time (misses k) and that options 1-3 (small modifications to the morph) always pass (while option 1 only differs from options 2-3 in that it yields a different order in the output - as expected).

Input:

{"10012":{"a":"V"}}

Option 0:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??.a">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Output 0:

{"r":{"v":"V"}}         // 47/100
{"r":{"k":"K","v":"V"}} // 53/100

Option 1:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Diff 1:

--- option0.xml
+++ option1.xml
@@ -1,7 +1,7 @@
 <metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
   <rules>
     <entity name="r" flushWith="100??">
-      <choose flushWith="100??.a">
+      <choose flushWith="100??">
         <data name="k" source="100?2.a">
           <constant value="K"/>
         </data>

Output 1:

{"r":{"v":"V","k":"K"}}

Option 2:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??.a">
        <data name="k" source="100??.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Diff 2:

--- option0.xml
+++ option2.xml
@@ -2,7 +2,7 @@
   <rules>
     <entity name="r" flushWith="100??">
       <choose flushWith="100??.a">
-        <data name="k" source="100?2.a">
+        <data name="k" source="100??.a">
           <constant value="K"/>
         </data>
       </choose>

Output 2:

{"r":{"k":"K","v":"V"}}

Option 3:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100?2.a">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100?2.a" />
    </entity>
  </rules>
</metamorph>

Diff 3:

--- option0.xml
+++ option3.xml
@@ -1,12 +1,12 @@
 <metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
   <rules>
     <entity name="r" flushWith="100??">
-      <choose flushWith="100??.a">
+      <choose flushWith="100?2.a">
         <data name="k" source="100?2.a">
           <constant value="K"/>
         </data>
       </choose>
-      <data name="v" source="100??.a" />
+      <data name="v" source="100?2.a" />
     </entity>
   </rules>
 </metamorph>

Output 3:

{"r":{"k":"K","v":"V"}}