korpling / annatto

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests.
Apache License 2.0

Weird behaviour in complex workflow featuring xlsx export #228

Open MartinKl opened 2 months ago

MartinKl commented 2 months ago

When running annatto in disk mode, annotation layers seem to get lost in workflows with many manipulations (possibly because the export starts before the annotation storage is ready). In memory mode, however, everything passes.

The following workflow failed to export the annotation norm::auto_lemma when run on disk, but succeeded in memory (the details matter less than the overall complexity). Note that only the export to xlsx showed this behaviour; the graphml file always contained the lemma layer.

format = "xlsx"
path = "./Luther/2_Exceldateien/Fabeln/"

column_map = { norm = [], edition = [], text = [] }

format = "treetagger"
path = "Luther/6_rnn-tagger-output/Fabeln-tagged/"
config = {}

action = "split"

delimiter = "."
anno = "default_ns::pos"
index_map = { "norm::auto_pos" = 1 }
keep = false

"norm::Gender" = ["Fem", "Neut", "Mask"]
"norm::Case" = ["Nom", "Gen", "Dat", "Akk"]
"norm::Degree" = ["Pos", "Komp", "Sup"]
"norm::Tense" = ["Präs", "Prät"]
"norm::Mood" = ["Ind", "Konj"]
"norm::Number" = ["Sg", "Pl"]
"norm::VerbClass" = ["Sw", "St", "Unr"]
"norm::Person" = ["1", "2", "3"]

action = "enumerate"

queries = ["tok @* annis:doc=/Lut_F_0Vorrede_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_10Hund_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_11Mogenhofer_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_12Esel_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_13Stadtmaus_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_14Rabe_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_1Torheit_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_2Hass_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_3Untreu_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_4Neid_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_5Geiz_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_6Frevel_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_7_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_8Dieb_tg/ @* annis:node_name=/Fabeln-tagged/", "tok @* annis:doc=/Lut_F_9Kranich_tg/ @* annis:node_name=/Fabeln-tagged/"]
target = 1
label_ns = "source"
label_name = "id"
start = 0
value = 2

action = "enumerate"

queries = ["norm @* annis:doc=/Lut_F_0Vorrede_tg/", "norm @* annis:doc=/Lut_F_10Hund_tg/", "norm @* annis:doc=/Lut_F_11Mogenhofer_tg/", "norm @* annis:doc=/Lut_F_12Esel_tg/", "norm @* annis:doc=/Lut_F_13Stadtmaus_tg/", "norm @* annis:doc=/Lut_F_14Rabe_tg/", "norm @* annis:doc=/Lut_F_1Torheit_tg/", "norm @* annis:doc=/Lut_F_2Hass_tg/", "norm @* annis:doc=/Lut_F_3Untreu_tg/", "norm @* annis:doc=/Lut_F_4Neid_tg/", "norm @* annis:doc=/Lut_F_5Geiz_tg/", "norm @* annis:doc=/Lut_F_6Frevel_tg/", "norm @* annis:doc=/Lut_F_7_tg/", "norm @* annis:doc=/Lut_F_8Dieb_tg/", "norm @* annis:doc=/Lut_F_9Kranich_tg/"]
target = 1
label_ns = "target"
label_name = "id"
start = 0
value = 2

action = "check"

query = "source:id"
expected = [1, inf]
description = "There are source annotations"

query = "target:id"
expected = [1, inf]
description = "There are target annotations"

action = "link"

source_query = "source:id"
source_node = 1
source_value = [1]
target_query = "target:id"
target_node = 1
target_value = [1]
link_type = "Pointing"
link_name = "align"

action = "check"

query = "node ->align node"
expected = "norm"
description = "There are as many alignment edges as there are norm annotations."

query = "node? !->align norm"
expected = 0
description = "There is no norm node without an ingoing alignment edge."

action = "collapse"

ctype = "Pointing"
layer = ""
name = "align"
disjoint = true

action = "revise"

remove_nodes = ["Fabeln-tagged", "Fabeln-tagged/Lut_F_0Vorrede_tg", "Fabeln-tagged/Lut_F_10Hund_tg", "Fabeln-tagged/Lut_F_11Mogenhofer_tg", "Fabeln-tagged/Lut_F_12Esel_tg", "Fabeln-tagged/Lut_F_13Stadtmaus_tg", "Fabeln-tagged/Lut_F_14Rabe_tg", "Fabeln-tagged/Lut_F_1Torheit_tg", "Fabeln-tagged/Lut_F_2Hass_tg", "Fabeln-tagged/Lut_F_3Untreu_tg", "Fabeln-tagged/Lut_F_4Neid_tg", "Fabeln-tagged/Lut_F_5Geiz_tg", "Fabeln-tagged/Lut_F_6Frevel_tg", "Fabeln-tagged/Lut_F_7_tg", "Fabeln-tagged/Lut_F_8Dieb_tg", "Fabeln-tagged/Lut_F_9Kranich_tg", "Fabeln-tagged/Lut_F_0Vorrede_tg#text", "Fabeln-tagged/Lut_F_10Hund_tg#text", "Fabeln-tagged/Lut_F_11Mogenhofer_tg#text", "Fabeln-tagged/Lut_F_12Esel_tg#text", "Fabeln-tagged/Lut_F_13Stadtmaus_tg#text", "Fabeln-tagged/Lut_F_14Rabe_tg#text", "Fabeln-tagged/Lut_F_1Torheit_tg#text", "Fabeln-tagged/Lut_F_2Hass_tg#text", "Fabeln-tagged/Lut_F_3Untreu_tg#text", "Fabeln-tagged/Lut_F_4Neid_tg#text", "Fabeln-tagged/Lut_F_5Geiz_tg#text", "Fabeln-tagged/Lut_F_6Frevel_tg#text", "Fabeln-tagged/Lut_F_7_tg#text", "Fabeln-tagged/Lut_F_8Dieb_tg#text", "Fabeln-tagged/Lut_F_9Kranich_tg#text"]

"source::id" = ""
"target::id" = ""
"default_ns::lemma" = "norm::auto_lemma"

format = "graphml"
path = "./"
config = {  }

format = "xlsx"
path = "xlsx-with-tags/"
config = { include_namespace = false, annotation_order = ["edition::edition", "text::text", "norm::norm", "norm::auto_pos", "default_ns::lemma", "norm::auto_lemma", "norm::Case", "norm::Degree", "norm::Gender", "norm::Mood", "norm::Number", "norm::Person", "norm::Tense", "norm::VerbClass"] }

MartinKl commented 2 months ago

When running in disk mode, this line sometimes yields true and sometimes false for the node holding the lemma annotation: https://github.com/korpling/annatto/blob/78b4d0471f13a97f2b6c3ce4efd403dc22977693/src/exporter/xlsx.rs#L80

In contrast, in memory mode it always returns false.

MartinKl commented 2 months ago

After importing the graphml file with the annis CLI, the query tok _ident_ auto_lemma returns 0 matches.

Taken together, this points to a storage that has not been completely updated, probably the storage of the Coverage component, which influences the result of is_token.

MartinKl commented 1 week ago

We figured out that this happens because Coverage components can get unloaded in workflow steps that use a CorpusStorage. Even though AQL queries can now be executed on graphs directly, which would avoid the unloading, removing CorpusStorages is not an option right now, since many graph_ops rely on the correct order of results, which only CorpusStorage provides.

When run in memory, either the CorpusStorage does not unload the Coverage component or it is simply fast enough at reloading it, so the bug does not occur.
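
To illustrate the mechanism, here is a minimal toy model in Rust. It is not annatto or graphANNIS code: ToyGraph and all of its fields and methods are hypothetical, and it assumes that a token check treats a node as a token when it carries a tok annotation and has no outgoing Coverage edges. Under that assumption, a span node that also carries a tok annotation is misclassified as soon as the Coverage component is not loaded.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-in for a graph whose Coverage component can be unloaded.
/// Hypothetical type, not part of graphANNIS or annatto.
struct ToyGraph {
    /// Nodes that carry a `tok` annotation.
    tok_annos: HashSet<u64>,
    /// Outgoing Coverage edges as persisted on disk.
    coverage_on_disk: HashMap<u64, Vec<u64>>,
    /// In-memory copy of the Coverage component; `None` means "not loaded".
    coverage_loaded: Option<HashMap<u64, Vec<u64>>>,
}

impl ToyGraph {
    fn load_coverage(&mut self) {
        self.coverage_loaded = Some(self.coverage_on_disk.clone());
    }

    fn unload_coverage(&mut self) {
        self.coverage_loaded = None;
    }

    /// An unloaded component looks exactly like "no outgoing edges".
    fn has_outgoing_coverage(&self, node: u64) -> bool {
        self.coverage_loaded
            .as_ref()
            .and_then(|edges| edges.get(&node))
            .map_or(false, |targets| !targets.is_empty())
    }

    /// Assumed token check: `tok` annotation present and nothing covered.
    fn is_token(&self, node: u64) -> bool {
        self.tok_annos.contains(&node) && !self.has_outgoing_coverage(node)
    }
}

/// Returns `is_token` for the same span node, first with the Coverage
/// component loaded and then after it has been unloaded.
pub fn demo() -> (bool, bool) {
    let span = 10; // a span node covering the base tokens 1 and 2
    let mut graph = ToyGraph {
        tok_annos: HashSet::from([1, 2, span]),
        coverage_on_disk: HashMap::from([(span, vec![1, 2])]),
        coverage_loaded: None,
    };

    graph.load_coverage();
    let with_coverage = graph.is_token(span);

    graph.unload_coverage();
    let without_coverage = graph.is_token(span);

    (with_coverage, without_coverage)
}
```

In this sketch demo() yields false while the component is loaded and true once it has been unloaded, which mirrors the observation above that the check is stable in memory but flips on disk. A real fix would have to make sure the required components are loaded before the exporter runs the check; how to do that with the actual storage API is outside the scope of this sketch.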