Superfluous ancestors in coherence in attestations graph

jjmccollum commented 2 years ago

While preparing my 2022 Greek Paul Project Webinar talk on the CBGM, I noticed that some coherence in attestations diagrams feature superfluous textual flow ancestors outside of the attestation of the specified reading:

B25K1V5U4Rb-coherence-attestations

To reproduce this issue:

Copy the 3_john_collation_complete.xml file from the open-cbgm/examples directory into your active directory.
Populate an example genealogical database with populate_db.exe -t 100 -z ambiguous -Z lac 3_john_collation_complete.xml cache.db.
Generate the coherence in attestations diagrams with print_textual_flow.exe --attestations cache.db B25K1V5U4.
Generate a PDF of the diagram for reading b with dot -Tpdf -o attestations\B25K1V5U4Rb-coherence-attestations.pdf attestations\B25K1V5U4Rb-coherence-attestations.dot.

I suspect this has something to do with the way that the textual_flow class stores multiple textual flow ancestors with different readings for the same witness. This is intended only to be used in the coherence in variant readings diagram (in which all changes of reading within the connectivity limit are highlighted), but I must have failed to add a check to ensure that this doesn't happen in the textual_flow::coherence_in_attestations_to_dot method. Invoking

find_relatives.exe cache.db 104 B25K1V5U4 c

and

find_relatives.exe cache.db 459 B25K1V5U4 c

reveals that the superfluous ancestor 1827 is indeed within the connectivity limit for both 104 and 459.

Naturally, this problem only arises in variation units with more than one variant reading. This issue can also be observed in the following units:

B25K1V6U8-10, reading d
B25K1V12U28-30, reading c (while A, with reading a, is the fifth-closest ancestor to 35, the only superfluous ancestor with reading a is 1845, is the fifth-closest ancestor of 1890).

jjmccollum commented 2 years ago

Here's the relevant portion of code in textual_flow::coherence_in_attestations_to_dot:

        //Then add a secondary set of vertices for the primary ancestors of these witnesses with a different reading,
    //along with edges indicating changes in reading:
    unordered_set<string> secondary_set = unordered_set<string>();
    unordered_set<string> processed_destinations = unordered_set<string>();
    for (textual_flow_edge e : edges) {
        //Get the endpoints' IDs:
        string ancestor_id = e.ancestor;
        string descendant_id = e.descendant;
        //If the descendant is not in the primary vertex set, or the ancestor is, then skip:
        if (primary_set.find(descendant_id) == primary_set.end() || primary_set.find(ancestor_id) != primary_set.end()) {
            continue;
        }
        //Otherwise, we have an ancestor with a reading other than the specified one;
        //skip it if we've added it already:
        if (secondary_set.find(ancestor_id) != secondary_set.end()) {
            continue;
        }
        //If the destination already has an edge drawn to it, then skip any secondary ancestors:
        if (processed_destinations.find(descendant_id) != processed_destinations.end()) {
            continue;
        }
        //Otherwise, add the destination node's ID to the processed set:
        processed_destinations.insert(descendant_id);
        //If it's new, then add a vertex for it:
        int ancestor_ind = id_to_index.at(ancestor_id);
        textual_flow_vertex v = indexed_vertices[ancestor_ind];
        string ancestor_rdg = v.rdg;
        out << "\t\t" << ancestor_ind;
        out << " [label=\"" << ancestor_id << " (" << ancestor_rdg << ")\", color=blue, style=dashed]";
        out << ";\n";
        secondary_set.insert(ancestor_id);
    }

For witnesses like 104 and 459 (i.e., the highest-priority witnesses to the reading they have here), the constructor for the textual_flow class should add one textual flow ancestry edge for each closest potential ancestor with a distinct reading within the connectivity limit. So for both of them, the edge from 049 (reading a) should be added first, and 1827 (reading c) should be added second. In theory, then, the loop above should encounter the edge (049, 104) before it encounters (1827, 104), and the edge (049, 459) before it encounters (1827, 459). I'll need to add some checks to see if something else is in fact happening.

jjmccollum commented 2 years ago

Okay, I think I've resolved the issue. I wasn't marking the descendant nodes as processed when their ancestors with a different reading had already been added the secondary set. The following updated code ensures that the superfluous ancestors are not added:

    //Initially filter out any vertices that do not have the given reading:
    list<textual_flow_vertex> coherence_in_attestations_vertices = list<textual_flow_vertex>(vertices);
    coherence_in_attestations_vertices.remove_if([&](const textual_flow_vertex & v) {
        return v.rdg != rdg;
    });
    //Add all of the graph edges ending at one of the remaining vertices, except for secondary graph edges for changes:
    unordered_set<string> witnesses_with_rdg = unordered_set<string>();
    for (textual_flow_vertex v : coherence_in_attestations_vertices) {
        witnesses_with_rdg.insert(v.id);
    }
    list<textual_flow_edge> edges_with_descendants_with_rdg = list<textual_flow_edge>(edges);
    edges_with_descendants_with_rdg.remove_if([&](const textual_flow_edge & e) {
        return witnesses_with_rdg.find(e.descendant) == witnesses_with_rdg.end();
    });
    //Then add the secondary graph edges, taking only the first one for each descendant:
    unordered_set<string> processed_descendants = unordered_set<string>();
    list<textual_flow_edge> distinct_descendant_edges = list<textual_flow_edge>();
    for (textual_flow_edge e : edges_with_descendants_with_rdg) {
        if (processed_descendants.find(e.descendant) != processed_descendants.end()) {
        continue;
    }
    processed_descendants.insert(e.descendant);
    distinct_descendant_edges.push_back(e);
    }
    list<textual_flow_edge> coherence_in_attestations_edges = list<textual_flow_edge>(distinct_descendant_edges);
    //Add any ancestors on these edges that do not have the given reading:
    unordered_set<string> ancestors_without_rdg = unordered_set<string>();
    for (textual_flow_edge e : distinct_descendant_edges) {
    if (witnesses_with_rdg.find(e.ancestor) == witnesses_with_rdg.end()) {
        ancestors_without_rdg.insert(e.ancestor);
    }
    }
    for (textual_flow_vertex v : vertices) {
    if (ancestors_without_rdg.find(v.id) != ancestors_without_rdg.end()) {
        coherence_in_attestations_vertices.push_back(v);
        }
    }

These changes are reflected in the latest push.

jjmccollum / open-cbgm

Superfluous ancestors in coherence in attestations graph #5