quangis / transforge

Describe processes as type transformations, with inference that supports subtypes and parametric polymorphism. Create and query corresponding transformation graphs.
GNU General Public License v3.0

Source reuse may cause type mismatches #110

Closed: nsbgn closed this issue 2 years ago

nsbgn commented 2 years ago

Reusing a source may cause an overly general type to be inferred the first time the source is encountered, causing a type mismatch the second time around.

The first time we encounter this is in solution113.ttl as generated by https://github.com/quangis/quangis-workflow-generator/tree/58735e1372b5a8d13463ece60ac8617d160560cb. To reproduce, do the following:

from cct import cct
from transformation_algebra import Source

# This source is shared between both expressions below.
s = Source()

# ZonalStatisticsMeanInterval: s is the second input, 2: ObjectInfo(Nom)
expr1 = cct.parse("""
1: Field(Itv);
2: ObjectInfo(Nom);
join_attr
    (get_attrL 2)
    (apply1 (fcont avg 1) (get_attrL 2))
""", Source(), s)

# SpatialJoinSumTessRatio: s is the first input, 1: ObjectInfo(Ratio)
expr2 = cct.parse("""
1: ObjectInfo(Ratio);
2: ObjectInfo(Nom);
join_attr
    (get_attrL 2)
    (join (get_attrL 2) (groupbyR sum (join_key
        (select eq (rTopo
            (pi2 (get_attrL 1))
            (pi2 (get_attrL 2))
        ) in)
        (getamounts 1)
    )))
""", s, Source())

print(expr1.tree())
print(expr2.tree())

Swapping expr1 and expr2 solves the issue.

This happens because a source's type is fixed as soon as we encounter it, which is fine if the source is only used once, but may be incorrect if it is encountered more than once. Passing over the structure twice would solve this.
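
To make the mechanism concrete, here is a toy model of a source whose type is fixed at first encounter. It assumes a simple chain of measurement scales, Ratio <= Itv <= Ord <= Nom, loosely mirroring CCT's levels of measurement; none of the names below are transforge's actual internals.

SCALES = ["Ratio", "Itv", "Ord", "Nom"]  # most specific to most general

def subtype(a, b):
    """a <= b: a is at least as specific as b."""
    return SCALES.index(a) <= SCALES.index(b)

class SharedSource:
    """A source whose type is fixed the first time it is used."""

    def __init__(self):
        self.fixed = None

    def use(self, expected):
        if self.fixed is None:
            # First encounter: fix the source to whatever this application
            # expects, which may be more general than its actual type.
            self.fixed = expected
        elif not subtype(self.fixed, expected):
            raise TypeError(f"{self.fixed} is not a subtype of {expected}")

s = SharedSource()
s.use("Nom")        # the first use fixes the source at the general Nom...
try:
    s.use("Ratio")  # ...so a later use that needs Ratio fails
except TypeError as e:
    print(e)        # Nom is not a subtype of Ratio

Swapping the two calls succeeds, since Ratio <= Nom: the outcome depends on the order in which the uses are encountered, just as with expr1 and expr2 above.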

nsbgn commented 2 years ago

Cf 7093485

nsbgn commented 2 years ago

Type fixing gets very precarious when the output types of tool applications are derived from the input types.

Some form of this problem will always remain, because we are mixing the unification of types (i.e. saying that x and y are the same when both are variables) with subtype constraints on them (i.e. saying that x is a subtype of A when A is a concrete type). That is fairly fundamental.

While it is fine to deduce from an application of f: x ** x ** x to A that the output type of f must satisfy x >= A, we are essentially doing the reverse for sources: when we have f (1: A), we try to deduce the input type, which we unify with x. We know from 1: A that it must be <= A, and from the application of f that it must be >= A. But that is reasoning about x as it appears in the type signature of this particular application of f; the type of the source input itself can always be more specific when it is used in another application.
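
A compact sketch of that bounds reasoning, over the same toy chain Ratio <= Itv <= Ord <= Nom used above (a variable here just accumulates a lower and an upper bound; this is an illustration, not transforge's representation):

SCALES = ["Ratio", "Itv", "Ord", "Nom"]

def subtype(a, b):
    return SCALES.index(a) <= SCALES.index(b)

class Var:
    """A type variable with a lower and an upper bound."""

    def __init__(self):
        self.lower, self.upper = "Ratio", "Nom"  # bottom and top of the chain

    def at_least(self, t):  # constrain x >= t
        if not subtype(t, self.upper):
            raise TypeError(f"no type between {t} and {self.upper}")
        if subtype(self.lower, t):
            self.lower = t

    def at_most(self, t):  # constrain x <= t
        if not subtype(self.lower, t):
            raise TypeError(f"no type between {self.lower} and {t}")
        if subtype(t, self.upper):
            self.upper = t

# Applying f: x ** x ** x to a source annotated Itv:
x = Var()
x.at_most("Itv")   # from the annotation 1: Itv, the argument is <= Itv
x.at_least("Itv")  # from the application itself, it must be >= Itv
print(x.lower, x.upper)  # Itv Itv: pinned within this application, even
                         # though the source may be more specific elsewhere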

  1. Can we one-way decouple the type of the application from the type of the source itself? I'm not sure how I would approach that.
  2. Labelling all sources with their concept type before parsing the tool applications would be ideal. It avoids the issue of 'guessing' input types, and it makes sure that we're always really working with the most specific type possible (rather than the most general type a tool happens to accept). However, this is not always an option: generated workflows don't include CCT types on data.
  3. Not sharing concept nodes for sources would also solve the problem. Every time a source is used, a new node would be made with the most general type for that particular application. However, there are a couple of caveats. First, more specific types would not carry over into the output. Second, tool applications are reused (and ideally this would remain so), so this behaviour would be inconsistent with that. Consequently, it would also still be possible to craft a workflow in which a tool application (rather than a source) produces an overly general type for the next tool application.
  4. I think the best way will be to do two passes when a source is encountered: the first pass to find the most specific type across all expressions, the second to actually connect all the expressions (see the sketch after this list). This is not perfect, because expressions are treated as islands and we might not always find a concrete type for each source. But if all sources are annotated within a tool, this is a middle ground between (2) and (3).
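
A minimal sketch of option (4), again over the toy scale chain Ratio <= Itv <= Ord <= Nom; annotations are simplified to bare scale names, and this is not transforge's API:

SCALES = ["Ratio", "Itv", "Ord", "Nom"]

def more_specific(a, b):
    """The more specific of two types in the chain."""
    return a if SCALES.index(a) <= SCALES.index(b) else b

def first_pass(expressions):
    """Collect the most specific annotation for each source across all
    expressions, given as {source_id: annotated_type} dictionaries."""
    best = {}
    for annotations in expressions:
        for src, t in annotations.items():
            best[src] = more_specific(t, best[src]) if src in best else t
    return best

# The shared source above is annotated ObjectInfo(Nom) in one expression
# and ObjectInfo(Ratio) in the other, so the first pass settles on Ratio;
# the second pass can then parse the expressions in any order with the
# source's type fixed up front.
print(first_pass([{"s": "Nom"}, {"s": "Ratio"}]))  # {'s': 'Ratio'}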

Hope this is somewhat clear for posterity, because it'll definitely be a good candidate for revisiting.

nsbgn commented 2 years ago

From the previous commit:

It does reduce the errors in generated workflows from github.com/quangis/quangis-workflow-generator; they go from 44 down to 27:

   $ transforge graph cct -T build/tools.ttl solutions/s*.ttl --skip-error 2>&1 | grep Skipping | wc -l
   27

However, it is not enough to eliminate all errors. I think some errors are due to sources actually being different, with APE just treating all sources that have matching CCD types as the same?

nsbgn commented 2 years ago

Now reproduce with:

from cct.language import cct
from transforge import TransformationGraph
from transforge.workflow import WorkflowDict
from transforge.namespace import EX

# Each tool application maps to its CCT expression and its list of inputs;
# EX.ca72 feeds both of the last two applications, and the final set lists
# the workflow's source nodes.
wf = WorkflowDict(EX.root, {
    EX.IDWInterval: (
        "interpol (1: PointMeasures) (deify (-: Reg))",
        [EX.aed1]
    ),
    EX.ZonalStatisticsMeanInterval: (
        """
        1: Field(Itv);
        2: ObjectInfo(Nom);
        join_attr
            (get_attrL 2)
            (apply1 (fcont avg 1) (get_attrL 2))
        """,
        [EX.IDWInterval, EX.ca72]),
    EX.SpatialJoinSumTessRatio: (
        """
        1: ObjectInfo(Ratio);
        2: ObjectInfo(Nom);
        join_attr
            (get_attrL 2)
            (join (get_attrL 2) (groupbyR sum (join_key
                (select eq (rTopo
                    (pi2 (get_attrL 1))
                    (pi2 (get_attrL 2))
                ) in)
                (getamounts 1)
            )))
        """,
        [EX.ca72, EX.ZonalStatisticsMeanInterval])
}, {EX.ca72, EX.aed1})

g = TransformationGraph(cct)
g.add_workflow(wf)

nsbgn commented 2 years ago

The problem is that, when types are fixed as the final step of the process rather than immediately after application, we often end up with types like x [x <= Itv, x >= Itv] rather than plain Itv. When we then try to make that a subtype of Nom, it fails, because the check Nom <= Itv fails; whereas if we just had Itv, it would have succeeded.
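
One plausible reading of that failure, with the toy chain from earlier in this thread, assuming the subtype check on a bounded variable compares the new bound against the existing one:

SCALES = ["Ratio", "Itv", "Ord", "Nom"]

def subtype(a, b):
    return SCALES.index(a) <= SCALES.index(b)

# With the concrete type, the check goes through: Itv <= Nom holds.
print(subtype("Itv", "Nom"))  # True

# With a variable pinned to Itv <= x <= Itv, requiring "x is a subtype of
# Nom" by tightening x's upper bound compares the new bound Nom against
# the old bound Itv; Nom <= Itv fails, even though the only type x can
# take, Itv, is itself a subtype of Nom.
upper = "Itv"
new_upper = "Nom"
print(subtype(new_upper, upper))  # False: the check that fails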

We can change type fixing to happen immediately again, and this does fix our issue. The errors are down to 8, and they look like annotation errors rather than inference errors.

However, this would not be the 'final', proper way to solve it. I think the types of values should not be as intertwined with the types of function applications; this will be another issue.