isi-vista / adam

Abduction to Demonstrate an Articulate Machine

Backfilling experiment #1130

Open spigo900 opened 2 years ago

spigo900 commented 2 years ago

We would like to run an experiment in backfilling. @marjorief gave a motivating example: suppose we're looking at a black-and-white photo of a person eating an apple. We should be able to say, "this is probably an apple or an orange," and we should be able to fill in from there: it is probably red, green, or orange in color. The idea here is: train an action learner as usual, then at test time cause object recognition to (more or less) fail, either by ablating features or directly. We would then measure how well we are able to backfill using the affordance learner (#1129).

In the "ablating features" instantiation of this which we discussed yesterday, @blakeharrison-ai would output grayscale images (or do postprocessing to remove color) before @shengcheng does the curriculum decode. An issue with this approach is that I expect the GNN to simply output the wrong object concept, in which case I think we will either (1) successfully match the wrong pattern/recognize the wrong thing, so no backfilling takes place, or (2) fail to match any pattern, so that the object never gets replaced and we fail to recognize the action (because the slot nodes can only match to ObjectSemanticNode s, not ObjectClusterNode s). I think (1) may not be problematic if we're okay with running backfilling regardless of whether we know what the object is, but (2) seems like more of a problem. @lichtefeld, do you have thoughts on this problem?

The other approach might be to directly replace the object semantic node with "I don't know." This is hard to do with our existing learners. However, I think we can do a hack where we define a new GNN recognizer that knows when it's being evaluated and, at test time, chooses one of the objects in the scene to replace with an "I don't know" semantic node. (That would mean setting up a new "unrecognized object" concept.) That way we know what concept the object was labeled with, and it is not the same as any "real" concept, but the object still has a semantic node, so it can be recognized as taking part in the action. As a bonus, we also know we're ablating only one object.
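
A rough sketch of what that hack could look like, using simplified stand-ins for the real perception-graph classes (the class and field names here are illustrative, not ADAM's actual API):

```python
# Sketch of the "unrecognized object" hack with stand-in node/concept classes.
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ObjectConcept:
    name: str


@dataclass
class ObjectSemanticNode:
    concept: ObjectConcept


# A single shared "I don't know" concept, distinct from every learned concept.
UNRECOGNIZED_OBJECT = ObjectConcept("unrecognized-object")


def ablate_one_object(
    semantic_nodes: List[ObjectSemanticNode], *, evaluating: bool
) -> List[ObjectSemanticNode]:
    """At test time, relabel one randomly chosen object node as unrecognized.

    The node keeps its place in the graph, so action patterns whose slots
    require an ObjectSemanticNode can still match it; only the concept changes.
    """
    if evaluating and semantic_nodes:
        random.choice(semantic_nodes).concept = UNRECOGNIZED_OBJECT
    return semantic_nodes
```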

spigo900 commented 2 years ago

#1128 is relevant here because, if we can match color properly, it should persist better in the subset learner, giving us a "something" to fill in when we say "this is probably an apple or an orange." @lichtefeld and I plan to use the learner's patterns to fill in details based on the experimental affordance learner's "it's probably an X" outputs (#1129).
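
Concretely, I'm imagining the backfilling lookup as something like the sketch below, where the pattern representation is a hypothetical stand-in (concept -> set of color properties) rather than the real pattern graphs:

```python
# Sketch of the backfilling step: given candidate concepts from the affordance
# learner ("it's probably an apple or an orange"), read plausible color
# properties off what the object learner has retained for each concept.
from typing import Dict, List, Set


def backfill_colors(
    candidate_concepts: List[str],
    learned_color_properties: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return the plausible colors for each candidate concept."""
    return {
        concept: learned_color_properties.get(concept, set())
        for concept in candidate_concepts
    }


# e.g. backfill_colors(["apple", "orange"],
#                      {"apple": {"red", "green"}, "orange": {"orange"}})
# -> {"apple": {"red", "green"}, "orange": {"orange"}}
```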

lichtefeld commented 2 years ago

> (2) fail to match any pattern, so that the object never gets replaced and we fail to recognize the action (because the slot nodes can only match to ObjectSemanticNodes, not ObjectClusterNodes)

Any time a perception graph is observed after an ObjectLearner has processed it, all ObjectClusterNodes have already been replaced by ObjectSemanticNodes, regardless of whether we know how to describe the object. I believe an unrecognized cluster gets replaced with something like a FunctionalObjectSemanticNode, which is an ObjectSemanticNode descendant and doesn't get converted into output language (unless annotated by the FunctionalLearner).
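
Roughly, the replacement behaves like this sketch (simplified stand-in classes, not the real adam types):

```python
# Illustration of the replacement behavior described above: every
# ObjectClusterNode becomes some ObjectSemanticNode, and clusters we can't
# describe become a "functional" descendant that language output skips unless
# a FunctionalLearner later annotates it. Class names are stand-ins.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectClusterNode:
    cluster_id: int


@dataclass
class ObjectSemanticNode:
    concept_name: Optional[str]  # None means "no describable concept"


class FunctionalObjectSemanticNode(ObjectSemanticNode):
    """Placeholder node for clusters we can't describe."""


def replace_cluster(
    cluster: ObjectClusterNode, recognized_concept: Optional[str]
) -> ObjectSemanticNode:
    """Every cluster yields an ObjectSemanticNode, recognized or not."""
    if recognized_concept is not None:
        return ObjectSemanticNode(concept_name=recognized_concept)
    return FunctionalObjectSemanticNode(concept_name=None)
```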

tl;dr: Problem 2 is already mostly solved by Phase 1/2 work.

> (1) successfully match the wrong pattern/recognize the wrong thing, so no backfilling takes place

We could aim to tune a better confidence threshold: if the shape confidence is below the configured threshold, we fail and fall back to backfilling as our default rather than asserting a low-confidence answer. (Quasi precision-vs.-recall tuning.)
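
Something like this, as a sketch (the threshold value and names are illustrative):

```python
# Sketch of the confidence gate: accept the GNN's label only when its shape
# confidence clears a configured threshold, otherwise treat the object as
# unrecognized and let backfilling handle it.
from typing import Optional, Tuple


def gate_gnn_prediction(
    prediction: Tuple[str, float], *, confidence_threshold: float = 0.7
) -> Optional[str]:
    """Return the predicted concept, or None to trigger backfilling."""
    concept, confidence = prediction
    return concept if confidence >= confidence_threshold else None
```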

spigo900 commented 2 years ago

> tl;dr: Problem 2 is already mostly solved by Phase 1/2 work.

I'd forgotten about that, that's good to know.

> We could aim to tune a better confidence threshold: if the shape confidence is below the configured threshold, we fail and fall back to backfilling as our default rather than asserting a low-confidence answer. (Quasi precision-vs.-recall tuning.)

Hmm, maybe. Are we talking about the GNN's confidence outputs? I'm hesitant to rely too heavily on a neural model's confidence outputs this way: in my work on MICS I've been addressing a problem where T5's (a language model's) confidences are essentially useless by default (see Figure 1 (a), which shows that T5's question-answering accuracy is uncorrelated with its confidence). If the GNN's baseline confidence output were as meaningless as it is in that context, fixing the confidence problem would be out of scope, and thresholding might not help. But UnifiedQA does a little better in that figure, so maybe it's not hopeless. Also, the GNN is being used "as intended" in a sense, not contrived to do an entirely different task, so it might have a better shot. So maybe the GNN's confidence outputs are good enough.

lichtefeld commented 2 years ago

> So maybe the GNN's confidence outputs are good enough.

Correct, I am referencing the GNN's confidence outputs. I don't want to claim that just using the confidence values 'solves' the problem of knowing when to call a sample novel, but barring other feature extraction for objects, it's what we have to work with. Discovering that 'GNN output confidence doesn't correlate well with detecting novel object concepts' would still be a useful result, even if it leaves us no closer to solving the problem.
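
As a sketch of how we could measure that, assuming scikit-learn is available (the data wiring is hypothetical): score each test object's max confidence against a known/novel label and compute ROC AUC, where ~0.5 means the confidences tell us nothing about novelty.

```python
# Sketch: does GNN confidence separate known from novel objects?
from typing import List

from sklearn.metrics import roc_auc_score


def confidence_novelty_auc(
    max_confidences: List[float], is_known: List[bool]
) -> float:
    """AUC of using max confidence to predict 'this object is a known concept'.

    ~0.5 means confidence is uninformative about novelty; closer to 1.0 means
    thresholding on confidence is a plausible trigger for backfilling.
    """
    return roc_auc_score(is_known, max_confidences)


# e.g. confidence_novelty_auc([0.9, 0.4, 0.85, 0.3], [True, False, True, False])
```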

spigo900 commented 2 years ago

Not a reply to the above, but re:

> (1) successfully match the wrong pattern/recognize the wrong thing, so no backfilling takes place, or (2)

I'm noting a third alternative -- for in-domain objects like apple or orange, (3) the GNN might well recognize the object correctly, in which case we have nothing to backfill.