isi-vista/adam

Abduction to Demonstrate an Articulate Machine

Reworking affordances #1127

Open spigo900 opened 2 years ago

spigo900 commented 2 years ago

There are some difficulties with our current approach to affordances. Because we rely on visual features to define our patterns, our affordance patterns are likely to end up overbroad: apple and banana don't share visual features we could pick out as defining a "can be eaten" affordance, so any pattern built from their shared features would match far more than just edible things. We would like to try a different approach.

Memorizing approach

We discussed this on Monday; here is a summary of that discussion:

  1. Idea: Affordances don’t properly correspond to observable visual features; they correspond to names (object concepts). The system should work the same way.
  2. Specifically, when we observe “a person eats an apple”, we memorize the slot fillers person and apple as being able to take part in eats in the appropriate slots.
    1. That is, we’re filling in “rows” of a relational table where each row relates an object concept like "apple" to a slot in an action concept (sketched in code after this list), e.g. a row might be:
      1. (object "person", slot1, action concept "slot1_eats_slot2"), or
      2. (object "apple", slot2, action concept "slot1_eats_slot2"), or
      3. (object "apple", slot2, action concept "slot1_throws_slot2"), …
  3. “Recognizing affordances for an object” means querying the table on the object part, say "apple".
  4. We need to do "backfilling." That is, given, say, a black-and-white photo of someone eating an apple, there's no color information, but we'd like to use the features we do have to pick out some possibilities and say that the object might plausibly be red, green, or orange. In theory we could do this through affordances.
  5. “Backfilling”, as in the black-and-white photo of someone eating an apple, might look like querying the table on the action/slot part, say (slot2, action concept "slot1_eats_slot2").
    1. This is (theoretically) how we would say “well that might be an apple”.
    2. We might want to use our graph patterns here to filter out bad candidates, e.g. with the photo we would want to be able to say it’s probably not a banana because the shape is wrong (see the second sketch after this list). That specific example might be beyond what we can fix in the time we have left, though, since shape is hard.
  6. Then we would introduce these affordances and backfilled possibilities into our output somehow, maybe as a post-processing step.
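
To make the table and both query directions concrete, here is a minimal sketch of what the memorizing approach might look like, assuming a plain in-memory table. All the names here (`AffordanceRow`, `AffordanceTable`, `observe`, `affordances_of`, `backfill`) are made up for illustration, not existing ADAM APIs:

```python
from __future__ import annotations

from collections import defaultdict
from typing import NamedTuple


class AffordanceRow(NamedTuple):
    """One row: an object concept filling a slot of an action concept."""

    object_concept: str
    slot: str
    action_concept: str


class AffordanceTable:
    """Stores memorized rows and supports lookups in both directions."""

    def __init__(self) -> None:
        # Index by object concept, for "recognizing affordances for an object".
        self._by_object: defaultdict[str, set[AffordanceRow]] = defaultdict(set)
        # Index by (slot, action concept), for "backfilling".
        self._by_slot_action: defaultdict[tuple[str, str], set[str]] = defaultdict(set)

    def observe(self, object_concept: str, slot: str, action_concept: str) -> None:
        """Memorize one slot filler from an observed scene."""
        row = AffordanceRow(object_concept, slot, action_concept)
        self._by_object[object_concept].add(row)
        self._by_slot_action[(slot, action_concept)].add(object_concept)

    def affordances_of(self, object_concept: str) -> set[tuple[str, str]]:
        """Query on the object part: which (slot, action) pairs has this
        object concept been seen to fill?"""
        return {(r.slot, r.action_concept) for r in self._by_object[object_concept]}

    def backfill(self, slot: str, action_concept: str) -> set[str]:
        """Query on the action/slot part: which object concepts have been
        seen filling this slot of this action?"""
        return set(self._by_slot_action[(slot, action_concept)])


# Observing "a person eats an apple" and "a person throws an apple":
table = AffordanceTable()
table.observe("person", "slot1", "slot1_eats_slot2")
table.observe("apple", "slot2", "slot1_eats_slot2")
table.observe("apple", "slot2", "slot1_throws_slot2")

# Recognizing affordances for "apple":
assert table.affordances_of("apple") == {
    ("slot2", "slot1_eats_slot2"),
    ("slot2", "slot1_throws_slot2"),
}

# Backfilling: "what might the eaten thing be?"
assert table.backfill("slot2", "slot1_eats_slot2") == {"apple"}
```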
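
And a rough sketch of the filtering idea from point 5, building on the `AffordanceTable` above. The predicate `pattern_plausibly_matches` is a hypothetical stand-in for whatever graph-pattern check we would actually run against the percept:

```python
from typing import Callable


def filtered_backfill(
    table: AffordanceTable,
    slot: str,
    action_concept: str,
    pattern_plausibly_matches: Callable[[str], bool],
) -> set[str]:
    """Backfill, then drop candidates whose stored object pattern clearly
    conflicts with what we can see (e.g. "probably not a banana: wrong shape").
    """
    return {
        concept
        for concept in table.backfill(slot, action_concept)
        if pattern_plausibly_matches(concept)
    }
```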

Difficulties

There are at least two difficulties with this approach:

  1. This approach means we can't use affordances as features in contrastive object learning. That's because object patterns can't incorporate affordances without circularity: the affordances an object has depend on what kind of object it is, and the patterns are what we use to determine the kind of object in the first place.
    1. Note we could still contrast objects in terms of the affordances they have; the point here is simply that this information could not be put into the object patterns.
  2. Since we're not defining affordances in terms of visual features, they won't generalize to unknown objects unless we classify those unknown objects as actually being some known object. (If we classify an unknown object as a known one, we'll just apply the affordances of that known object.) To the extent this is a problem, we would need a way of handling unknown-and-unclassified objects.