isi-vista / adam

Abduction to Demonstrate an Articulate Machine

Affordance Learner #1111

Closed lichtefeld closed 2 years ago

lichtefeld commented 2 years ago

For Phases 1 & 2, ADAM's learner used proto-semantic roles to determine which objects could fill the SLOTs of an action pattern. While there was previously some work on extracting this set of proto-roles from video, that work doesn't align well with Phase 3's goal of feature discovery. We may still need some notion of these 'proto-roles' to distinguish between the 'agent' and 'theme' of an action, but I'm setting aside that portion of the feature to look at action affordances: that is, the ability of a ball to be rolled (e.g. 'rollable') as observed in the 'roll' action, and the ability to determine which of the object's features enable this affordance.

The remainder of this issue has two parts: the first explains the high-level learning process I'm currently envisioning (which has been loosely discussed before in presentations/meetings) and the second covers some more technical details of the implementation.

Affordance Learning

As described above, an affordance for the purposes of this learner can be thought of as an ability of an object to participate in a given action. For clarity, 'participate' in the action can mean in multiple different semantic roles; e.g. both a person and a ball are 'rollable'. If we wanted to also encode the semantic distinction, e.g. 'rollable' vs. 'can roll', we could in theory do so, but I'm unsure immediately how we'd have ADAM discover the semantic difference without some kind of proto-role information available as a feature. We can't rely on the agent being slot_0 and the theme being slot_1 as a representation hack either, so I'm open to potential ideas here. For the moment I'm going to stick with the idea of learning affordances of the form 'rollable'.

What does the learner look like

I'm envisioning the affordance learner as a new composable learner, as the goal of the affordances is to find a collection of existing features that enables the affordance. For example, the 'rollable' affordance would ideally be identified by features within the stroke graph that enable this functionality. Each affordance concept maps to a feature node, which is a categorical feature with the category "affordance" and whose set of values expands as new actions are observed.
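For concreteness, something like the sketch below is what I have in mind (class and method names here are illustrative, not ADAM's actual API):

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class CategoricalFeatureNode:
    # A perception-graph node carrying a categorical feature value.
    category: str
    value: str


@dataclass
class AffordanceFeatures:
    # The set of affordance values expands as new actions are observed.
    observed_values: Set[str] = field(default_factory=set)

    def node_for(self, affordance_id: str) -> CategoricalFeatureNode:
        self.observed_values.add(affordance_id)
        return CategoricalFeatureNode(category="affordance", value=affordance_id)
```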

How and where in the pipeline will learning occur

Learning affordances, if the learner is enabled, will occur after the action learner has decoded any actions for the scene. The affordance learner will take in the matched action and generate a feature for it. This feature will be an incrementing ID of the form 'affordance_X', where X is the number of actions seen so far; the affordance for a given action will always map to the same affordance ID. Learning for an affordance will then take the object pattern for the object aligned to a given slot in an action and replace the object cluster with a SLOT filler. The description of the object will then be subset across all examples of objects which participate in the same slot of the given action. This subsetted pattern can then be matched in the original perception graph before action matching occurs, to indicate that other objects share this same affordance even if we've never observed the novel object being rolled before.
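Roughly, the learning step would look like the following sketch, where the helper callables stand in for our existing pattern machinery (all names illustrative):

```python
from typing import Callable, Iterable, Optional


def learn_affordance_pattern(
    perception_graphs: Iterable,          # graphs in which the action was matched
    pattern_for_slot_filler: Callable,    # object pattern for the slot's aligned object
    replace_cluster_with_slot: Callable,  # swap the object cluster for a SLOT filler
    intersect: Callable,                  # subset/intersection of two patterns
):
    # Subset-learn the pattern for one (action, slot) affordance: start from the
    # first example and keep only what every later example shares.
    hypothesis: Optional[object] = None
    for graph in perception_graphs:
        candidate = replace_cluster_with_slot(pattern_for_slot_filler(graph))
        hypothesis = candidate if hypothesis is None else intersect(hypothesis, candidate)
    return hypothesis
```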

Ideally the learning of this feature can be propagated into the concept for the given action.

Where will decode take place

Decode will take place after initial object recognition, as we'll need an ObjectSemanticNode to fill the SLOT created in the affordance pattern. In practice this will probably occur just before action learning for simplicity. This process will change the perception graph by adding a new type of node that we haven't previously added during this process: rather than just adding a semantic node, we'll be adding a node designed as a potential identifying feature.

Technical Implementation

Adding to Integrated Learner

I'm not thrilled to be adding to the integrated learner, but I think this is going to be the best route for this module. We need these affordances available within the learning process of other learners.

Debug Vs. real affordance IDs

I'm considering implementing the unique value for affordances differently between a debug mode and a 'real' mode. In debug mode the affordance ID would be generated in the form "able to X", where X is the action that triggered the creation of this affordance, while in 'real' mode the affordance ID would take the form "affordance_Y", where Y is a running count of the actions seen. In theory this provides a way to replace "affordance_Y" with the actual name (e.g. 'rollable') in the future. While I'm not using these names as anything other than IDs, I suspect that if I present to DARPA with "able to X" as the affordance value they'll complain I'm leaking data into the system, even though ADAM doesn't use that value for anything past a string comparison. Therefore, sticking with the "affordance_Y" approach for real data would be preferred in demo situations.
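A minimal sketch of the two ID schemes (names illustrative):

```python
import itertools
from typing import Dict


class AffordanceIdGenerator:
    # Issues a stable ID per action: the same action always maps to the same ID.
    def __init__(self, debug: bool = False) -> None:
        self._debug = debug
        self._counter = itertools.count()
        self._ids: Dict[str, str] = {}

    def id_for(self, action_name: str) -> str:
        if action_name not in self._ids:
            if self._debug:
                # Debug mode: human-readable, e.g. "able to roll".
                self._ids[action_name] = f"able to {action_name}"
            else:
                # Real mode: opaque running count, e.g. "affordance_0".
                self._ids[action_name] = f"affordance_{next(self._counter)}"
        return self._ids[action_name]
```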

Potential issues

spigo900 commented 2 years ago

I think this makes sense for the most part. I have a few comments below.

Affordance learning

I am sort of confused by the discussion of proto-roles, so to clarify: What we are not allowed to do is assume certain roles/proto-roles for certain slots. We are allowed to assume the slots are meaningfully connected to objects, and to distinguish between the way that a person is "rollable" and the way that a ball is "rollable" -- i.e. we can distinguish between "thing that can roll a ball" (maybe this has a debug ID like affordance(slot0_rolls_slot1)_slot0) and "thing that can be rolled" (maybe affordance(slot0_rolls_slot1)_slot1). This would mean A_j affordances per action, where A_j is the number of slots present in action j. What we forbid is specifically smuggling in a generalization of this that says "thing that can roll a ball" fits some category called "agent."

I'm basing this distinction on the initial discussion plus the discussion in "How and where" which implies we're treating slots separately.

What does the learner look like

Is there a reason we don't want to represent affordances as semantic nodes? I suppose it's mainly keeping the graph small to maintain match speed. The affordances aren't distinct enough to be worth representing as completely separate nodes in the graph because they have such similar roles -- they'll all be connected to the relevant object and nothing else. So for performance reasons maybe it makes sense to collapse them into a single categorical feature node.

Technical implementation

Adding to integrated learner

Adding to the integrated learner does seem like the most convenient place to add things, and I don't see an obvious better solution. This is less than ideal, as you say, because the integrated learner is complicated enough already. But on the other hand I don't see a nice way to slot the affordance learner into the pipeline after the action learner, or at least not without adding to the integrated learner's public interface in some other way. It's especially tricky because the affordance learner has to run after the action learner at "train" time (otherwise it has nothing to learn from), but before the action learner at describe time (otherwise it's cheating).
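To make that ordering constraint concrete, something like this (all names hypothetical):

```python
def observe(perception, action_learner, affordance_learner):
    # Train time: the action learner goes first so the affordance learner
    # has matched actions (and their slot fillers) to learn from.
    action_matches = action_learner.learn_from(perception)
    affordance_learner.learn_from(action_matches)


def describe(perception, action_learner, affordance_learner):
    # Describe time: affordance decode enriches the perception graph first,
    # so action matching can make use of the affordance feature nodes.
    enriched = affordance_learner.enrich(perception)
    return action_learner.describe(enriched)
```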

Debug vs. real affordance IDs

Distinguishing between debug IDs and real IDs sounds like a good idea. affordance_Y is probably better for demos, but it's hard to tell whether behavior is reasonable or buggy when debugging, so it makes sense to have a separate debug ID. It should be simple enough to distinguish them and prevent information leaks -- we just won't use or show the debug IDs anywhere except in debug logging.

lichtefeld commented 2 years ago

I'm basing this distinction on the initial discussion plus the discussion in "How and where" which implies we're treating slots separately.

Correct, I'd generate affordances in that way. Ideally we'd probably introduce a step which could combine affordances generated in that way when they amount to the same distinction. E.g. "person rolls" is slot0_rolls and "person rolls a ball" is slot0_rolls_slot1, so affordance(slot0_rolls)_slot0 and affordance(slot0_rolls_slot1)_slot1 should, in theory, be able to collapse into one affordance representing "the ability to roll" regardless of the agency of the object involved in the roll. But I think this level of generalization is out of scope for this phase.

Is there a reason we don't want to represent affordances as semantic nodes? I suppose it's mainly keeping the graph small to maintain match speed. The affordances aren't distinct enough to be worth representing as completely separate nodes in the graph because they have such similar roles -- they'll all be connected to the relevant object and nothing else. So for performance reasons maybe it makes sense to collapse them into a single categorical feature node.

I think once they are able to be lexicalized they should be added as semantic nodes, but until then, just adding a new categorical node enables the affordance to be part of a pattern.

spigo900 commented 2 years ago

That all makes sense to me.

lichtefeld commented 2 years ago

So I've run into a problem with implementation. As best as I've been able to tell, the step that generates candidate perception graphs is disjoint from the step that generates the potential name. This causes a problem because this learner has no token to align to (e.g. ADAM never sees the word "rollable"), so we can't use that token as an alignment value during training. One potential solution would be to define a new abstract learner that works in this "learning from non-aligned tokens" space, as in theory any number of learners could be designed to operate over this non-token learning space (and by some measures the Generic/Semantic/etc. learners already do).

So, given that I'd like to avoid reimplementing any number of computation steps already handled in our abstract base classes, I'm trying to come up with a new solution but don't currently have a good one. The only immediate solution I have is a major hack where, rather than learning affordances for all SLOTs in an action phrase, we choose a single SLOT of an action template with a given number of slots to learn affordances for. So for a one-slot action ("person rolls"), slot 1 gets the affordance we try to learn. For a two-slot action ("person rolls ball"), slot 2 gets the affordance we try to learn. However, this solution is definitely a cheat that exploits knowing how English is structured to get around a technical hurdle, and I'm unsure it's the correct route.
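For concreteness, the hack amounts to a last-slot heuristic:

```python
def slot_to_learn(num_slots: int) -> int:
    # The hack: always learn the affordance of the last slot, so "person rolls"
    # yields slot 1 (the roller) and "person rolls ball" yields slot 2 (the
    # rollee). This only works because of English argument order.
    return num_slots
```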

spigo900 commented 2 years ago

Hmm, this sounds annoying.

Tell me if this sounds right. The problem is: picking out perception patterns and picking out surface templates are disjoint steps. First we pick out the candidate templates, then we try to learn a meaning for each one. But this is a problem for the affordance learner because we don't have any language that it makes sense to use for the candidate template. The input is, say, "A person rolls a ball". No part of this language references either of the relevant affordances, namely the affordance of being able to roll an object (~agency) and the affordance of being rollable. We're also not necessarily trying to learn to express the concept: we'd like to be able to learn some concept of "rollable" from just observing many objects being rolled, without necessarily needing to learn to express "X is rollable".

This is a problem because we'd still like to reuse parts of the template learning code. While describing in language doesn't make sense for the affordance learner, other parts do. The scene preprocessing logic probably still makes sense. The logic of enriching the perception graph with affordance pattern matches still makes sense.

The hierarchy of learners is already complicated and we don't want to make it worse, so we don't want to split the template learner into a parent and child class.

I'm not sure I understand the "one action, one slot-affordance" proposal. Some questions:

  1. Why would we want to return just one? Is there a problem with returning overlapping templates?
  2. The "cheat" part of this would be the heuristic we smuggle in for which slot to learn from? .g. the last one. A heuristic like that would be a problem because we'd be relying on English-specific structural tendencies (like "usually the agent comes first if there is one") to pick an interesting slot to learn about. Or did I misunderstand?

This is maybe even worse, but what if we forced the templates to be distinct by appending the slot we're trying to learn? So basically we'd extract from "A person rolls a ball" two candidate templates: SLOT1 rolls SLOT2 SLOT1 (for the agent part of rolls) and SLOT1 rolls SLOT2 SLOT2 (for the "rollee" or object part of rolls). That would let us learn from the language. The main problem I see is that we'd be learning some funky descriptions which might leak into the output. This would also cause some weirdness with surface templates (num_slots assumes each variable present in the template is unique), but that's fixable.
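Sketching that template-forcing idea with templates as token lists (the representation here is hypothetical):

```python
from typing import List, Sequence


def slot_marked_templates(tokens: Sequence[str], slots: Sequence[str]) -> List[List[str]]:
    # Force distinct candidate templates by appending the slot of interest.
    return [list(tokens) + [slot] for slot in slots]


# slot_marked_templates(["SLOT1", "rolls", "SLOT2"], ["SLOT1", "SLOT2"])
# -> [["SLOT1", "rolls", "SLOT2", "SLOT1"], ["SLOT1", "rolls", "SLOT2", "SLOT2"]]
```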

I'll come back to this tomorrow and see if I can think of anything else.

lichtefeld commented 2 years ago

Why would we want to return just one? Is there a problem with returning overlapping templates?

I don't think we want to return just one. We can return overlapping templates (I think), but there are a lot of "hey, we're ensuring we cover the entire utterance because not doing so caused other issues" notes in this section of code. :) I'd have to go reread some issues or old notes to fully remember why the choice was made to ensure full-utterance learning.

The "cheat" part of this would be the heuristic we smuggle in for which slot to learn from? .g. the last one. A heuristic like that would be a problem because we'd be relying on English-specific structural tendencies (like "usually the agent comes first if there is one") to pick an interesting slot to learn about. Or did I misunderstand?

Correct, the cheat would be embedding English-specific structural tendencies to extract the correct affordance we want to learn. We still don't have a name for it, but it ensures the code doing the extracting makes the 'correct' choices when generating the hypothesis predicate pattern.

This is maybe even worse, but what if we forced the templates to be distinct by appending the slot we're trying to learn? So basically we'd extract from "A person rolls a ball" two candidate templates: SLOT1 rolls SLOT2 SLOT1 (for the agent part of rolls) and SLOT1 rolls SLOT2 SLOT2 (for the "rollee" or object part of rolls). That would let us learn from the language. The main problem I see is that we'd be learning some funky descriptions which might leak into the output. This would also cause some weirdness with surface templates (num_slots assumes each variable present in the template is unique), but that's fixable.

So SLOT1 rolls SLOT2 SLOT2 is the type of 'surface template' I'd intend to return. The problem is: how do I ensure the surface pattern gets aligned to the extraction of SLOT2's feature space for affordance searching, and not aligned to SLOT1? At the point where the pattern itself is extracted, I don't know the 'surface template' we're aligning to. (This may be solvable if I look more into the learning code itself, but I haven't finished that deep a review to answer for sure.) I can work around funky output by just not implementing the linguistic output of the AffordanceLearner. Solving the problem of having ADAM align an internal hypothesized pattern to a new language description is a future problem (and more in the language domain than the feature-learning domain).

spigo900 commented 2 years ago

Okay, I think that makes sense.

... embedding english specific structural tendencies ...

At the point where the pattern itself is extracted ...

So if I read these two parts right, the problem is we'd be exploiting the tendencies to figure out how to align the slot to perception?

intending to return

Don't we know the affordance surface pattern when generating hypotheses? The template learner passes it into learning_step(). _hypotheses_from_perception() (not technically part of the template learner, but part of the subset and pursuit ABCs) also seems to accept the bound template as a parameter. Is the problem that we need to match the corresponding action pattern so we know which object is slot1?

lichtefeld commented 2 years ago

So if I read these two parts right, the problem is we'd be exploiting the tendencies to figure out how to align the slot to perception?

That would be the intent... but I'm not sure it would entirely work either.

Don't we know the affordance surface pattern when generating hypotheses? ... Is the problem that we need to match the corresponding action pattern so we know which object is the slot1?

Yes, that's the root of the problem. There's no way in the affordance learner to ensure the ObjectSemanticNode we choose as the correct viable hypothesis is the ObjectSemanticNode that matches the slot-filler alignment for the slot of interest in the affordance.

spigo900 commented 2 years ago

Hmm, this is tricky.

The ActionSemanticNodes store their slot fillers. Could the affordance learner somehow get the appropriate concept(s) from the action learner, then use that to find the right semantic node(s), then finally use that information to pick out the slot fillers? Sounds gross to write, but it might get around the hack.
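Roughly this kind of lookup (attribute names like slot_fillings are guesses, not the real field names):

```python
def slot_fillers_for(action_concept, semantic_nodes):
    # Keep only action nodes whose concept matches, then read off the slot
    # fillers those nodes already store (slot name -> ObjectSemanticNode).
    fillers = {}
    for node in semantic_nodes:
        if getattr(node, "concept", None) == action_concept:
            fillers.update(node.slot_fillings)
    return fillers
```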

lichtefeld commented 2 years ago

Yeah -- if I were going to generalize it, I'd define something like a 'Meta Learner' class which takes in an existing concept alignment and uses information from it in the learning process. It's unclear to me if it's worth the extra level of generalization; however, I may go along the same pathway and define the AffordanceLearner to be a part of the action learner. It can function by running decode before the action learner and running learning afterwards. This will also make the job of propagating the new 'affordance node' into the action predicate graph easier.

The only downside is that I introduce some coupling of components, but that seems to be inherent in the problem itself, so I'm unsure I can easily separate the two. (That said, it would be implemented in a way where a VerbLearner can run without an AffordanceLearner but not the other way around.)
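Skeletally, something like the following (interfaces illustrative; the private methods are placeholders for the existing verb-learner logic):

```python
class VerbLearner:
    # Sketch: the action learner optionally owns an affordance learner. A
    # VerbLearner can run without one; the reverse is not possible.
    def __init__(self, affordance_learner=None):
        self._affordance_learner = affordance_learner

    def describe(self, perception_graph):
        if self._affordance_learner is not None:
            # Decode first: add affordance feature nodes to the graph so
            # action matching can see them.
            perception_graph = self._affordance_learner.enrich(perception_graph)
        return self._match_actions(perception_graph)

    def learn_from(self, perception_graph, alignment):
        action_matches = self._learn_actions(perception_graph, alignment)
        if self._affordance_learner is not None:
            # Learn afterwards, from the slot fillers of the matched actions.
            self._affordance_learner.learn_from(action_matches)
        return action_matches

    def _match_actions(self, perception_graph):
        ...  # placeholder for the existing action-matching logic

    def _learn_actions(self, perception_graph, alignment):
        ...  # placeholder for the existing action-learning logic
```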

@spigo900 I think this just formalizes your proposed idea?

spigo900 commented 2 years ago

@lichtefeld

I think this just formalizes your proposed idea?

Yes, that sounds about right.

I agree it seems to make sense to embed the AffordanceLearner into the action learner. As you said, the coupling seems inherent in the task of affordance learning. The one-way-ness of the dependency makes sense to me.