lichtefeld opened this issue 2 years ago
@lichtefeld I have questions about the details. I think you suggested doing contrastive learning for objects first to simplify things, so I will frame my questions in those terms. Say we have 3 object concepts (I think this is the right number for the current curriculum?), with 10 samples each, so 30 distinct samples total. Then my understanding of the above is:

- We train a learner (a single learner for all 3 concepts?) on part of the (situation, language, perception) samples.
- We take concepts.choose(3, 2) = 3 pairs of concepts, and we somehow (?) use the remaining 15 samples not used for training. Say we're contrasting a concept cube with a concept sphere.
- Are we pairing each sample of cube with exactly one sample of sphere? If so, how is that matching chosen? Or, are we looking at all pairs -- so all 25 possible pairs of cube samples with sphere samples? Is this a matter of "to be decided"? If we're not pairing samples, then what are we doing?
- Say we have our (cube sample, sphere sample) pairs, however we choose those. What exactly are we doing with them? My guess is that first we want to run the learner we trained on the other samples. We get the learner's augmented graph output for the cube perception and for the sphere perception, then we want to compute a symmetric difference between those graphs. Is that right?
If so, we end up with 3*K symmetric differences, where K is the number of sample-pairs per concept-pair. Are these 3*K differences the things that we want? I suspect not.

My guess is that the answer to the last two questions ("3*K differences," "why a new pipeline") is: "we want to train a new kind of learner to do (something) with contrastive samples." Is this guess correct?

If so, here's my guess about the (something): This new learner would be applied to learn patterns for symmetric differences in concepts, and it would learn from symmetric differences between the paired perception graphs. Is that right?
And now that I think about it, the new learner (if there is one) has to work differently from the "normal" learners (for objects, relations, actions, etc.), because we don't have a text description of the difference. So in that sense it probably needs to be less like the "normal" learners and more like the "special case" learners (generics, the functional learner).
@spigo900 (responding to your newest comment first) Correct, I think this is a special-case learner, but perhaps more like the concept learners, as it may not make much sense to learn a symmetric difference between a cube and person throws ball. So we could have multiple instantiations of this special-case learner, one for each concept type (focusing on objects and actions for Phase 3).
Here are my thoughts/answers on your previous questions:
2.i: Correct, we'd train a single learner on 3 concepts.
3.ii: We are pairing samples up; the exact method is 'to be decided'. My suggestion is we'd present 5 samples of a cube-sphere pair where the cube and sphere come from the held-out samples. It's possible a sample never gets presented to the learner in this model, or it's possible the learner sees the same sample for cube 5 times against different samples for sphere. That's fine.
3.iii: Yes, I'd run the previous learner's decode on the scene to see if we correctly identify the objects first, then compute the symmetric difference as described.
3.iv / 4: Rather than just the 3*K difference sets, I think we'd want to look at the specific features for each sample k in the set and use this information to provide a 'weight' (maybe better thought of as an 'importance' value) for each node in the pattern for a given concept. Additionally, these differences may be useful in presenting a feature as "important to describe 'cube'" even if it's not needed for the specific comparison seen at evaluation time. This requires an entirely new training pipeline if we want to use each sample of the triple (situation, language, perception) and not combine them into one sample. If we combine them into one sample, we need some way to identify which object is the cube and which is the sphere for the learning process, and we're taking advantage of the curriculum (assuming the instructor never lies) to give us the correct label for a set of features.
Option 2 for this approach is to train a different model entirely based on the resulting differences, but I've not given that enough thought (or research) to comment on its viability. That said, the (something) idea is an interesting approach and, given we can identify object clusters in a perception graph even when not given a linguistic label, could be an interesting way to determine how to describe a novel object as "different than a cube by X", "different than a ball by Y", etc.
Let me know if I need to clarify anything in these responses.
@lichtefeld I agree it makes sense to have different instantiations for different kinds of concept.
Thank you for the answers. 2.i, 3.ii and 3.iii are clear now. I have more questions re: the 3.iv / 4 response which I'll write up tomorrow morning.
@lichtefeld So first, here are some things I think we agree on (correct me if I'm wrong):

- We learn from paired samples, i.e. (cube sample, sphere sample) pairs.
- We do not combine each pair into a single sample like (cube situation + sphere situation, cube language + sphere language, cube perception + sphere perception), since that doesn't make sense, would confuse all the other learners, etc.

Now, some things I'm less sure about (questions!). I split this up into sections, which I hope makes them easier to think about/respond to.
I was confused about "look at the specific features for each sample k", but here's my interpretation. What this means is that we want the learner to look at (a) a matching of the graphs, which the difference graph falls out of and which is strictly more informative (so we can distinguish "this came from cube" from "this came from sphere"), and (b) the specific nodes (or connected components?) in the difference, rather than treating the graph as a black box. (If that's wrong, let me know. I can imagine other interpretations.)
For concreteness, here's an example in the symbolic framework. Maybe we would observe that the difference between a table and a chair is that a chair has a back, and usually a chair's legs are shorter than a table's legs (pretending we had that latter level of detail). Maybe sometimes we can't see the back, so it's not in the difference graphs, but when it does show up it guarantees (in distribution) that what we're looking at is a chair, so this should have a high importance weight. On the other hand, if the legs are not shorter then that's fine -- maybe the table is a coffee table. That means the leg-length feature should have lower weight. (Hopefully the difference in leg length causes the chair and table legs not to match, so they show up in the difference graph and we can notice that difference.)
Here's one idea for what the importance might be: in the context of a difference between cube and sphere, how often has node X been in the difference, out of all the times that node X was present in at least one of the graphs?
(This is one reason why we want to look at the matched pair, not just the difference: If we only look at the difference, then we can't distinguish between "we couldn't see the back of the chair" and "the chair and the table both (for whatever strange reason) had backs." The other reason is so that we know which pattern to push the importance weight into.)
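To make that concrete, here's a rough sketch of the counting rule I mean. NodeKey is just a placeholder for however we end up identifying "the same node" across samples, which is exactly the hard part discussed further down:

```python
from collections import Counter
from typing import Hashable, Iterable

# Placeholder: however we end up identifying "the same node" across samples.
NodeKey = Hashable


class NodeImportance:
    """Track, per concept pair, how often a node lands in the symmetric difference."""

    def __init__(self) -> None:
        self._in_difference: Counter = Counter()
        self._in_either_graph: Counter = Counter()

    def observe(self, difference: Iterable[NodeKey], union: Iterable[NodeKey]) -> None:
        """Update counts from one matched (cube sample, sphere sample) pair.

        `difference` holds the nodes in the symmetric difference of the two
        (enriched) perception graphs; `union` holds the nodes present in at
        least one of them.
        """
        self._in_difference.update(difference)
        self._in_either_graph.update(union)

    def importance(self, node: NodeKey) -> float:
        """Fraction of observations in which the node distinguished the concepts."""
        seen = self._in_either_graph[node]
        return self._in_difference[node] / seen if seen else 0.0
```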
If I understand right, this assessment of node importance is something we want to push down into the patterns of the other learners while we are learning. Assuming so, I suppose this is one of the reasons you wanted to be able to mutate the pattern graphs? Also, this is something we'd do after observing each pair of samples?
The other thing we might want to do is add "important" (based on n>1 samples) missing nodes back into the pattern, with extra info to flag them as "match optional." I'm not sure about the details and will probably have more questions later, but I just want to confirm this basic picture is correct.
Both of these will probably require going backwards from the matched graph pair to the patterns that the object (or other) learner recognized in each of them.
On a different line of thinking: The contrastive learner needs to know the relevant concept for the paired samples it is learning from. But concepts are internal to the learners and not part of the language (which is just text), the unprocessed perception, or the situation. So this must be something we get by running the lower-level learner, right?
These are mostly unrelated to the previous discussion and hopefully easier to answer: (x) whether the contrastive learner is essentially a meta-learner, (y) whether we use it at test time, and (z) whether the object cluster node is still replaced by an ObjectConceptNode (I think it was called).

Oh, one more question: The top suggests an experiment. What is it experimenting with/on? What comes out of that experiment? What are the "deliverables" -- what numbers, charts, PDFs, etc. do we hope to get from the experiment? My guess is that we would want to look at how the pattern importance values change/what nodes get added as a result of contrastive learning. We would probably want to look at these in the specific -- since there are so few concepts, there's no need to aggregate because we can just look at the data itself. So we might produce, for each of the concepts, a modified graph visualization that labels each pattern node with the difference between its final and initial weights. But there could be other things.
I've organized my response with similar headings to yours. I agree with the four bullet points you start off with so we are on the same page there.
a) Correct, we want the contrastive learner to look at the difference graph as more informative, which requires knowing which concept the distinguishing features came from. b) For the nodes (or connected components), I'm operating under the assumption that "the learner knows that connected to the 'root' of the concept are all the potential features which contribute to the identification of the concept." The graph is treated less as a 'black box' in this example and more as a data structure with a known value type. The learner can know it is operating over a set of features.
I think the simple model of importance you've described is a good starting point.
Both of these basic ideas match my original thoughts on how to take advantage of the importance information. I'd think we would want to push the information back into the appropriate patterns for a concept after every difference evaluation, yes. It may mean we also need to pass a reference to the patterns down the pipeline to make this access easier.
We will assume our curriculum design never misidentifies a concept when described linguistically. That is, if the language for a scene is 'sphere' then a sphere concept is going to be present in the scene. With this assumption we can use the linguistic phrase as our label for each of the two samples in a pair. Programmatically it probably makes more sense to ask the first-stage learners for the concept which matches the input linguistic phrase, but we can identify the concept from the linguistic label rather than needing to identify the objects in the graph correctly first.
x) Correct, I think the contrastive learner is a meta-learner that influences how other learners make their decisions.
y) I don't think we can use the contrastive learner at test time, other than perhaps to generate the feature difference lists for display purposes (DARPA would like to see such a list if possible).
z) I don't recall the answer to this. I don't think they do, because I believe I'm pulling the current feature list for the viewer from the graph and strokes are present there, but you'd have to double-check the implementation to be sure. The ObjectClusterNode is still replaced by an ObjectConceptNode (required currently if we want to be able to have a recognized object participate in an ActionConcept).
As far as deliverables go, what you've described is what I envisioned as the easy "here's how taking advantage of difference detection impacts our ability to define an object" result. Hopefully system accuracy at identifying concepts at test time is also increased. I agree that the best representation of this improvement is more of a qualitative analysis of the pattern graphs rather than a strictly quantitative result.
--
These are all good questions. Hopefully I've answered them or at least provided feedback that's useful enough to continue the conversation on a topic :)
@lichtefeld Thanks! Those were good answers. I think that all makes sense.
I have one more question about mapping concept signals to the graph: It seems reasonable to assume the description describes something actually in the scene, but I'm confused about how we know what that something is if the scene may contain additional objects that aren't described. Can the scene have other objects, or are we assuming no "background noise"?
I may have another question about aligning the language with concepts consistent with "ground truth", but I looked into it and I think I have an answer about how we do that in the form of enrich_during_learning(), so maybe not.
> I have one more question about mapping concept signals to the graph: It seems reasonable to assume the description describes something actually in the scene, but I'm confused about how we know what that something is if the scene may contain additional objects that aren't described. Can the scene have other objects, or are we assuming no "background noise"?
We're really leaning into the ability to design the curriculum explicitly for the learner so we can assume essentially no "background noise" past being able to handle the signal processing. In a scene designed to train "sphere" for example we won't also have a 'ball' present.
That said, in principle, we could probably handle a scene with background objects, as we do have a way to extract a sub-perception graph for an object concept (even if I need to go remind myself exactly how to do that). We could essentially transform a single input scene into multiple potential perception graphs and use the learner decode as the labels for the contrastive learner. While this would be possible in theory, I think we can just have it available as an explanation of 'how to train with noisy real-world data' rather than an explicitly curated curriculum.
Okay, so we assume that there is no background noise because we can and have designed out the background noise. We probably have a way of handling them if we need to, but there's no need at this stage. That all makes sense.
I think there are some technical details left to work out, but I think I have enough information now that I can start thinking about a design, so I will start thinking about that, and hopefully get a draft to you within the next few working days.
ETA: that was a lot of thinks.
@lichtefeld I've drafted a design for contrastive learning. I'll wait for your feedback before moving ahead with this.
Also, not immediately relevant, but a question about actions. How would you pick out the concept for action samples, given the language might be different while the concept is the same? Say "a cube falls" vs. "a sphere falls". I have thought of a simple workaround for this case, but I'm curious if curriculum design has changed/can change such that this doesn't pose a problem anymore.
I made a list of some complications I've run into so far. I marked with ✅ the one "thread" and solution that seem straightforward to resolve. I'm marking with ❓ threads that I haven't resolved yet, with 🚨 for problems, ➡️ for "more on those problems", and ⭐ for possible solutions. (Let me know if this is helpful. I marked it because I was having trouble seeing the structure in this sea of text.)
Sorry this is long. It may make sense on Monday for me to write a shorter version.
Here's the list:

- The nodes are plain GraphNodes, thus do not actually have weights; we'd presumably have to add a weight to NodePredicate.
- The matching happens in ._enrich_common(), since that's the part that does the matching, and that method throws away the match information. It's also not a public method even if it didn't.
- match_template() notes, "The template may have... many matches... with the exception of the object learners, some matches may be essentially identical to each other." I suspect the object-learners exception is under the old assumption that we're replacing the object subgraphs, and we are not doing that anymore, so this complication probably applies even though we're only looking at object learners.
- concepts_for_patterns() does not return everything we need to know if we want to propose updated hypotheses.
- Because the pattern nodes use eq=False, if we replace the patterns then we have to identify parts of the graph through some means other than "identity": if we store "the original node" in the counter/dictionary, then after we update the pattern the counting key is no longer good. Next time we try to look up "that node" (edge) in the counter, we won't find anything, because the new pattern node (edge) is technically a completely new node (edge).
- (A specter is haunting ADAM: the specter of graph matching.)
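For what it's worth, here's a minimal illustration of that eq=False point, using a stand-in node class rather than ADAM's real predicates: because equality and hashing fall back to object identity, a rebuilt node with the same contents is a brand-new dictionary key.

```python
from collections import Counter

import attr


@attr.s(frozen=True, eq=False)  # eq=False => identity-based __eq__/__hash__, like the real predicates
class FakeNodePredicate:
    label = attr.ib()


counts = Counter()
original = FakeNodePredicate(label="cube-root")
counts[original] += 3

# "Updating" the pattern means building a new node with the same contents...
replacement = FakeNodePredicate(label="cube-root")

assert replacement != original   # identity equality: they are different keys
assert counts[replacement] == 0  # ...so the old counts are unreachable via the new node
```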
So let me see if I can condense this down (lossy of course, and adding some things):

- Matches aren't unique, so whichever match we use is effectively arbitrary (the existing learners already live with this; see template_learner.py:193.)
- propose_updated_patterns() turned out to be a lot more complicated than expected to implement, for several reasons. The quickest solution for subset is probably to violate immutability, though I prefer to avoid that.
I"ll plan to think a little more about these, especially proposing updated patterns.
So I'm not entirely sure of the best way to respond to all the open threads without being confusing, so I'm going to start with some overall feedback / respond to the condensed version, and will try my best to make it clear which thread each specific comment responds to :)
Might we take the 'easy' route for now, treat an arbitrary matching as the 'canonical' match, and acknowledge the contrastive learner won't be purely deterministic? (Mostly annoying for reproducibility and catching bugs.) We can track this problem in its own issue and circle back to solving it better if we have time. This is essentially the same 'hack' the learning modules use, though I suspect their lack of determinism is less impactful overall than it will be here, for the reasons you mention.
propose_updated_patterns() & Immutability: If there's anywhere within ADAM where a class should be mutable, it's arguably the Patterns themselves, as these should, and do, evolve over time. The biggest problem with removing the immutability guarantee now is that (a) mutability causes its own set of problems, and (b) the original codebase was designed with immutability in mind, so it's possible that somewhere we use a Pattern in sets or dicts as a key, and allowing the patterns themselves to mutate would then be a problem. Before deciding to go the easier route of mutable patterns we'd need to do a relatively complete check for this condition.
I believe in theory learners can have more than one template for a concept (Pursuit can, for example), but for both learner types the current curriculum design ensures that the first pattern generated does match the correct object. So we could allow contrastive learning to occur so long as the learner only has one hypothesis for a concept. That 'gets around' problem one via the curriculum design, which is fine given that for the current phase we have complete control over the curriculum.
I agree with starting with the aggregate weight across all pairs. We may need a more fine-grained system in the future but for now aggregation sounds good to me.
@lichtefeld
propose_updated_patterns: I agree that patterns are probably the place where mutability is most reasonable, and that changing things at this point is difficult. I agree that before implementing the hack we'd need to check for "things that rely on immutability", and patterns used as keys in sets or dicts are one concrete way things might rely on immutability. Only learning for concepts with one hypothesis sounds like a good-enough alternative; I'll start thinking through the details on that.

Re: updating only one pattern at once as a fix to propose_updated_patterns(), here are some minor problems with simple solutions:

- We probably need a way to get the singleton hypotheses out of the TemplateLearner, say concepts_to_singleton_hypotheses() -> Mapping[Concept, PerceptionGraphTemplate].
- propose_updated_patterns() can simply reject any updates to concepts with multiple hypotheses (logging a note at debug level).

Then there's the more complicated problem of node identity. I think that because we're currently limiting ourselves to the subset learner, we could pretend it will accept the update exactly and leave the general solution as an issue for the future. Does that make sense?
Because pattern nodes are immutable, we have to create new ones if we want to change the weights. But nodes implement, and must implement, identity rather than value equality. These modified-weight nodes have a different identity from the original, thus are unequal.
This poses a problem for my proposed weight assignment rule. For that rule, we need to track node identity or something very close. (If we go to counting by node type, I think that loses too much of the graph structure to be useful.) If we change node identity then our old "node counts" dictionary/counter is no longer helpful, because we'll get a value of 0 when we look up the new version of the same old node.
(Come to think of it, with this update rule we'd likely have problems with hypothesis changes invalidating the old node IDs, such that we can't do [isolated-examples learning -> contrastive learning -> isolated-examples -> contrastive]. But that seems fine, since we don't have any plans to do that.)
I suspect we could relax the demand for node identity. What we "really want" (going back to the problems with matches not being unique) is to somehow track and update nodes based on their "place" or "role" in the graph, rather than their object identity. This sort of goes along with the pattern updating issue.
But I don't have great ideas for replacing node identity. We could replace it with something more place-like -- say a pattern covering the node and its immediate neighbors (with the node "distinguished" somehow) -- but that means doing a lot more graph matching with all the problems that entails.
(I think the problems with "neighborhood patterns" are fairly serious, mainly a matter of performance. Doing this means an extra helping of nondeterminism, but that seems like the smaller problem. The bigger problem is that it means we need to run the pattern matching a lot. The number of matches required should be proportional to the number of unique node-neighborhood configurations we've seen. That means the number of patterns we need to match could get very large depending on how unique the local graph structures are. I'm not sure if it helps or hurts that the patterns would be so small.)
Since the contrastive learner is the one changing node identity, we know the mapping, so we could address this directly. Just add the new node IDs with the old count (after updating the old count).
This works as long as the learner doesn't create any new nodes while processing the update; it breaks down if the learner does create new nodes then.
A problem with this solution is that it keeps references to old nodes. We could avoid this by deleting the old nodes from the dictionary. However, this requires further assuming that the learner doesn't use any of the old nodes. This assumption wouldn't hold if, for example, the learner rejects the pattern entirely.
Here's a kind of dumb idea. What if the contrastive learner added "tags" to each node it observes so it can identify them if/when they come up again later? That is, an identifier for each node attached to the node in the digraph structure. Since only the contrastive learner is using them it shouldn't cause too much of a problem, and because it's part of the digraph it doesn't require changing any of the predicate classes at all.
This works as long as propose_updated_pattern() preserves the "tags." I'm not sure it's reasonable to expect this, though. This seems like a borderline case of coupling: it's a touch of coupling between the contrastive learner and every apprentice learner it collaborates with.
This might work better in tandem with another solution to fall back on for identifying nodes.
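If we did go the tagging route, a minimal version might look like the sketch below, assuming the perception graphs are backed by networkx digraphs and using a made-up attribute name:

```python
import itertools

from networkx import DiGraph

_tag_counter = itertools.count()
CONTRASTIVE_TAG = "contrastive_learner_tag"  # hypothetical attribute name


def tag_untagged_nodes(digraph: DiGraph) -> None:
    """Attach a stable identifier to each node as a digraph node attribute.

    Only the contrastive learner would read these tags, and they live in the
    digraph's node-attribute dict, so the predicate classes stay untouched.
    """
    for node, attributes in digraph.nodes(data=True):
        if CONTRASTIVE_TAG not in attributes:
            attributes[CONTRASTIVE_TAG] = next(_tag_counter)
```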
Maybe the contrastive learner should track the patterns itself. Whenever we learn from a new example, if the apprentice's pattern for the concept isn't identical to what the contrastive learner last proposed, we do graph matching to figure out which parts are "the same" and migrate the counts appropriately. As with the other "multiple matches" problem, use an arbitrary match and leave "do something smarter with matches" for another time.
(Also, not immediately relevant, but I think re-adding nodes is going to turn out to be complicated, now that I'm thinking about the problem of "keeping track of node/subgraph identity over time." We'd have to keep track of "the parts that aren't in the pattern" in some way that (1) makes sense in the context of many different perception graphs, and (2) retains the graph structure. A messy solution could involve having, for each concept and each "other" concept, a pattern for "things that weren't in this pattern, and weren't in both perception graphs"... that requires somehow checking the connections at the "boundary", though, to represent the structure perfectly. Though it's not clear at this stage whether we need to store that structure.)
The minor problems and solutions look good to me.
What if, instead of the node itself, we kept counts of the number of times we'd seen a given 'ontologyNode' or CategoricalNode(label="affordance", value="rollable"), keyed by the values the nodes contain? E.g. we count the number of times Counts[node.label][node.value] occurs compared to the total of len(Counts[node.label]). The main complication here is that we don't have a uniform interface for the nodes that appear in a PerceptionGraph (I'm now very sad about this), so there's some isinstance checking that has to be done to special-case the code.
This loses any information attached to the graph structure itself, but that may be OK? I can't foresee any immediate problems with this approach, but I'd also plan around being wrong here and be pleasantly surprised if I was right.
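Roughly, the bookkeeping I'm imagining looks like the sketch below. The hasattr branches are a stand-in for the isinstance special-casing against the real node types, and I've used the total count under a label as the denominator rather than len(Counts[node.label]):

```python
from collections import Counter, defaultdict
from typing import Any, Optional, Tuple


def label_and_value(node: Any) -> Optional[Tuple[str, Any]]:
    """Map a perception-graph node to a (label, value) pair, if we know how.

    This is where the special-casing lives, since the node types don't share
    a uniform interface. The branches below are illustrative only.
    """
    if hasattr(node, "label") and hasattr(node, "value"):  # e.g. a CategoricalNode
        return node.label, node.value
    if hasattr(node, "ontology_node"):  # hypothetical: a node carrying an ontology node
        return "ontology_node", node.ontology_node
    return None  # a node type we don't know how to count


class LabelValueCounts:
    def __init__(self) -> None:
        self._counts = defaultdict(Counter)

    def observe(self, node: Any) -> None:
        key = label_and_value(node)
        if key is not None:
            label, value = key
            self._counts[label][value] += 1

    def frequency(self, label: str, value: Any) -> float:
        """How often this value occurred, out of all observations with this label."""
        total = sum(self._counts[label].values())
        return self._counts[label][value] / total if total else 0.0
```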
This seems like less manual bookkeeping, and while it does couple the components, it's not like ADAM is as uncoupled as we'd like overall (not that this should be a reason to add more coupling when we can avoid it). I don't like this idea but I don't hate it either.
We'd need to do some benchmarking again (yay!) to see how much of an impact matching lots of tiny graphs has. We could generate this test by just creating test data around the size we expect our graphs to be. The only problem with this approach is we know we'd be implementing something that's likely to slow down as the search space of features grows... I'm hesitant to even suggest benchmarking due to that potential impact and a desire to avoid the last three months of this project returning to the "how do you optimize the NP-complete subgraph isomorphism matching" problem. :)
@lichtefeld My impression is the graph structure is simpler/less informative than it was, and under that assumption I think tracking ontology and categorical nodes seems like a good enough solution. I'll implement that.
ETA: Thanks for the idea!
@lichtefeld I've implemented the count-weighting formula for ontology node predicates and categorical predicates based on the ontology node/categorical value counting scheme. (Categoricals don't yet carry a label.) Correct me if I'm wrong, but I think that's all the node types we want to count right now. I looked over the other types, and they don't obviously make sense:

- Anys obviously carry no information to count, and And itself carries nothing we can use.
- As for the Axis, Geon, CrossSection, Region, or Path things, an ImmutableSet[Point] is hashable, and if the pattern is basically static, the values shouldn't change.

I have a few more things to finish up, but I think this "proof of concept" is almost ready for testing. I have one more issue I'm planning to work on re: the contrastive learner itself. I'm currently working on implementing the subset learner side of things. I'll also need to go back to the contrastive learning experiment code and make sure that still makes sense.
(The contrastive learner issue is this: I realized I'm not sure it makes sense to use the concepts from the LanguageConceptAlignment, because that might be contingent on what the object learner detected and not the "ground truth" for the observation. So I need to go check where that concept alignment comes from.)
> (Categoricals don't yet carry a label.)
I suppose that's not a bug strictly speaking (so long as all values across all categories are unique), but given the uniqueness guarantee isn't enforced anywhere, it seems like a bug waiting to happen. I'll open a new issue to discuss this, because there's a reasonable argument for not adding a category label.
--
I believe the list of things we're not currently counting is a fine place to start. The biggest concern I currently see is that we don't have a great way to identify differences in the strokes, but we can approach that after the initial implementation is complete. This may be one of the reasons to integrate a GNN model into ADAM?
The major changes are done. I've implemented the subset learner changes. I've made sure the contrastive learning code makes sense (to me).
I went back and looked at the remaining issue re: the contrastive learner and confirmed things don't work as I wanted. The LanguageConceptAlignment returned from enrich() is wrong because it tries to match the concepts as recognized in the perception graph to the language. So my understanding is that if the learner thinks the cube is a sphere, it will try to match the cube template to the language and fail, so the LanguageConceptAlignment won't include any semantic nodes/concepts.
What we really need is a way to ask the apprentice learner for the concept(s) it recognizes in the language for each scene. I think I have a simple enough hack for objects. We can call copy_with_new_nodes(), passing the object learner's surface template for every possible concept. This should work because there are no slots for us to align, so the alignment code will just look for a span alignment and eventually it should find the "right" concept for each scene.
We may need to do something smarter for actions. In particular, if we need to handle something like slot0_gives_slot1_to_slot2, the two fixed strings gives and to are separated by a slot that will cause matching to always fail. (When there is no alignment of slots to spans, we ignore the slot nodes and just try to match the fixed strings, so we'd be looking for "gives to", which won't ever match an actual example of the template.) There might be other complications for actions.
I should be able to address this tomorrow and get started on testing. @lichtefeld, where can I find the current curriculum when I start on that?
@spigo900 I'll send you a zip of the current curriculum.
I've addressed the alignment issue using the template matching hack I described above. The contrastive learner is now pulling the concepts from the language rather than from what the apprenticed subset learner recognized in the perception. I've confirmed the concept pairs being pulled out look reasonable -- we're getting 5 each of (cube, sphere), (cube, pyramid) and (sphere, pyramid).
There's one problem left to debug. The problem is that the contrastive learner isn't able to match the apprentice's patterns for the concepts to the corresponding graphs. This results in the contrastive learner learning nothing at all. I haven't looked too deeply into this yet. I plan to give it a harder look on Monday.
Actually, now that I checked the node types, I think I know what the problem is. There's no ObjectClusterNode but there is an ObjectSemanticNode in the perception graphs, so my guess is the apprentice learner replaces the cluster node, thus causing its own patterns to no longer match the graph. Fortunately this should be pretty easy to fix.
> Actually, now that I checked the node types I think I know what the problem is. There's no ObjectClusterNode but there is an ObjectSemanticNode in the perception graphs, so my guess is the apprentice learner replaces the cluster node, thus causing its own patterns to no longer match the graph. Fortunately this should be pretty easy to fix.
Yes, the object apprentice learners replace ObjectClusterNodes with ObjectSemanticNodes.
I made the change, and things are "working", but the contrastive learner still can't learn anything from the current curriculum. The problem is there's a mismatch between the curriculum and the learning/weighting scheme we designed. There are no Categorical nodes in the perceptions, and there are of course no SemanticNodes when we look at the un-enriched perception, so there's nothing the contrastive learner can count and it doesn't learn anything.
How do we want to address this? These absences don't seem like bugs, and don't seem fixable in the current weighting scheme.
If we don't think of a better idea before then, I'll plan to spend some time Monday thinking about a different weighting scheme that works for objects.
ETA: One stupid weighting function that could work would be "increase the weight when this node is present in the difference of the perception graphs, decrease it when it's in the intersection." (For some value of "increase" and "decrease.")
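For concreteness, a sketch of that rule (the step size and the clamping to [0, 1] are arbitrary choices here, not anything we've settled on):

```python
def updated_weight(
    current_weight: float,
    in_difference: bool,
    in_intersection: bool,
    step: float = 0.1,
) -> float:
    """Nudge a pattern node's weight based on one contrastive pair.

    Nodes that show up in the symmetric difference of the two perception
    graphs get more weight; nodes in the intersection get less. Clamped to
    [0, 1] purely for convenience.
    """
    if in_difference:
        current_weight += step
    elif in_intersection:
        current_weight -= step
    return max(0.0, min(1.0, current_weight))
```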
@lichtefeld I thought about this (see below), and if we weight things based on matching, the only predicates we can meaningfully weight at the moment (ETA: among those actually present in the subset object learner's hypotheses) are StrokeGNNRecognitionPredicates. See the section below.
Should I go ahead and implement such a weighting scheme? I suspect the specific scheme doesn't matter too much in light of this, and it would be easy enough to extend the "counting" function to cover this predicate type.
On the other hand if weighting schemes are so limited then maybe they're too limited for an interesting experiment and not worth bothering with. Through such a scheme the contrastive object learner should be able to learn/teach something about the performance (precision/recall?) of the GNN recognizer. But this is boring in the sense that "the GNN says it's probably a cube" is completely opaque to us humans as a feature. It's not clear what this feature means about the input, other than "a bunch of math thought this input was a cube."
I don't think we can do better without #1110 (which I think we've discussed in meetings as relevant to contrastive learning). If we want to get more interesting results from contrastive object learning, I think we need to be able to distinguish between strokes in more interesting ways. But strokes are continuous, and we don't currently have a good way to distinguish between continuous features. I don't immediately see other ways of contrasting strokes if the GNN recognizer can't do that.
Thinking out loud about what we want from a weighting function...
It has to weight at least one of the three types of nodes that actually show up in the subset learner's hypotheses, which are:

- AnyObjectPredicate (from ObjectClusterNode)
- StrokeGNNRecognitionPredicate (from StrokeGNNRecognitionNode)
- ObjectStrokePredicate (from ObjectStroke)

But if we're just looking at what matched vs. didn't, how much information can we expect a contrastive learner to get out of these nodes? Some thoughts on the above types:
So as long as we are weighting based on the current matching setup, it seems like the GNN recognition predicate is the only predicate type for which we can really expect to learn useful weights. From a "system performance" perspective, it's probably good to know how much to weight the GNN node, but from a "learning contrasts" perspective, it seems pretty boring to only be able to weight that one feature. What we'd really like to do for objects is weight the strokes, but that is hard to do.
> Should I go ahead and implement such a weighting scheme? I suspect the specific scheme doesn't matter too much in light of this, and it would be easy enough to extend the "counting" function to cover this predicate type.
I'm not opposed to implementing such a weighting scheme. It's possible we'll need more features to better explain objects as our set increases, and as we increase the node types that support this weighting scheme.
> ... But this is boring in the sense that "the GNN says it's probably a cube" is completely opaque to us humans as a feature. It's not clear what this feature means about the input, other than "a bunch of math thought this input was a cube."
Correct, we'd need to be able to do better inspection of the GNN and find some way to exploit contrastive learning to either (a) discover when 'ADAM thinks the GNN is wrong' or (b) better train the GNN (or a different GNN, and have multiple GNN output features to investigate).
--
> So as long as we are weighting based on the current matching setup, it seems like the GNN recognition predicate is the only predicate type for which we can really expect to learn useful weights. From a "system performance" perspective, it's probably good to know how much to weight the GNN node, but from a "learning contrasts" perspective, it seems pretty boring to only be able to weight that one feature. What we'd really like to do for objects is weight the strokes, but that is hard to do.
We could explore a 'better matching' system within ADAM for strokes. The biggest issue is that anything we'd think to do is probably flawed in a way the GNN is already adapting for. For example, if I was given two sets of points and asked "are these lines similar?", I'd plot them and see if I can get a line of best fit that works 'reasonably well' for each. How could we do that automatically? Well, we could just calculate a line of best fit (or a polynomial fit to the nth degree). However, this seems fairly brittle. Alternatively we could try to calculate the angle between any lines? For a cube that should get us angles of approximately ~90 degrees. This 'angle between strokes' is then potentially a continuous value that needs to be matched? However, that brings us back to the "what happens when we extract a cube with only three sides" problem. Subset itself can't easily handle multiple patterns for the same concept.
(That said I'm actually not opposed to trying to use some math to calculate that type of feature. Seems like a fairly-low hanging fruit to embed)
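For reference, that low-hanging fruit might look something like the sketch below, assuming a stroke is just an array of (x, y) points (which may not match ADAM's actual stroke representation):

```python
import itertools
from typing import List, Sequence

import numpy as np


def stroke_direction(points: np.ndarray) -> np.ndarray:
    """Unit direction of the best-fit line through a stroke's (x, y) points (via PCA/SVD)."""
    centered = points - points.mean(axis=0)
    # The first right singular vector is the direction of greatest variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def pairwise_stroke_angles(strokes: Sequence[np.ndarray]) -> List[float]:
    """Angles (in degrees, folded into [0, 90]) between each pair of strokes.

    For an ideal cube-ish outline we'd expect angles near 90 or 0; the feature
    is brittle in exactly the ways discussed above (e.g. a cube extracted with
    only three sides).
    """
    angles = []
    for a, b in itertools.combinations(strokes, 2):
        cosine = abs(float(np.dot(stroke_direction(a), stroke_direction(b))))
        angles.append(float(np.degrees(np.arccos(np.clip(cosine, 0.0, 1.0)))))
    return angles
```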
And there's a question of if this level of feature interest is important anyway when instead we could put the time into having contrastive learning integrate with (the/a) GNN system.
--
Once affordances are a thing that's a potentially better capture for the contrastive learner? (It's at least hopefully a categorical node which can be counted more easily...)
I just extended weighting to cover StrokeGNNNodes; it took ~10 minutes, plus ~5-10 of debugging small unrelated issues. (Turns out I forgot to actually call propose_updated_patterns() after constructing the updated pattern.) The results show it works as expected, though they are fairly boring: it weights the graph recognition nodes as high as possible. The GNN doesn't break on any of the contrastive learning samples and misclassify the paired instances as having the same class.
(I'm adding 0.5 to the final weight for "things we can actually weight" because otherwise it's impossible to distinguish between "it's at the default weight" (1.0) and "it's at the max weight for this scheme" (also 1.0). We may eventually want to do something smarter re: relative weights of unweightable nodes vs. weighted nodes.)
On a semi-related note, it looks like the situation doesn't store the path to its own feature.yaml file. I was wondering why it wasn't getting copied when it looked like it should be according to this bit of code. I was going to use the copies to check the GNN's output for the contrastive samples used. I may look into this more tomorrow.
I agree, affordances seem more interesting for contrasts. Though if the affordance learner runs as part of the action learner, we'll never naturally end up with an affordance in the object patterns, so there won't be any to weight. It definitely feels like we should be able to do something with affordances and contrastive learning, though. Maybe this is a case where we want to do the "reintroduce nodes in pattern" thing (though here the nodes were never present so it's not strictly re-adding nodes that got shaved off).
I don't have more to add at this point, though I want to think more about this tomorrow.
Looks like the feature thing is a simple problem at least. When we load the curriculum we search for feature files using the glob expression feature_*, but the filename is just feature.yaml, so it doesn't match.
One difficulty with using affordances here is that affordances would run after the object learner has already replaced the object cluster node. So the contrastive object learner can't straightforwardly work with the affordance learner's enriched output perception graphs, because the object learner's patterns won't match that graph anymore. We would have to either hack it (undo the cluster->semantic replacement in the graph, or replace the cluster nodes in the pattern), or... probably there's a better solution if I thought about it more.
ETA: re: the hack, it probably makes more sense to change the graph, since that doesn't force extra special-casing to deal with pattern updates.
Some thoughts here on handling distances and some looser thoughts on introducing missing pattern nodes.
As mentioned at this week's team meeting, this needs to handle distances in order to handle actions, because distances are likely to show up in actions. For example, with "take": you have to be pretty close to an object to take it. Distances pose some issues for contrastive learning. First, distances are represented as two separate nodes. Second, distances are continuous.
Distances are represented as two separate nodes: A discrete direction and orientation (axis) as well as a node measuring the distance in meters. One problem that may come up is that the patterns we're matching and contrasting might contain one of these nodes but not the other -- the distance without the axis, or the axis without the distance. I think this is not a problem for the current approach, or rather to the extent it's a problem we'd want to deal with it in other ways. However, this might provide some motivation for adding missing nodes to patterns. If one of the two distance nodes is only relevant when distinguishing action X from action Y then maybe it's useful to be able to add that back in. Looking at our 20 actions, I don't see an obvious pair for which this should be true, so I don't think it's worth implementing that at this point, but we might discover such a pair later and find that this is worthwhile.
The more difficult problem, I think, is that distances are continuous. That means they cause all the same problems for contrastive learning as colors and strokes. However, in theory we already have to deal with the continuity of distances to learn actions in the first place, as long as we're not strictly relying on the actions GNN. I'm hoping that whatever approach we use for actions, we can also use for contrastive learning. @lichtefeld, thoughts on this?
One way of doing this might look as follows: keep a next_id: int and just increase it each time we assign some IDs. Maybe write a convenience function that handles assigning IDs when we create a pattern from a graph. Then we increment the next unused ID by the number of nodes in the graph.

> The more difficult problem, I think, is that distances are continuous. That means they cause all the same problems for contrastive learning as colors and strokes. However, in theory we already have to deal with the continuity of distances to learn actions in the first place, as long as we're not strictly relying on the actions GNN. I'm hoping that whatever approach we use for actions, we can also use for contrastive learning. @lichtefeld, thoughts on this?
It's on my list to implement after I wrap up the initial affordance learner, but I'm planning to match continuous values with a distribution. So during the learning stage, every introduction of a new distance would add to a running mean, standard deviation, max, and min. (Max and min are mostly for being able to display those values... I'm unsure they're strictly needed for determining a match.) Come time for matching, a match is made if the observed value has a greater than X% chance of being in the distribution formed by the mean and standard deviation. This % chance should probably be set by a parameter for the moment. Ideally this value would instead contribute to the 'match score' in some way, but that's a larger technical challenge than I want to take on if we don't have to.
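Roughly the sketch below -- the normal assumption and the two-sided test are just one way to cash out "X% chance of being in the distribution," and the threshold is the parameter I mentioned:

```python
from math import erf, sqrt


class RunningDistribution:
    """Running mean / standard deviation (Welford's algorithm), plus min and max for display."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def observe(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def stddev(self) -> float:
        return sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0

    def matches(self, value: float, threshold: float = 0.05) -> bool:
        """True if the value is plausibly drawn from the learned distribution.

        Treats the distribution as normal and asks whether the two-sided tail
        probability at this value exceeds `threshold` (the X% parameter).
        """
        if self.count < 2 or self.stddev == 0.0:
            return value == self.mean  # degenerate case: only exact repeats match
        z = abs(value - self.mean) / self.stddev
        two_sided_p = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))
        return two_sided_p >= threshold
```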
@spigo900 Thoughts on this approach? We also discussed in slightly in our technical meeting. If you're at a point where you need something to implement this may be a reasonable task to take on.
I think I'd need to consider this problem more to have any more advice.
@lichtefeld I think that makes sense if we can implement it. I'm not seeing an easy place to slot in that logic. Ideally it would go in the node but it's not clear how that would work since the nodes are immutable. Also we kind of have to "confirm" a match before we can finalize the update, so putting it in the equivalence logic seems doubly not ideal. I think we used to do some post-processing after intersection, and that might be a good fit for the subset learner, but leaves other learners hanging. Overall not sure what to do here.
I agree about IDs. Re: match ratio, as far as PyCharm can find we currently use match ratio only for propose-but-verify and cross-situational learners. (I thought pursuit used it as well but that's not showing up.) Currently we're only using the subset learner as it's not clear how to update patterns for the other two, and the subset learner doesn't compute a match ratio (it's relatively fragile that way). We could change this, but I'm not familiar enough with the ratio stuff at the moment to comment on how complicated that might be. Also couldn't comment yet on the "doesn't count in ratio" approach for missing nodes.
It seems like we have to deal with one of two really annoying problems:
I'll have to think more about this.
I think ideally we'd probably want to redefine our perception graph and pattern graph data structures to conform to the new assumptions but that's almost certainly out of scope.
I agree that the resolution of updating the pattern nodes is probably a post_processing step, which should be possible to define... I'd have to go through the match process to see where that could be slotted in the easiest.
Also, huh, would have sworn Pursuit used the match ratio somewhere... I wonder if PyCharm just can't find it because of inheritance structures...
It may be worth ~1/2 hour of investigation to see how much would break if we could just let pattern nodes be mutable. I suspect a lot would break, so that change wouldn't be practical, but we wouldn't know for sure without investigating.
Ah, I figured out the ratio issue. Pursuit does use a match ratio. What Pursuit doesn't do is use compute_match_ratio() to calculate it; it uses its own instance method, _find_partial_match(). The objects version of this method could easily be refactored to use compute_match_ratio(). It looks like the pursuit attributes and pursuit relations learners would require only slightly more work. Though that's not needed at this point.
I'll plan to take a half hour today to look into immutability.
You know, I think maybe mutating pattern nodes is okay. I was slightly worried mutability would mean we couldn't hash pattern nodes, thus breaking a lot of things, but I ran the tests and nothing broke. It looks like attrs is smart enough to still take care of hashing when eq=False. I looked at the usages of NodePredicate and didn't see anything obviously broken.
I think we still want to strictly control where and when we mutate nodes -- we probably want to be able to match patterns without updating the underlying nodes -- but I don't see any place where it obviously breaks things. I looked at references to NodePredicate, to PerceptionGraphPattern, and to PerceptionGraphTemplate. I didn't see anything obviously broken, though of course I didn't look at all of the references in detail in ~30 minutes.
I am slightly worried this is missing something -- I suspect PyCharm is not picking up on references to instance variables, only arguments and return types. But I can't think of anything that should break.
For distances, I think we want to constrain this predicate update to happen only on a "successful" pattern match. The problem is that we don't know what "success" means from inside the matching code -- that's defined by the learners themselves. Given this, I think we want to contain the update logic in a confirm_match() method on the Pattern that delegates to each node, passing a match object. This is not ideal, because it means we have to remember to add this call in the appropriate places, and we need to get a match object in some places where we didn't use to do that. However, it would prevent accidental updates.
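To sketch the shape I have in mind (interface only -- the class names and the alignment mapping are placeholders, not existing ADAM API):

```python
from typing import Any, Iterable, Mapping


class NodePredicate:  # stand-in for ADAM's real NodePredicate
    def confirm_match(self, matched_graph_node: Any) -> None:
        """Called only once a learner has accepted a match.

        The default does nothing; a continuous-value predicate would override
        this to fold the matched value into its running distribution.
        """


class Pattern:  # stand-in for the real pattern class
    def __init__(self, node_predicates: Iterable[NodePredicate]) -> None:
        self._node_predicates = list(node_predicates)

    def confirm_match(self, pattern_to_graph_node: Mapping[NodePredicate, Any]) -> None:
        """Delegate confirmation to each node predicate, passing its aligned graph node.

        `pattern_to_graph_node` stands in for whatever the real match object
        exposes as the pattern-node-to-perception-node alignment.
        """
        for predicate in self._node_predicates:
            matched = pattern_to_graph_node.get(predicate)
            if matched is not None:
                predicate.confirm_match(matched)
```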
Ok, do you want to take on the task of implementing this matching & mutability for continuous values? Also this new matching strategy may have an impact on how an intersection of a match works? Essentially we'd want to make sure the continuous value with a given label isn't dropped from the pattern.
Edit: I think we should probably only update any probability range after a successful match. There's an argument to be made that during training we should match only by the label of the continuous value and during evaluation should actually test the range but that may not be easy to implement right now?
Sure, I can take this on. I'll create a separate issue for continuous value matching/mutability.
As part of previous presentation feedback, we want to ensure we are taking advantage of the designed curriculum to enable contrastive learning examples. Programmatically, for ADAM to compare two different scene images against each other, we would need two different inputs to compare in order to determine the distinctive features between two different actions. To learn from contrastive examples I'd like to consider the following approach:
This implementation would be non-trivial as it's in essence an entire new learning pathway for ADAM. I suspect this would take 2-3 weeks at minimum to fully implement and do basic testing of experiments.