isi-vista / adam

Abduction to Demonstrate an Articulate Machine
MIT License

Contrastive Learning - Discussion #1102

Open lichtefeld opened 2 years ago

lichtefeld commented 2 years ago

As part of previous presentation feedback, we want to ensure we are taking advantage of the designed curriculum to enable contrastive learning examples. Programmatically, to compare two scene images against each other within ADAM, we would need two different inputs whose comparison determines the distinctive features between two different actions. To learn from contrastive examples I'd like to consider the following approach:

  1. Load all samples for each concept (action) to learn into memory in a processed-perception state.
  2. Choose half of the samples for each concept (at random) to form the base subset/pursuit model for a given concept
  3. For the remaining 'half' of the samples, perform a contrastive comparison between two different samples for every pair of concepts. (For 20 concepts that's 190 pairwise comparisons with 5 samples in each comparison; see the sketch after this list.)
    • For each concept pair we'd generate a collection of the distinctive attributes between the concepts. Ideas on how to use this data:
    • Use this distinctive information to 'weight' nodes in the graph with 'attention' by trying to match those elements first. This may improve the speed of matching, as most previous optimizations focused on failing fast, since a success requires, de facto, a complete match regardless of when we fail.
    • Use the distinctive attributes to augment the original subset pattern by re-introducing any nodes which have been ablated away but discovered to be distinctive in certain comparisons. This information could be augmented with hints about when it was most useful or perhaps 'not count' against a failed match when not present.
    • Other ideas ?
    • This will require a new training pipeline to handle the pairwise contrastive examples (not a trivial addition to the codebase)
  4. At test time we can use all this additional information to state:
    • Distinctions between two concepts that we've seen previously but may not be present in this specific example.
    • Hopefully improved detection of concepts in a noisy environment due to the presence of contrastive data.
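
A minimal sketch of the pairing in step 3 (hypothetical names and data layout; the real pipeline would operate over processed perception inputs rather than plain sample objects):

```python
import itertools
import random

def make_contrastive_pairs(held_out, pairs_per_concept_pair=5, seed=0):
    """Pair up held-out samples for every pair of concepts.

    `held_out` maps each concept name to its held-out samples. A sample may be
    reused across pairs (or never used at all), which is fine for this scheme.
    """
    rng = random.Random(seed)
    contrastive_pairs = {}
    for concept_a, concept_b in itertools.combinations(sorted(held_out), 2):
        contrastive_pairs[(concept_a, concept_b)] = [
            (rng.choice(held_out[concept_a]), rng.choice(held_out[concept_b]))
            for _ in range(pairs_per_concept_pair)
        ]
    return contrastive_pairs

# 20 concepts -> C(20, 2) = 190 concept pairs, 5 sample pairs each.
```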

This implementation would be non-trivial, as it's in essence an entire new learning pathway for ADAM. I suspect this would take 2-3 weeks at minimum to fully implement and do basic testing of experiments.

spigo900 commented 2 years ago

@lichtefeld I have questions about the details. I think you suggested doing contrastive learning for objects first to simplify things, so I will frame my questions in those terms. Say we have 3 object concepts (I think this is the right number for the current curriculum?), with 10 samples each, so 30 distinct samples total. Then my understanding of the above is:

  1. We load all 10 samples for each concept by reading from the data produced by the stroke-based pipeline. Each sample is stored on disc as a file and loaded as a triple of (situation, language, perception).
  2. We take 5 random samples for each of the 3 concepts, so 15 total, and train a single subset/pursuit object learner on those 15 samples.
    1. Is this right? Or is the plan to train 3 separate learners, one for each concept?
  3. We now want to compare the choose(3, 2) = 3 pairs of concepts. We somehow (?) use the remaining 15 samples not used for training.
    1. For these questions suppose we are trying to compare a concept cube with a concept sphere.
    2. It sounds like we are pairing the samples in some way, but how? Are we matching held-out samples? That is, pairing every sample of cube with exactly one sample of sphere? If so, how is that matching chosen? Or, are we looking at all pairs -- so all 25 possible pairs of cube samples with sphere samples? Is this a matter of "to be decided"? If we're not pairing samples then what are we doing?
    3. Suppose we have these (cube sample, sphere sample) pairs, however we choose those. What exactly are we doing with them? My guess is that first we want to run the learner we trained on the other samples. We get the learner's augmented graph output for the cube perception and for the sphere perception, then we want to compute a symmetric difference between those graphs. Is that right?
      1. So, try to match the two augmented graph outputs as much as possible, then remove whatever nodes matched from both graphs, and the difference should be the union of what's leftover in each graph. (And we might want to do some postprocessing, of course, like removing unconnected nodes from the result. See the sketch after this list.)
    4. But this gives us one difference per contrastive-sample-pair per concept-pair, so not 3 differences but 3*K where K is the number of sample-pairs per concept-pair. Are these 3*K differences the things that we want? I suspect not.
  4. Why does this require a whole new training pipeline?
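
To make the symmetric-difference computation in 3.iii concrete, here's a minimal sketch (hypothetical helper; it assumes the enriched outputs are networkx DiGraphs and that some matcher has already produced a node-to-node mapping, which is the hard part):

```python
import networkx as nx

def symmetric_difference(graph_a: nx.DiGraph, graph_b: nx.DiGraph, node_mapping: dict):
    """Return the unmatched remainder of each graph.

    `node_mapping` maps graph_a nodes to the graph_b nodes they matched; how
    that mapping is computed (subgraph isomorphism, etc.) is left out here.
    """
    leftover_a = graph_a.subgraph(n for n in graph_a if n not in node_mapping).copy()
    leftover_b = graph_b.subgraph(n for n in graph_b if n not in node_mapping.values()).copy()
    # Optional postprocessing, e.g. dropping isolated nodes from each remainder.
    leftover_a.remove_nodes_from(list(nx.isolates(leftover_a)))
    leftover_b.remove_nodes_from(list(nx.isolates(leftover_b)))
    return leftover_a, leftover_b
```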

My guess is that the answer to the last two questions ("3*K differences," "why a new pipeline") is: "we want to train a new kind of learner to do (something) with contrastive samples." Is this guess correct?

If so, here's my guess about the (something): This new learner would be applied to learn patterns for symmetric differences in concepts, and it would learn from symmetric differences between the paired perception graphs. Is that right?

spigo900 commented 2 years ago

And now that I think about it, the new learner (if there is one) has to work differently from the "normal" learners (for objects, relations, actions, etc.), because we don't have a text description of the difference. So in that sense it probably needs to be less like the "normal" learners and more like the "special case" learners (generics, the functional learner).

lichtefeld commented 2 years ago

@spigo900 (responding to your newest comment first) Correct, I think this is a special case learner, though perhaps more like the concept learners: it may not make much sense to learn a symmetric difference between a cube and person throws ball, so we could have multiple instantiations of this special case learner, one for each concept type (focusing on objects and actions for Phase 3).

Here are my thoughts/answers on your previous questions:

2.i: Correct, we'd train a single learner on 3 concepts.

3.ii: We are pairing samples up; the exact method is 'to be decided'. My suggestion is we'd present 5 samples of a cube-sphere pair where the cube and sphere come from the held-out samples. It's possible a sample never gets presented to the learner in this model, or that the learner sees the same sample for cube 5 times against different samples for sphere. That's fine.

3.iii: Yes, I'd run the previous learner's decode on the scene to see if we correctly identify the objects first, then compute the symmetric difference as described.

3.iv / 4: Rather than just the 3*K difference sets, I think we'd want to look at the specific features for each sample k in the set and use this information to provide a 'weight' (maybe better thought of as an 'importance' value) for each node in the pattern for a given concept. Additionally, these differences may be useful for presenting a feature as "important to describe 'cube'" even if it isn't needed for the specific comparison seen at evaluation time. This requires an entirely new training pipeline if we want to keep each sample as the triple (situation, language, perception) rather than combining the pair into one sample. If we did combine them into one sample, we'd need some way to identify which object is the cube and which is the sphere for the learning process; we're taking advantage of the curriculum (assuming the instructor never lies) to give us the correct label for a set of features.

Option 2 for this approach is to train a different model entirely based on the resulting differences, but I've not given that enough thought (or research) to comment on its viability. That said, the (something) idea is an interesting approach and, given we can identify object clusters in a perception graph even when not given a linguistic label, could be an interesting way to determine how to describe a novel object as "different from a cube by X", "different from a ball by Y", etc.

Let me know if I need to clarify anything in these responses.

spigo900 commented 2 years ago

@lichtefeld I agree it makes sense to have different instantiations for different kinds of concept.

Thank you for the answers. 2.i, 3.ii and 3.iii are clear now. I have more questions re: the 3.iv / 4 response which I'll write up tomorrow morning.

spigo900 commented 2 years ago

@lichtefeld So first, here are some things I think we agree on (correct me if I'm wrong):

Now, some things I'm less sure about (questions!). I split this up into sections, which I hope makes them easier to think about/respond to.

Specific features & calculating importance

I was confused about "look at the specific features for each sample k", but here's my interpretation. What this means is that we want the learner to look at (a) a matching of the graphs, from which the difference graph falls out and which is strictly more informative (so we can distinguish "this came from cube and this came from sphere"), and (b) the specific nodes (or connected components?) in the difference rather than treating the graph as a black box. (If that's wrong, let me know. I can imagine other interpretations.)

For concreteness, here's an example in the symbolic framework. Maybe we would observe that the difference between a table and a chair is that a chair has a back, and usually a chair's legs are shorter than a table's legs (pretending we had that latter level of detail). Maybe sometimes we can't see the back, so it's not in the difference graphs, but when it does show up it guarantees (in distribution πŸ™‚) that what we're looking at is a chair, so this should have a high importance weight. On the other hand, if the legs are not shorter then that's fine -- maybe the table is a coffee table. That means the leg length feature should have lower weight. (Hopefully the difference in leg length causes the chair and table legs not to match so they show up in the difference graph and we can notice that difference.)

Here's one idea: the importance might be something like, in the context of a difference between cube and sphere, how often has node X been in the difference, out of all the times that node X was present in at least one of the graphs?

(This is one reason why we want to look at the matched pair, not just the difference: If we only look at the difference, then we can't distinguish between "we couldn't see the back of the chair" and "the chair and the table both (for whatever strange reason) had backs." The other reason is so that we know which pattern to push the importance weight into.)
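
A minimal sketch of that counting rule (hypothetical bookkeeping, kept per concept pair; "node" stands in for whatever notion of node identity we settle on):

```python
from collections import Counter

class ImportanceCounts:
    """Counts for the rule: importance(x) = times x appeared in the difference
    / times x was present in at least one of the two graphs."""

    def __init__(self):
        self.in_difference = Counter()
        self.present = Counter()

    def observe(self, nodes_in_difference, nodes_present_in_either_graph):
        self.in_difference.update(nodes_in_difference)
        self.present.update(nodes_present_in_either_graph)

    def importance(self, node) -> float:
        seen = self.present[node]
        return self.in_difference[node] / seen if seen else 0.0
```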

Doing stuff with importance

If I understand right, this assessment of node importance is something we want to push down into the patterns of the other learners while we are learning. Assuming so, I suppose this is one of the reasons you wanted to be able to mutate the pattern graphs? Also, is this something we'd do after observing each pair of samples?

The other thing we might want to do is add "important" (based on n>1 sample) missing nodes back into the pattern, with extra info to flag them as "match optional." I'm not sure about the details and will probably have more questions later, but I just want to confirm this basic picture is correct.

Both of these will probably require going backwards from the matched graph pair to the patterns that the object (or other) learner recognized in each of them.

Where does the contrastive learner get its concept signals/labels?

On a different line of thinking: The contrastive learner needs to know the relevant concept for the paired samples it is learning from. But concepts are internal to the learners and not part of the language (which is just text), the unprocessed perception, or the situation. So this must be something we get by running the lower-level learner, right?

Asides

These are mostly unrelated to the previous discussion and hopefully easier to answer:

spigo900 commented 2 years ago

Oh, one more question: The top post suggests an experiment. What is it experimenting with/on? What comes out of that experiment? What are the "deliverables" -- what numbers, charts, PDFs, etc. do we hope to get from the experiment? My guess is that we would want to look at how the pattern importance values change/what nodes get added as a result of contrastive learning. We would probably want to look at these specifics directly -- since there are so few concepts, there's no need to aggregate; we can just look at the data itself. So we might produce, for each of the concepts, a modified graph visualization that labels each pattern node with the difference between its final and initial weights. But there could be other things.

lichtefeld commented 2 years ago

I've organized my response with similar headings to yours. I agree with the four bullet points you start off with so we are on the same page there.

Specific Features & calculating Importance

a) Correct, we want the contrastive learner to look at the difference graph as more informative, which requires knowing which concept the distinguishing features came from. b) For the nodes (or connected components), I'm operating under the assumption that "the learner knows that connected to the 'root' of the concept are all the potential features which contribute to the identification of the concept". The graph is treated as less of a 'black box' in this example and more as a data structure with a known value type: the learner knows it is operating over a set of features.

I think the simple model of importance you've described is a good starting point.

Using Importance

Both of these basic ideas match my original thoughts on how to take advantage of the importance information. I'd think we would want to push the information back into the appropriate patterns for a concept after every difference evaluation, yes. It may mean we also need to pass a reference to the patterns down the pipeline to make this access easier.

Getting Concept signals/labels

We will assume our curriculum design never misidentifies a concept when described linguistically. That is, if the language for a scene is 'sphere', then a sphere concept is going to be present in the scene. With this assumption we can use the linguistic phrase as our label for each of the two samples in a pair. Programmatically it probably makes more sense to ask the first-stage learners for the concept which matches the input linguistic phrase, but either way we can identify the concept from the linguistic label rather than needing to identify the objects in the graph correctly first.

Asides

x) Correct, I think the contrastive learner is a meta-learner that influences how other learners make their decisions. y) I don't think we can use the contrastive learner at test time, other than perhaps to generate the feature difference lists for display purposes (DARPA would like to see such a list if possible). z) I don't recall the answer to this. I don't think they do, because I believe I'm pulling the current feature list for the viewer from the graph and strokes are present there, but you'd have to double-check the implementation to be sure. The ObjectClusterNode is still replaced by an ObjectConceptNode (required currently if we want a recognized object to be able to participate in an ActionConcept).

Experiment with/on

As far as deliverables, what you've described is what I envisioned as the easy "here's how taking advantage of difference detection impacts our ability to define an object". Hopefully system accuracy at identifying concepts at test time also increases. I agree that the best representation of this improvement is more of a qualitative analysis of the pattern graphs rather than strictly a quantitative result.

--

These are all good questions. Hopefully I've answered them or at least provided feedback that's useful enough to continue the conversation on a topic :)

spigo900 commented 2 years ago

@lichtefeld Thanks! Those were good answers. I think that all makes sense.

I have one more question about mapping concept signals to the graph: It seems reasonable to assume the description describes something actually in the scene, but I'm confused about how we know what that something is if the scene may contain additional objects that aren't described. Can the scene have other objects, or are we assuming no "background noise"?

I may have another question about aligning the language with concepts consistent with "ground truth", but I looked into it and I think I have an answer about how we do that in the form of enrich_during_learning(), so maybe not. πŸ™‚

lichtefeld commented 2 years ago

I have one more question about mapping concept signals to the graph: It seems reasonable to assume the description describes something actually in the scene, but I'm confused about how we know what that something is if the scene may contain additional objects that aren't described. Can the scene have other objects, or are we assuming no "background noise"?

We're really leaning into the ability to design the curriculum explicitly for the learner, so we can assume essentially no "background noise" beyond what the signal processing has to handle. In a scene designed to train "sphere", for example, we won't also have a 'ball' present.

That said, in principle we could probably handle a scene with background objects, as we do have a way to extract a sub-perception graph for an object concept (even if I need to go remind myself exactly how to do that). We could essentially transform a single input scene into multiple potential perception graphs and use the learner's decode as the labels for the contrastive learner. While this would be possible in theory, I think we can just keep it in reserve as an explanation of 'how to train with noisy real-world data' rather than relying on it in place of an explicitly curated curriculum.

spigo900 commented 2 years ago

Okay, so we assume that there is no background noise because we can and have designed out the background noise. We probably have a way of handling them if we need to, but there's no need at this stage. That all makes sense.

I think there are some technical details left to work out, but I think I have enough information now that I can start thinking about a design, so I will start thinking about that, and hopefully get a draft to you within the next few working days.

ETA: that was a lot of thinks πŸ™‚

spigo900 commented 2 years ago

@lichtefeld I've drafted a design for contrastive learning. I'll wait for your feedback before moving ahead with this.


Also, not immediately relevant, but a question about actions. How would you pick out the concept for action samples, given the language might be different while the concept is the same? Say "a cube falls" vs. "a sphere falls". I have thought of a simple workaround for this case, but I'm curious if curriculum design has changed/can change such that this doesn't pose a problem anymore.

spigo900 commented 2 years ago

I made a list of some complications I've run into so far. I marked with βœ… the one "thread" and solution that seem straightforward to resolve. I'm marking with ❓ threads that I haven't resolved yet, with 🚨 for problems, ➑️ for "more on those problems", and β“βœ… for possible solutions. (Let me know if this is helpful. I marked it because I was having trouble seeing the structure in this sea of text.)

Sorry this is long. It may make sense on Monday for me to write a shorter version.

Here's the list:

(A specter is haunting ADAM: The specter of graph matching.)

spigo900 commented 2 years ago

So let me see if I can condense this down (lossy of course, and adding some things):

I"ll plan to think a little more about these, especially proposing updated patterns.

lichtefeld commented 2 years ago

I'm not entirely sure of the best way to respond to all the open threads without being confusing, so I'm going to start with some overall feedback / respond to the condensed version, and I'll try my best to make clear what any specific comments are responding to :)

Re: Alignments & non-deterministic matching between graphs.

Might we, for now, take the 'easy' route of declaring an arbitrary matching the 'canonical' match and acknowledge that the contrastive learner won't be purely deterministic? (Mostly annoying for reproducibility and catching bugs.) We can track this problem in its own issue and circle back to solving it better if we have time. This is essentially the same 'hack' the learning modules use, though I suspect their lack of determinism is less impactful overall than this will be, for the reasons you mention.

propose_updated_patterns() & Immutability

If there's anywhere within ADAM where a class should be mutable, it's arguably the Patterns themselves, as these should, and do, evolve over time. The biggest problems with removing the immutability guarantee now are that a) mutability causes its own set of problems, and b) the original codebase was designed with immutability in mind, so it's possible that somewhere we use a Pattern as a key in sets or dicts, in which case allowing the patterns themselves to mutate would be a problem. Before deciding to go the easier route of mutable patterns we'd need to do a relatively complete check for this condition.

I believe in theory learners can have more than one template for a concept (Pursuit can, for example), but for both learner types the current curriculum design ensures that the first pattern generated does match the correct object. So we could allow contrastive learning to occur so long as the learner only has one hypothesis for a concept. That 'gets around' problem one via the curriculum design, which is fine given that for the current phase we have complete control over the curriculum.

Re: Weights from multiple concept pairs

I agree with starting with the aggregate weight across all pairs. We may need a more fine-grained system in the future but for now aggregation sounds good to me.

spigo900 commented 2 years ago

@lichtefeld

spigo900 commented 2 years ago

Re: updating only one pattern once as a fix to propose_updated_patterns(), here are some minor problems with simple solutions:

Then there's the more complicated one of node identity. I think that because we're currently limiting ourselves to the subset learner, we could pretend it will accept the update exactly and leave the general solution as an issue for the future. Does that make sense?

A problem with tracking node identity

Because pattern nodes are immutable, we have to create new ones if we want to change the weights. But nodes implement, and must implement, identity rather than value equality. These modified-weight nodes have a different identity from the original, thus are unequal.

This poses a problem for my proposed weight assignment rule. For that rule, we need to track node identity or something very close. (If we go to counting by node type, I think that loses too much of the graph structure to be useful.) If we change node identity then our old "node counts" dictionary/counter is no longer helpful, because we'll get a value of 0 when we look up the new version of the same old node.

(Come to think of it, with this update rule we'd likely have problems with hypotheses changes invalidating the old node IDs, such that we can't do [isolated-examples learning -> contrastive learning -> isolated-examples -> contrastive]. But that seems fine, since we don't have any plans to do that.)

I suspect we could relax the demand for node identity. What we "really want" (going back to the problems with matches not being unique) is to somehow track and update nodes based on their "place" or "role" in the graph, rather than their object identity. This sort of goes along with the pattern updating issue.

But I don't have great ideas for replacing node identity. We could replace it with something more place-like -- say a pattern covering the node and its immediate neighbors (with the node "distinguished" somehow) -- but that means doing a lot more graph matching with all the problems that entails.

(I think the problems with "neighborhood patterns" are fairly serious, mainly a matter of performance. Doing this means an extra helping of nondeterminism, but that seems like the smaller problem. The bigger problem is that we need to run the pattern matching a lot. The number of matches required should be proportional to the number of unique node-neighborhood configurations we've seen. That means the number of patterns we need to match could get very large depending on how unique the local graph structures are. I'm not sure if it helps or hurts that the patterns would be so small.)

Solving the node identity tracking problem

Solution 1: Do lots of manual bookkeeping

Since the contrastive learner is the one changing node identity, we know the mapping, so we could address this directly. Just add the new node IDs with the old count (after updating the old count).

This works as long as the learner doesn't create any new nodes of its own while processing the update; the bookkeeping has no mapping for such newly created nodes.

A problem with this solution is that it keeps references to old nodes. We could avoid this by deleting the old nodes from the dictionary. However, this requires further assuming that the learner doesn't use any of the old nodes. This assumption wouldn't hold if, for example, the learner rejects the pattern entirely.
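
A sketch of that bookkeeping, in the delete-the-old-entries variant (hypothetical; assumes the contrastive learner knows the old-node -> new-node mapping it created and that the apprentice won't reuse the old nodes):

```python
def migrate_counts(counts: dict, old_to_new: dict) -> dict:
    """Carry observation counts over to the re-created pattern nodes.

    Drops the old entries so we don't keep references to stale nodes; this
    assumes the apprentice learner won't reuse any of the old nodes.
    """
    migrated = dict(counts)
    for old_node, new_node in old_to_new.items():
        if old_node in migrated:
            migrated[new_node] = migrated.pop(old_node)
    return migrated
```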

Solution 2: "Wild node tagging"

Here's a kind of dumb idea. What if the contrastive learner added "tags" to each node it observes so it can identify them if/when they come up again later? That is, an identifier for each node attached to the node in the digraph structure. Since only the contrastive learner is using them it shouldn't cause too much of a problem, and because it's part of the digraph it doesn't require changing any of the predicate classes at all.

This works as long as propose_updated_patterns() preserves the "tags." I'm not sure it's reasonable to expect this, though. This seems like a borderline case of coupling: it ties the contrastive learner, just a touch, to every apprentice learner it collaborates with.

This might work better in tandem with another solution to fall back on for identifying nodes.
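
A sketch of the tagging idea (assuming the pattern's underlying structure is exposed as a networkx digraph, which may or may not match the real classes):

```python
import uuid
import networkx as nx

def tag_nodes(pattern_digraph: nx.DiGraph, tag_key: str = "contrastive_tag") -> None:
    """Attach a stable identifier to each node's attribute dict.

    Only the contrastive learner reads these tags, so the predicate classes
    themselves don't have to change -- but propose_updated_patterns() would
    need to preserve the attribute for the tags to survive an update.
    """
    for _node, attributes in pattern_digraph.nodes(data=True):
        attributes.setdefault(tag_key, uuid.uuid4())
```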

Solution 3: Just add more patterns!

Maybe the contrastive learner should track the patterns itself. Whenever we learn from a new example, if the apprentice's pattern for the concept isn't identical to what the contrastive learner last proposed, we do graph matching to figure out which parts are "the same" and migrate the counts appropriately. As with the other "multiple matches" problem, use an arbitrary match and leave "do something smarter with matches" for another time.

(Also, not immediately relevant, but I think re-adding nodes is going to turn out to be complicated, now that I'm thinking about the problem of "keeping track of node/subgraph identity over time." We'd have to keep track of "the parts that aren't in the pattern" in some way that (1) makes sense in the context of many different perception graphs, and (2) retains the graph structure. A messy solution could involve having, for each concept and each "other" concept, a pattern for "things that weren't in this pattern, and weren't in both perception graphs"... that requires somehow checking the connections at the "boundary", though, to represent the structure perfectly. And it's not clear at this stage whether we need to store that structure.)

lichtefeld commented 2 years ago

The minor problems and solutions look good to me.

Solving the node identity problem

Do lots of manual bookkeeping

What if, instead of the node itself, we kept counts keyed by the values the nodes contain -- e.g. the number of times we'd seen a given 'ontologyNode' or CategoricalNode(label="affordance", value="rollable")? That is, we count how often Counts[node.label][node.value] occurs compared to the total for Counts[node.label]. The main complication here is that we don't have a uniform interface for the nodes that appear in a PerceptionGraph (I'm now very sad about this), so there's some isinstance checking that has to be done to special-case the code.

This loses any information attached to the graph structure itself, but that may be OK? I can't foresee any immediate problems with this approach, but I'd also plan around being wrong here and be pleasantly surprised if I was right.
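
A rough sketch of that value-based counting, with hypothetical stand-ins for ADAM's real node classes (hence the isinstance special-casing), reading the denominator as the total number of observations for a label:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical stand-ins for ADAM's real perception node types.
@dataclass(frozen=True)
class CategoricalNode:
    label: str
    value: str

@dataclass(frozen=True)
class OntologyNode:
    handle: str

def count_node(counts, node) -> None:
    """Tally what the node contains, special-casing the node types we know how to read."""
    if isinstance(node, CategoricalNode):
        counts[node.label][node.value] += 1
    elif isinstance(node, OntologyNode):
        counts["ontology"][node.handle] += 1
    # Other node types are skipped for now.

def frequency(counts, label, value) -> float:
    """Fraction of observations for `label` that had this `value`."""
    total = sum(counts[label].values())
    return counts[label][value] / total if total else 0.0

counts = defaultdict(lambda: defaultdict(int))
count_node(counts, CategoricalNode(label="affordance", value="rollable"))
```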

'wild node tagging'

This seems like less manual bookkeeping, and while it does couple the components, it's not like ADAM is as uncoupled as we'd like overall πŸ™ƒ (not that this should be a reason to add more coupling when we can avoid it). I don't like this idea but I don't hate it either.

Just add more patterns!

We'd need to do some benchmarking again (yay!) to see how much of an impact matching lots of tiny graphs has. We could generate this test by just creating test data around the size we expect our graphs to be. The only problem with this approach is we know we'd be implementing something that's likely to slow down as the search space of features grows... I'm hesitant to even suggest benchmarking due to that potential impact and a desire to avoid the last three months of this project returning to the "how do you optimize the NP-complete sub-graph isomorphism matching" problem. :)

spigo900 commented 2 years ago

@lichtefeld My impression is the graph structure is simpler/less informative than it was, and under that assumption I think tracking ontology and categorical nodes seems like a good enough solution. I'll implement that.

ETA: Thanks for the idea!

spigo900 commented 2 years ago

@lichtefeld I've implemented the count-weighting formula for ontology node predicates and categorical predicates based on the ontology node/categorical value counting scheme. (Categoricals don't yet carry a label.) Correct me if I'm wrong but I think that's all the node types we want to count right now. I looked over the other types, and they don't obviously make sense:

I have a few more things to finish up, but I think this "proof of concept" is almost ready for testing. I have one more issue I'm planning to work on re: the contrastive learner itself. I'm currently working on implementing the subset learner side of things. I'll also need to go back to the contrastive learning experiment code and make sure that still makes sense.

(The contrastive learner issue is this: I realized I'm not sure it makes sense to use the concepts from the LanguageConceptAlignment because that might be contingent on what the object learner detected and not the "ground truth" for the observation. So I need to go check where that concept alignment comes from.)

lichtefeld commented 2 years ago

(Categoricals don't yet carry a label.)

I suppose that's not a bug strictly speaking (so long as all values across all categories are unique), but given the uniqueness guarantee isn't enforced anywhere it seems like a bug waiting to happen. I'll open a new issue to discuss this, because there's a reasonable argument for not adding a category label.

--

I believe the list of things we're not currently counting is a fine place to start. The biggest concern I see is that we don't currently have a great way to identify differences in the strokes, but we can approach that after the initial implementation is complete. This may be one of the reasons to integrate a GNN model into ADAM?

spigo900 commented 2 years ago

The major changes are done. I've implemented the subset learner changes. I've made sure the contrastive learning code makes sense (to me πŸ™‚).

I went back and looked at the remaining issue re: the contrastive learner and confirmed things don't work as I wanted. The LanguageConceptAlignment returned from enrich() is wrong because it tries to match the language against the concepts as recognized in the perception graph. So my understanding is that if the learner thinks the cube is a sphere, it will try to match the cube template to the language and fail, so the LanguageConceptAlignment won't include any semantic nodes/concepts.

What we really need is a way to ask the apprentice learner for the concept(s) it recognizes in the language for each scene. I think I have a simple enough hack for objects. We can call copy_with_new_nodes(), passing the object learner's surface template for every possible concept. This should work because there are no slots for us to align, so the alignment code will just look for a span alignment and eventually it should find the "right" concept for each scene.

We may need to do something smarter for actions. In particular if we need to handle something like slot0_gives_slot1_to_slot2, the two fixed strings gives and to are separated by a slot that will cause matching to always fail. (When there is no alignment of slots to spans, we ignore the slot nodes and just try to match the fixed strings, so we'd be looking for gives to which won't ever match an actual example of the template.) There might be other complications for actions.

I should be able to address this tomorrow and get started on testing. @lichtefeld, where can I find the current curriculum when I start on that?

lichtefeld commented 2 years ago

@spigo900 I'll send you a zip of the current curriculum.

spigo900 commented 2 years ago

I've addressed the alignment issue using the template matching hack I described above. The contrastive learner is now pulling the concepts from the language rather than from what the apprenticed subset learner recognized in the perception. I've confirmed the concept pairs being pulled out look reasonable -- we're getting 5 each of (cube, sphere), (cube, pyramid) and (sphere, pyramid).

There's one problem left to debug. The problem is that the contrastive learner isn't able to match the apprentice's patterns for the concepts to the corresponding graphs. This results in the contrastive learner learning nothing at all. I haven't looked too deeply into this yet. I plan to give it a harder look on Monday.

spigo900 commented 2 years ago

Actually, now that I checked the node types I think I know what the problem is. There's no ObjectClusterNode but there is an ObjectSemanticNode in the perception graphs, so my guess is the apprentice learner replaces the cluster node, thus causing its own patterns to no longer match the graph. πŸ™ƒ Fortunately this should be pretty easy to fix.

lichtefeld commented 2 years ago

Actually, now that I checked the node types I think I know what the problem is. There's no ObjectClusterNode but there is an ObjectSemanticNode in the perception graphs, so my guess is the apprentice learner replaces the cluster node, thus causing its own patterns to no longer match the graph. πŸ™ƒ Fortunately this should be pretty easy to fix.

Yes the Object apprentice learners replace ObjectClusterNodes with ObjectSemanticNodes.

spigo900 commented 2 years ago

I made the change, and things are "working", but the contrastive learner still can't learn anything from the current curriculum. The problem is there's a mismatch between the curriculum and the learning/weighting scheme we designed. There are no Categorical nodes in the perceptions, and of course no SemanticNodes when we look at the un-enriched perception, so there's nothing the contrastive learner can count, and it doesn't learn anything.

How do we want to address this? These absences don't seem like bugs, and don't seem fixable in the current weighting scheme.

If we don't think of a better idea before then, I'll plan to spend some time Monday thinking about a different weighting scheme that works for objects.

ETA: One stupid weighting function that could work would be "increase the weight when this node is present in the difference of the perception graphs, decrease it when it's in the intersection." (For some value of "increase" and "decrease.")
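
That rule might look something like this (a sketch; the step size and the clamping range are arbitrary):

```python
def update_weight(weight: float, in_difference: bool, step: float = 0.1) -> float:
    """Nudge a node's weight up when it appears in the difference of the two
    perception graphs, down when it appears in their intersection, clamped to [0, 1]."""
    weight += step if in_difference else -step
    return max(0.0, min(1.0, weight))
```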

spigo900 commented 2 years ago

@lichtefeld I thought about this (see below), and if we weight things based on matching, the only predicates we can meaningfully weight at the moment (ETA: among those actually present in the subset object learner's hypotheses) are StrokeGNNRecognitionPredicates. See the section below.

Should I go ahead and implement such a weighting scheme? I suspect the specific scheme doesn't matter too much in light of this, and it would be easy enough to extend the "counting" function to cover this predicate type.

On the other hand if weighting schemes are so limited then maybe they're too limited for an interesting experiment and not worth bothering with. Through such a scheme the contrastive object learner should be able to learn/teach something about the performance (precision/recall?) of the GNN recognizer. But this is boring in the sense that "the GNN says it's probably a cube" is completely opaque to us humans as a feature. It's not clear what this feature means about the input, other than "a bunch of math thought this input was a cube."

I don't think we can do better without #1110 (which I think we've discussed in meetings as relevant to contrastive learning). If we want to get more interesting results from contrastive object learning, I think we need to be able to distinguish between strokes in more interesting ways. But strokes are continuous and we don't currently have a good way to distinguish between continuous features. I don't immediately see other ways of contrasting strokes if the GNN recognizer can't do that.

Thinking about the weighting function

Thinking out loud about what we want from a weighting function...

It has to weight at least one of the three types of nodes that actually show up in the subset learner's hypotheses, which types are:

But if we're just looking at what matched vs. didn't, how much information we can expect a contrastive learner to get out of these nodes? Some thoughts on the above types:

So as long as we are weighting based on the current matching setup, it seems like the GNN recognition predicate is the only predicate type for which we can really expect to learn useful weights. From a "system performance" perspective, it's probably good to know how much to weight the GNN node, but from a "learning contrasts" perspective, it seems pretty boring to only be able to weight that one feature. What we'd really like to do for objects is weight the strokes, but that is hard to do.

lichtefeld commented 2 years ago

Should I go ahead and implement such a weighting scheme? I suspect the specific scheme doesn't matter too much in light of this, and it would be easy enough to extend the "counting" function to cover this predicate type.

I'm not opposed to implementing such a weighting scheme. It's possible we'll need more features to better explain objects as our object set increases, at which point we'd also increase the node types that support this weighting scheme.

... But this is boring in the sense that "the GNN says it's probably a cube" is completely opaque to us humans as a feature. It's not clear what this feature means about the input, other than "a bunch of math thought this input was a cube."

Correct, we'd need to be able to do better inspection of the GNN and find some way to exploit contrastive learning to either a) discover when 'ADAM thinks the GNN is wrong' or b) better train the GNN (or a different GNN, so we'd have multiple GNN output features to investigate).

--

So as long as we are weighting based on the current matching setup, it seems like the GNN recognition predicate is the only predicate type for which we can really expect to learn useful weights. From a "system performance" perspective, it's probably good to know how much to weight the GNN node, but from a "learning contrasts" perspective, it seems pretty boring to only be able to weight that one feature. What we'd really like to do for objects is weight the strokes, but that is hard to do.

We could explore a 'better matching' system within ADAM for strokes. The biggest issue is that anything we'd think to do is probably flawed in a way the GNN is already adapting for. For example, if I were given two sets of points and asked "are these lines similar", I'd plot them and see if I can get a line of best fit that works 'reasonably well' for each. How could we do that automatically? Well, we could just calculate a line of best fit (or a polynomial fit to the nth degree). However, this seems fairly brittle. Alternatively we could try to calculate the angle between any lines? For a cube that should get us angles of approximately 90 degrees. This 'angle between strokes' is then potentially a continuous value that needs to be matched? However, that brings us back to the "what happens when we extract a cube with only three sides" problem. Subset itself can't easily handle multiple patterns for the same concept.

(That said, I'm actually not opposed to trying to use some math to calculate that type of feature. Seems like fairly low-hanging fruit to embed.)
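
As a sketch of that low-hanging fruit (assuming strokes come to us as arrays of 2D points; hypothetical, and exactly the sort of thing the GNN may already be compensating for):

```python
import numpy as np

def stroke_direction(points: np.ndarray) -> np.ndarray:
    """Unit direction of the best-fit line for an (N, 2) array of stroke points."""
    centered = points - points.mean(axis=0)
    # First right-singular vector = direction of maximal variance = best-fit line.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def angle_between_strokes(stroke_a: np.ndarray, stroke_b: np.ndarray) -> float:
    """Angle in degrees (0-90) between the best-fit lines of two strokes."""
    cos = abs(float(np.dot(stroke_direction(stroke_a), stroke_direction(stroke_b))))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

# Two roughly perpendicular cube edges should come out near 90 degrees.
edge_1 = np.array([[0.0, 0.0], [1.0, 0.02], [2.0, 0.0]])
edge_2 = np.array([[0.0, 0.0], [0.01, 1.0], [0.0, 2.0]])
print(angle_between_strokes(edge_1, edge_2))
```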

And there's a question of whether this level of feature detail is important anyway, when we could instead put the time into having contrastive learning integrate with (the/a) GNN system.

--

Once affordances are a thing, they're potentially a better feature for the contrastive learner to capture? (At least an affordance is hopefully a categorical node, which can be counted more easily...)

spigo900 commented 2 years ago

Weighting

I just extended weighting to cover StrokeGNNNodes; it took ~10 minutes plus ~5-10 minutes of debugging small unrelated issues. (Turns out I forgot to actually call propose_updated_patterns() after constructing the updated pattern. πŸ™ƒ) The results show it works as expected, though they are fairly boring: it weights the graph recognition nodes as high as possible. The GNN doesn't break on any of the contrastive learning samples by misclassifying the paired instances as having the same class.

(I'm adding 0.5 to the final weight for "things we can actually weight" because otherwise it's impossible to distinguish between "it's at the default weight" (1.0) and "it's at the max weight for this scheme" (also 1.0). We may eventually want to do something smarter re: relative weights of unweightable nodes vs. weighted nodes.)

On a semi-related note, it looks like the situation doesn't store the path to its own feature.yaml file. I was wondering why it wasn't getting copied when it looked like it should be according to this bit of code. I was going to use the copies to check the GNN's output for the contrastive samples used. I may look into this more tomorrow.

Affordances

I agree, affordances seem more interesting for contrasts. Though if the affordance learner runs as part of the action learner, we'll never naturally end up with an affordance in the object patterns, so there won't be any to weight. It definitely feels like we should be able to do something with affordances and contrastive learning, though. Maybe this is a case where we want to do the "reintroduce nodes in pattern" thing (though here the nodes were never present so it's not strictly re-adding nodes that got shaved off).

Strokes

I don't have more to add at this point, though I want to think more about this tomorrow.

spigo900 commented 2 years ago

Looks like the feature thing is a simple problem at least. When we load the curriculum we search for feature files using the glob expression feature_* but the filename is just feature.yaml so it doesn't match.

spigo900 commented 2 years ago

One difficulty with using affordances here is that affordances would run after the object learner has already replaced the object cluster node. So the contrastive object learner can't straightforwardly work with the affordance learner's enriched output perception graphs, because the object learner's patterns won't match to that graph anymore. We would have to either hack it (undo the cluster->semantic replacement in the graph, or replace the cluster nodes in the pattern), or... probably there's a better solution if I thought about it more.

ETA: re: the hack, it probably makes more sense to change the graph, since that doesn't force extra special-casing to deal with pattern updates.

spigo900 commented 2 years ago

Some thoughts here on handling distances and some looser thoughts on introducing missing pattern nodes.

Distances

As mentioned at this week's team meeting, this needs to handle distances in order to handle actions, since distances are likely to show up in actions. For example, with "take": you have to be pretty close to an object to take it. Distances pose two issues for contrastive learning: first, distances are represented as two separate nodes; second, distances are continuous.

Distances are represented as two separate nodes: A discrete direction and orientation (axis) as well as a node measuring the distance in meters. One problem that may come up is that the patterns we're matching and contrasting might contain one of these nodes but not the other -- the distance without the axis, or the axis without the distance. I think this is not a problem for the current approach, or rather to the extent it's a problem we'd want to deal with it in other ways. However, this might provide some motivation for adding missing nodes to patterns. If one of the two distance nodes is only relevant when distinguishing action X from action Y then maybe it's useful to be able to add that back in. Looking at our 20 actions, I don't see an obvious pair for which this should be true, so I don't think it's worth implementing that at this point, but we might discover such a pair later and find that this is worthwhile.

The more difficult problem, I think, is that distances are continuous. That means they cause all the same problems for contrastive learning as colors and strokes. However, in theory we already have to deal with the continuity of distances to learn actions in the first place, as long as we're not strictly relying on the actions GNN. I'm hoping that whatever approach we use for actions, we can also use for contrastive learning. @lichtefeld, thoughts on this?

Missing nodes

One way of doing this might look as follows:

  1. Add IDs to pattern nodes.
    1. These aren't used for matching, only as an ugly hack so we can trace pattern nodes across iterations.
    2. How do we generate unique IDs, though? I want to say we leave it to the learner to make sure IDs are unique within its own current patterns, i.e. at any given point no two pattern nodes share an ID no matter which of the learner's patterns they belong to. That introduces its own headaches, but it is a partial answer to the question. Let's say the learner uses incremental IDs. It tracks next_id: int and just increases it each time it assigns some IDs. Maybe write a convenience function that handles assigning IDs when we create a pattern from a graph. Then we increment the next unused ID by the number of nodes in the graph. (See the sketch after this list.)
  2. Track neighboring perception nodes by type, attributes, and incoming/outgoing edges to pattern node neighbors (which includes the edge labels). Count how often this "configuration" appears conditional on the contrast pair we're looking at.
    1. Though tracking attributes invites the ugly Continuous Value Problem again... πŸ™
  3. If we get above, say, X observations for a given neighbor configuration, and Y% of contrastive observations for the current concept pair include it, add a special pattern node to the pattern graph. This node is a wrapper around a normal pattern node, and is "allowed to fail" somehow.
    1. The "allowed to fail" part might require changes to the graph pattern code to make it work. I'm hoping we either already have or could write a hack, similar to what we do for intersecting patterns, that means we wouldn't need to change the actual graph matching code.
    2. Problem: Now we need the learner to assign this new node an ID. πŸ™ƒ
      1. And they should increment their internal next_id when they do this, to avoid any possible problems.
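
A sketch of the ID bookkeeping from step 1 (hypothetical; the IDs would live with the learner, not in the matching code):

```python
from typing import Dict, Hashable, Iterable

class PatternNodeIds:
    """Per-learner incremental IDs so pattern nodes can be traced across iterations."""

    def __init__(self) -> None:
        self.next_id = 0

    def assign_ids(self, pattern_nodes: Iterable[Hashable]) -> Dict[Hashable, int]:
        """Give every node of a freshly created pattern a fresh, unique ID."""
        ids = {}
        for node in pattern_nodes:
            ids[node] = self.next_id
            self.next_id += 1
        return ids
```
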
lichtefeld commented 2 years ago

The more difficult problem, I think, is that distances are continuous. That means they cause all the same problems for contrastive learning as colors and strokes. However, in theory we already have to deal with the continuity of distances to learn actions in the first place, as long as we're not strictly relying on the actions GNN. I'm hoping that whatever approach we use for actions, we can also use for contrastive learning. @lichtefeld, thoughts on this?

It's on my list to implement after I wrap up the initial affordance learner, but I'm planning to match continuous values with a distribution. During the learning stage, every introduction of a new distance would update a running mean, standard deviation, max, and min. (Max and min are mostly for being able to display those values... I'm unsure they are strictly needed for determining a match.) Come time for matching, a match is made if the observed value has a greater than X% chance of being in the distribution formed by the mean and standard deviation. This % chance should probably be set by a parameter for the moment. Ideally this value would instead contribute to the 'match score' in some way, but that's a larger technical challenge than I want to take on if we don't have to.
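
A sketch of that idea (hypothetical class; Welford-style running statistics, with the acceptance threshold as a parameter and a two-sided normal tail probability standing in for "X% chance of being in the distribution"):

```python
import math

class ContinuousValueMatcher:
    """Running mean/std/min/max for one continuous feature, plus a crude match test."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0          # running sum of squared deviations (Welford)
        self.min = math.inf
        self.max = -math.inf

    def observe(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def std(self) -> float:
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0

    def matches(self, value: float, threshold: float = 0.05) -> bool:
        """True if the value is not in the far tails of the observed distribution."""
        if self.count == 0:
            return False
        if self.count < 2 or self.std == 0.0:
            return value == self.mean
        z = abs(value - self.mean) / self.std
        two_sided_tail = math.erfc(z / math.sqrt(2))  # P(|Z| >= z) under a normal
        return two_sided_tail >= threshold
```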

@spigo900 Thoughts on this approach? We also discussed in slightly in our technical meeting. If you're at a point where you need something to implement this may be a reasonable task to take on.

Re: Missing nodes

I think I'd need to consider this problem more to have any more advice.

spigo900 commented 2 years ago

@lichtefeld I think that makes sense if we can implement it. I'm not seeing an easy place to slot in that logic. Ideally it would go in the node, but it's not clear how that would work since the nodes are immutable. Also we kind of have to "confirm" a match before we can finalize the update, so putting it in the equivalence logic seems doubly not ideal. I think we used to do some post-processing after intersection, and that might be a good fit for the subset learner, but it leaves other learners hanging. Overall I'm not sure what to do here.

Missing nodes

I agree about IDs. Re: match ratio, as far as PyCharm can find we currently use match ratio only for propose-but-verify and cross-situational learners. (I thought pursuit used it as well but that's not showing up.) Currently we're only using the subset learner as it's not clear how to update patterns for the other two, and the subset learner doesn't compute a match ratio (it's relatively fragile that way). We could change this, but I'm not familiar enough with the ratio stuff at the moment to comment on how complicated that might be. Also couldn't comment yet on the "doesn't count in ratio" approach for missing nodes.

It seems like we have to deal with one of two really annoying problems:

I'll have to think more about this.

lichtefeld commented 2 years ago

I think ideally we'd probably want to redefine our perception graph and pattern graph data structures to conform to the new assumptions but that's almost certainly out of scope.

I agree that the resolution for updating the pattern nodes is probably a post-processing step, which should be possible to define... I'd have to go through the match process to see where that could be slotted in most easily.

Also, huh, would have sworn Pursuit used the match ratio somewhere... I wonder if PyCharm just can't find it because of inheritance structures...

It may be worth ~1/2 hour of investigation to see how much would break if we just let pattern nodes be mutable. I suspect a lot, so that change wouldn't be practical, but we won't know for sure without investigating.

spigo900 commented 2 years ago

Ah, I figured out the ratio issue. Pursuit does use a match ratio. What pursuit doesn't do is use compute_match_ratio() to calculate it. It uses its own instance method, _find_partial_match(). The objects version of this method could easily be refactored to use compute_match_ratio(). It looks like the pursuit attributes and pursuit relations learners would require only slightly more work. Though that's not needed at this point.

I'll plan to take a half hour today to look into immutability.

spigo900 commented 2 years ago

You know, I think maybe mutating pattern nodes is okay. I was slightly worried mutability would mean we couldn't hash pattern nodes thus breaking a lot of things, but I ran the tests and nothing broke. It looks like attrs is smart enough to still take care of hashing when eq=False. I looked at the usages for NodePredicate and didn't see anything obviously broken.

I think we still want to strictly control where and when we mutate nodes -- we probably want to be able to match patterns without updating the underlying nodes -- but I don't see any place where it obviously breaks things. I looked at references to NodePredicate, to PerceptionGraphPattern, and to PerceptionGraphTemplate. I didn't see anything obviously broken, though of course I didn't look at all of the references in detail in ~30 minutes.

I am slightly worried this is missing something -- I suspect PyCharm is not picking up on references to instance variables, only arguments and return types. But I can't think of anything that should break.

For distances I think we want to constrain this predicate update to happen only on a "successful" pattern match. The problem is that we don't know what "success" means from inside the matching code -- that's defined by the learners themselves. Given this, I think we want to contain the update logic to a confirm_match() method on the Pattern that delegates to each node, passing a match object. This is not ideal because it means we have to remember to add this call in the appropriate places, and we need to get a match object in some places where we didn't previously do that. However, it would prevent accidental updates.
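
Roughly what I have in mind, as a toy sketch (hypothetical names; the real Pattern/NodePredicate interfaces would differ):

```python
from typing import Mapping

class Pattern:
    """Toy stand-in for a pattern graph whose nodes may carry continuous-value state."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def confirm_match(self, node_to_matched_value: Mapping) -> None:
        """Called by a learner only after it decides a match was a "success".

        Delegates to each node that knows how to update itself (e.g. a running
        distribution for a continuous value); other nodes are left untouched.
        """
        for node in self.nodes:
            update = getattr(node, "confirm_match", None)
            if update is not None and node in node_to_matched_value:
                update(node_to_matched_value[node])
```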

lichtefeld commented 2 years ago

OK, do you want to take on the task of implementing this matching & mutability for continuous values? Also, this new matching strategy may have an impact on how the intersection of a match works? Essentially we'd want to make sure the continuous value with a given label isn't dropped from the pattern.

Edit: I think we should probably only update any probability range after a successful match. There's an argument to be made that during training we should match only by the label of the continuous value, and during evaluation actually test the range, but that may not be easy to implement right now?

spigo900 commented 2 years ago

Sure, I can take this on. I'll create a separate issue for continuous value matching/mutability.