Better matching for continuous features #1119

Closed · spigo900 closed this issue 2 years ago

spigo900 commented 2 years ago

We want to implement better matching for continuous features, mainly to support better action learning. We need to be able to match distances to learn constraints for certain actions, e.g. "to take something you have to be close to it". The current approach isn't ideal -- we'd like to learn a range of values for how close we have to be, rather than fixing a tolerance.

I'm including a work-in-progress plan. I'm going to edit this post as that plan changes to keep things in one place for our own sanity.

Rough design

During the learning stage, every new observation of a distance would update a running mean, standard deviation, max, and min. (Max and min are mostly there so we can display those values; I'm not sure they're strictly needed for determining a match.) Come matching time, a match is made if the observed value has a greater than X% chance of coming from the distribution described by the mean and standard deviation. This X% threshold should probably be set by a parameter for the moment. Ideally the value would instead contribute to the 'match score' in some way, but that's a larger technical challenge than I want to take on if we don't have to.

Some questions this raises:

  1. How do we calculate a match percentage? Mean and variance alone don't give us any sort of percent match. Two problems here:
    1. We don't have a distributional form, and no general one may exist. This isn't fatal, though -- there are a few methods we could reasonably use to estimate a distribution: assume a Gaussian, use percentiles, or use kernel density estimation (KDE), perhaps with a Gaussian kernel. And I'm sure there are other methods we could try.
    2. Even if we assume a particular distributional form, we don't really have a match percentage. The closest thing we have is something like a hypothesis test p-value: "how consistent is the given value with this distribution?" We do have that value, though, and it seems like a reasonable first attempt (a quick sketch follows this list).
  2. How do we perform updates? I think we want to do this in an online/constant time way that does not weight old examples too heavily (but also doesn't overweight new ones). This should probably factor into how we calculate match percentages.
  3. How do we parameterize the % chance threshold? In the current scheme I think this has to go in the node because otherwise the node wouldn't have access to the threshold at match time. We don't yet have a better way to parameterize things, I think.
  4. How do we calculate match percentages for and update on pattern-pattern matches? In general this seems to require calculating a similarity between two distributions. For now we could ignore the problem and just throw an error whenever we try to match two same-label continuous features that each have more than one observation. However, this may cause problems for contrastive learning (see section below on the "tie-in problems").
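
To make problem 1.2 a bit more concrete, here is a minimal sketch of that hypothesis-test-style score under a Gaussian assumption. The function name and the use of math.erf are my own illustrative choices here, not settled API:

```python
import math


def gaussian_match_score(value: float, mean: float, std: float) -> float:
    """Two-sided tail probability: how likely is a value at least this far from the mean?"""
    if std == 0.0:
        # Degenerate "distribution" (e.g. from a single observation): exact match or nothing.
        return 1.0 if value == mean else 0.0
    z = abs(value - mean) / std
    # P(|Z| >= z) for a standard normal Z, computed via the error function.
    return 1.0 - math.erf(z / math.sqrt(2.0))
```

A value exactly at the mean scores 1.0, and a value one standard deviation away scores about 0.32, so a threshold of 0.05 would accept anything within roughly two standard deviations.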

Issues related to Gaussian match percentages

Gaussian match percentages seem simple but pose some problems, particularly with updating. We have options:

  1. Running average/variance over all samples. We can calculate these efficiently using Welford's algorithm (sketched at the end of this section). However, this weights all examples equally, which could be a problem if, for example, we were to train ADAM on a bunch of examples and then later need to adapt it to some arbitrarily chosen new examples.
  2. Exponential smoothing, i.e. using an exponential moving average. This works fine for estimating the mean, but for the variance I haven't yet found an equivalent.

Re: estimating variance with exponential smoothing, I did some searching but didn't find anything obvious. The closest relatives I could think of were BatchNorm and the Adam optimizer. However, neither seems like a good fit. The BatchNorm paper doesn't describe any such thing; PyTorch seems to track a running mean and variance but I haven't yet found the details on how it calculates those. Meanwhile, the Adam optimizer calculates a running uncentered second moment rather than a variance/centered second moment, and I don't know of a way to calculate the centered moment from the uncentered one, so that doesn't seem helpful.

I plan to look a bit more into PyTorch's BatchNorm and also see if I can quickly find a way of getting centered variance from uncentered variance.
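
In the meantime, here is a minimal sketch of option 1: a running mean and sample variance maintained with Welford's algorithm. The class and attribute names are illustrative only:

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    count: int = 0
    mean: float = 0.0
    _m2: float = 0.0  # Sum of squared deviations from the current mean.

    def update(self, value: float) -> None:
        """Welford's single-pass update: O(1) time and memory per observation."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Note: this uses the *updated* mean, which is what keeps the update numerically stable.
        self._m2 += delta * (value - self.mean)

    @property
    def sample_variance(self) -> float:
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


# Observing four distances weights them all equally, per the caveat in option 1 above.
stats = RunningStats()
for distance in (0.9, 1.1, 1.0, 1.2):
    stats.update(distance)
assert abs(stats.mean - 1.05) < 1e-9
```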

Tie-in problems with #1102

Some problems here tie in with #1102.

First, there's pattern-pattern matching. To do contrastive learning using graph matching, I think we have to either (1) intersect each perception graph with the other perception graph, or (2) intersect each pattern with the other pattern. Alternatively, we could (3) match each pattern against both perception graphs. Currently we do (1), which I worry will give wonky results with or without the new matching system, because we can't properly estimate a distribution to match from single graphs (variance is the big issue). Doing (2) would require handling continuous pattern-pattern matching properly. Doing (3) might work, but it's different enough to give me a headache, so I would have to think about whether that approach makes sense.

Second, there's the issue of how to update continuous nodes in intersections. I don't think we want this to mutate the input pattern; that seems to go against the contract for intersection. We probably want to handle these similarly to a "regular" match, except that we confirm the matches on a copy of the original and return the updated pattern. That complicates #1102. However, given we've already bitten that bullet in other ways, this is probably okay.

Plan

  1. Add to NodePredicate an abstract method confirm_match(matched_with: Union[NodePredicate, PerceptionGraphNode]) -> None. For all existing predicates this does nothing.
  2. Define a new protocol ContinuousValueMatcher (a rough sketch follows this list). This provides the following methods:
    1. match_score(value: float) -> float -- should be between 0 and 1 in current thinking, though I think this is mainly for conceptual convenience
    2. similarity(matcher: ContinuousValueMatcher) -> float -- this ranges from 0 to 1 -- I suspect this will be problematic to compute in general, so probably we'll just limit ourselves to same-type similarity and raise an error if you ask for something more complicated. This should be symmetric in the sense that a.similarity(b) == b.similarity(a) no matter what a and b you use -- maybe with some allowance for floating point errors.
    3. merged_with(matcher: ContinuousValueMatcher) -> ContinuousValueMatcher -- again, I think we can limit ourselves to same-type merging
    4. update_on_observation(value: float) -> None
  3. Define a concrete implementation of ContinuousValueMatcher that we can use, which implements assume-it's-Gaussian matching. Use Welford's algorithm to update on single observations. Similarity is left unimplemented and raises an error, as we may not even need it given that we ignore the distributions when matching distribution nodes. Merging is implemented only when one of the matchers has just a single observation, and otherwise raises NotImplementedError(). There should be a way to merge them, but I'd have to work out the math for handling the variance merge, and since we don't need it yet, leaving it undefined for now seems OK. Some links that might be relevant for this later: math exchange answer 1, answer 2, sketchy page with similar formula.
  4. Define a new class DistributionalContinuousNodePredicate which has a label, matcher, and min_match_score. This node is mutable.
    1. __call__(graph_node) returns self.matcher.match_score(graph_node.value) >= self.min_match_score for continuous nodes and False otherwise.
    2. is_equivalent() returns true if the other node is of the same type and has the same label. We do not need to match distributions.
    3. matches_predicate(other) returns true if the type matches and the labels match. We do not need to match distributions.
    4. confirm_match(matched_with) delegates to the matcher: for continuous feature perception nodes, self.matcher.update_on_observation(matched_with.value), while for continuous feature pattern nodes we do self.matcher = self.matcher.merged_with(matched_with.matcher). Note that under the current plan this means we can't actually merge with other pattern nodes.
  5. Add a method confirm_pattern_match() to PerceptionGraphPatternMatch which mutates matched_pattern.
  6. Add a method confirm_graph_match() to PerceptionGraphPatternMatch which mutates graph_matched_against when that happens to be a PerceptionGraphPattern.
  7. Add a keyword argument to PerceptionGraphPattern.from_graph() that takes a matching threshold for continuous values.
  8. Learners take a continuous matching threshold which they propagate into their nodes. This means:
    1. Add a new attribute _matching_threshold: float to AbstractTemplateLearner.
    2. Modify each learner to pass this along to PerceptionGraphPattern.from_graph().
  9. Learners confirm matches when appropriate. This means:
    1. Modify the subset learner to confirm the match after a successful intersection.
    2. Create and link an issue to handle match confirmation for other learners.
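
For concreteness, here is a rough sketch of how the matcher protocol (step 2), the assume-it's-Gaussian matcher (step 3), and the predicate's __call__ (step 4.1) might fit together. The class and method names follow the plan above, but the bodies, the use of typing.Protocol, and the handling of non-continuous graph nodes are illustrative choices rather than the actual implementation:

```python
from __future__ import annotations

import math
from typing import Protocol


class ContinuousValueMatcher(Protocol):
    def match_score(self, value: float) -> float: ...
    def similarity(self, matcher: ContinuousValueMatcher) -> float: ...
    def merged_with(self, matcher: ContinuousValueMatcher) -> ContinuousValueMatcher: ...
    def update_on_observation(self, value: float) -> None: ...


class GaussianContinuousValueMatcher:
    """Assume-it's-Gaussian matching, updated with Welford's algorithm (step 3)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # Sum of squared deviations from the current mean.

    def update_on_observation(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def match_score(self, value: float) -> float:
        # Two-sided Gaussian tail probability, as in the sketch earlier in this issue.
        std = math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0
        if std == 0.0:
            return 1.0 if value == self.mean else 0.0
        z = abs(value - self.mean) / std
        return 1.0 - math.erf(z / math.sqrt(2.0))

    def similarity(self, matcher: ContinuousValueMatcher) -> float:
        # Deliberately unimplemented for now, per step 3.
        raise NotImplementedError("Distribution-distribution similarity is not needed yet.")

    def merged_with(self, matcher: ContinuousValueMatcher) -> ContinuousValueMatcher:
        # Step 3 only requires merging when one of the matchers has a single observation.
        if isinstance(matcher, GaussianContinuousValueMatcher) and matcher.count == 1:
            merged = GaussianContinuousValueMatcher()
            merged.count, merged.mean, merged._m2 = self.count, self.mean, self._m2
            merged.update_on_observation(matcher.mean)  # A single-sample mean is the sample itself.
            return merged
        raise NotImplementedError("General distribution merging is deferred.")


class DistributionalContinuousNodePredicate:
    """Step 4: mutable predicate holding a label, a matcher, and a minimum match score."""

    def __init__(self, label: str, matcher: ContinuousValueMatcher, min_match_score: float) -> None:
        self.label = label
        self.matcher = matcher
        self.min_match_score = min_match_score

    def __call__(self, graph_node: object) -> bool:
        # Step 4.1: continuous nodes match when the score clears the threshold;
        # any node without a continuous value simply fails to match.
        value = getattr(graph_node, "value", None)
        if value is None:
            return False
        return self.matcher.match_score(value) >= self.min_match_score
```
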
lichtefeld commented 2 years ago

Some responses to questions raised:

> How do we calculate a match percentage? Mean and variance alone don't give us any sort of percent match

Correct. I was thinking of just a simple implementation of a p-test within a Gaussian distribution. Ideally we'd use the result of the p-test as 'the weight of this node match', which would contribute to the match ratio, but to limit the amount of refactoring, a single threshold for a match works; we'll just need to find a reasonable threshold for now.

> How do we perform updates? I think we want to do this in an online/constant time way that does not weight old examples too heavily (but also doesn't overweight new ones). This should probably factor into how we calculate match percentages.

I vote for using Welford's algorithm as a starting point. Remember that we design the curriculum we train on, so we can place the responsibility for weighting observations on the curriculum design rather than on the learner.

> How do we parameterize the % chance threshold? In the current scheme I think this has to go in the node because otherwise the node wouldn't have access to the threshold at match time. We don't yet have a better way to parameterize things, I think.

I think for now we should set a single 'continuous value threshold' on the entire learner for configuration, but yes, it will need to propagate down into the nodes.

> How do we calculate match percentages for and update on pattern-pattern matches? In general this seems to require calculating a similarity between two distributions. For now we could ignore the problem and just throw an error whenever we try to match two same-label continuous features that each have more than one observation. However, this may cause problems for contrastive learning (see section below on the "tie-in problems").

I believe pattern-pattern matches are used for intersection/generalization, right? If so, we never want to drop the continuous value patterns from the ability to match (i.e. they are always present and can simply 'fail to match'), so maybe a match against the label alone is sufficient? This would be a larger problem to solve in pursuit than I think it is in subset? (Please correct me if I'm wrong here; it's been a while since I've looked at the matching code in much detail.)

Tie-Ins with #1102

Perhaps in a situation where you only have one continuous value to contrast, the answer is that it has to be contrasted on the basis of the value alone and not any distribution? You can't form a distribution from a single value, so an exact match is the best you can do. E.g., if you compare a scene with two blocks, where one is a big cube and the other is a small cube, then a contrastive difference is their absolute size when compared against each other. That comparison comes from the exact values rather than from a distribution.

This may complicate contrastive learning in other ways, but thoughts?

spigo900 commented 2 years ago

@lichtefeld That all makes sense. Some specific responses:

> I believe pattern-pattern matches are used for intersection/generalization, right?

At least intersection, probably generalization. I think letting them always match seems like a good enough solution.

> one continuous value

I think this is correct and the proposed solution is good enough. So we'd essentially fall back on the old value+tolerance matching scheme when we have just one observation. That makes sense to me.

lichtefeld commented 2 years ago

@spigo900 We may also want a count of how many samples we've seen; we probably don't want to do a distribution test with fewer than Y samples? (Maybe Y is 3?)

spigo900 commented 2 years ago

@lichtefeld Right, that makes sense. So run the distribution test once we have 3 or more examples, and fall back if we have only 1 or 2.
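
For concreteness, a small sketch of that gate, assuming it lives in the matcher's scoring method. MIN_SAMPLES_FOR_DISTRIBUTION and the fallback tolerance are illustrative stand-ins, not decided values:

```python
import math

MIN_SAMPLES_FOR_DISTRIBUTION = 3  # The "Y" suggested above.
FALLBACK_TOLERANCE = 0.1          # Stand-in for the old value-plus-tolerance scheme.


def gated_match_score(value: float, count: int, mean: float, sample_variance: float) -> float:
    """Only trust the Gaussian test once we have enough observations."""
    if count < MIN_SAMPLES_FOR_DISTRIBUTION:
        # With just 1 or 2 observations, fall back to the old tolerance check
        # against the observed value(s) rather than a fitted distribution.
        return 1.0 if abs(value - mean) <= FALLBACK_TOLERANCE else 0.0
    std = math.sqrt(sample_variance)
    if std == 0.0:
        return 1.0 if value == mean else 0.0
    z = abs(value - mean) / std
    return 1.0 - math.erf(z / math.sqrt(2.0))
```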