Closed: spigo900 closed this issue 2 years ago
Some responses to questions raised:
> How do we calculate a match percentage? Mean and variance alone don't give us any sort of percent match.
Correct; I was thinking of a simple p-value test under a Gaussian distribution. Ideally we'd use the p-value itself as "the weight of this node match," which contributes to the match ratio, but to limit the amount of refactoring, a single threshold for a match works; we'll just need to find a reasonable threshold for now.
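To illustrate the kind of Gaussian "p-test" score meant here, a minimal sketch using the two-sided tail probability (the function name and signature are illustrative, not ADAM's API):

```python
import math

def gaussian_match_score(value: float, mean: float, std: float) -> float:
    """Two-sided tail probability: how surprising `value` is under N(mean, std^2).

    Returns 1.0 when value == mean and decays toward 0 as value moves away,
    so it can serve directly as a node-match weight or be thresholded.
    """
    if std == 0.0:
        # Degenerate distribution: only an exact match is possible.
        return 1.0 if value == mean else 0.0
    z = abs(value - mean) / std
    # P(|X - mean| >= |value - mean|) for X ~ N(mean, std^2)
    return math.erfc(z / math.sqrt(2.0))
```

A value at the mean scores 1.0; a value three standard deviations out scores about 0.0027, so a threshold in the 0.01-0.05 range would reject it.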
> How do we perform updates? I think we want to do this in an online/constant time way that does not weight old examples too heavily (but also doesn't overweight new ones). This should probably factor into how we calculate match percentages.
I vote for using Welford's algorithm as a starting point. Remember that we design the curriculum we're trying to train on, so we can place the responsibility for weighting observed samples on the curriculum design rather than on the learner.
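For reference, a minimal sketch of Welford's online mean/variance update (the class name is illustrative):

```python
class WelfordAccumulator:
    """Welford's algorithm: numerically stable online mean and variance."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Incorporate one observation in O(1) time and space."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Note: uses the *updated* mean in the second factor.
        self._m2 += delta * (value - self.mean)

    @property
    def sample_variance(self) -> float:
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0
```

Each update is constant time, so this fits the online requirement above; it weights all observations equally rather than favoring old or new ones.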
> How do we parameterize the % chance threshold? In the current scheme I think this has to go in the node because otherwise the node wouldn't have access to the threshold at match time. We don't yet have a better way to parameterize things, I think.
For now I think we should set a single "continuous value threshold" on the entire learner for configuration, but yes, it will need to propagate down into the nodes.
> How do we calculate match percentages for and update on pattern-pattern matches? In general this seems to require calculating a similarity between two distributions. For now we could ignore the problem and just throw an error whenever we try to match two same-label continuous features that each have more than one observation. However, this may cause problems for contrastive learning (see section below on the "tie-in problems").
I believe pattern-pattern matches are used for intersections/generalization, right? If so, we never want the continuous value patterns to lose the ability to match (i.e., they are always present and can simply "fail to match"), so maybe a match against the label alone is sufficient? This seems like a larger problem to solve in pursuit than in subset? (Please correct me if I'm wrong here; it's been a while since I've looked at the matching code in much detail.)
Tie-Ins with #1102
Perhaps in a situation where you have only one continuous value to contrast, the answer is that it has to be contrasted on the basis of the value alone, not a distribution? You can't have a distribution over a single value, so an exact match is the best you can do. E.g., if you compare a scene with two blocks, where one is a big cube and the other is a small cube, then a contrastive difference is their absolute size when compared against each other. This comparison comes from the exact values rather than from a distribution.
This may complicate contrastive learning in other ways, but thoughts?
@lichtefeld That all makes sense. Some specific responses:
> I believe pattern-pattern matches are used for intersections/generalization right?
At least intersection, probably generalization too. Letting them always match seems like a good enough solution.
> one continuous value
I think this is correct and the proposed solution is good enough. So we'd essentially fall back on the old value+tolerance matching scheme when we have just one observation. That makes sense to me.
@spigo900 We may also want a count of how many samples we've seen; we probably don't want to do a distribution test with fewer than Y samples. (Maybe Y is 3?)
@lichtefeld Right, that makes sense. So test for 3 or more examples, fall back if we have only 1 or 2 examples.
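A sketch of that fallback logic, combining the Gaussian test with the old value-plus-tolerance scheme (the function name, threshold, and tolerance values are placeholders, not ADAM code):

```python
import math
import statistics

MIN_SAMPLES_FOR_DISTRIBUTION = 3  # the "Y" discussed above; placeholder value

def continuous_value_matches(
    value: float,
    observations: list[float],
    min_match_score: float = 0.05,
    tolerance: float = 0.25,
) -> bool:
    """Gaussian test with >= 3 observations; value + tolerance fallback otherwise."""
    if len(observations) >= MIN_SAMPLES_FOR_DISTRIBUTION:
        mean = statistics.mean(observations)
        std = statistics.stdev(observations)
        if std == 0.0:
            return value == mean
        z = abs(value - mean) / std
        # Two-sided Gaussian tail probability as the match score.
        score = math.erfc(z / math.sqrt(2.0))
        return score >= min_match_score
    # Too few samples to estimate a distribution: old-style tolerance match.
    return any(math.isclose(value, seen, rel_tol=tolerance) for seen in observations)
```

With one or two observations this degrades gracefully to "is the value within tolerance of something we've seen," which matches the single-observation discussion above.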
We want to implement better matching for continuous features, mainly to support better action learning. We need to be able to match distances to learn constraints for certain actions, e.g. "to take something you have to be close to it". The current approach isn't ideal -- we'd like to learn a range of values for how close we have to be, rather than fixing a tolerance.
I'm including a work-in-progress plan. I'm going to edit this post as that plan changes to keep things in one place for our own sanity.
Rough design
Some questions this raises:
Issues related to Gaussian match percentages
Gaussian match percentages seem simple but pose some problems, particularly with updating. We have options:
Re: estimating variance with exponential smoothing, I did some searching but didn't find anything obvious. The closest relatives I could think of were BatchNorm and the Adam optimizer. However, neither seems like a good fit. The BatchNorm paper doesn't describe any such thing; PyTorch seems to track a running mean and variance but I haven't yet found the details on how it calculates those. Meanwhile, the Adam optimizer calculates a running uncentered second moment rather than a variance/centered second moment, and I don't know of a way to calculate the centered moment from the uncentered one, so that doesn't seem helpful.
I plan to look a bit more into PyTorch's BatchNorm and also see if I can quickly find a way of getting centered variance from uncentered variance.
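For what it's worth, there is a standard exponentially weighted mean/variance update (used, e.g., in RiskMetrics-style volatility estimation), and if you track both a running first moment and a running uncentered second moment, the centered moment is recoverable as Var[X] = E[X²] − (E[X])². A hedged sketch of the incremental form (the class name is illustrative):

```python
class EwmStats:
    """Exponentially weighted running mean and variance.

    New observations get weight `alpha`; history decays geometrically.
    This is an illustration of the technique, not ADAM or PyTorch code.
    """

    def __init__(self, alpha: float = 0.1) -> None:
        self.alpha = alpha
        self.mean = 0.0
        self.variance = 0.0

    def update(self, value: float) -> None:
        delta = value - self.mean
        self.mean += self.alpha * delta
        # Blend the old variance with the new squared deviation.
        self.variance = (1.0 - self.alpha) * (self.variance + self.alpha * delta * delta)
```

This gives constant-time updates that downweight old examples, at the cost of having to choose `alpha`; Welford's algorithm avoids that choice but weights all samples equally.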
Tie-in problems with #1102
Some problems here tie in with #1102.
First, there's pattern-pattern matching. To do contrastive learning using graph matching, I think we have to either (1) intersect each perception graph with the other perception graph, or (2) intersect each pattern with the other pattern. Alternatively, we could (3) match each pattern to both perception graphs. Currently we do (1), which I worry will give wonky results with or without the new matching system, because we can't properly estimate a distribution to match from single graphs (variance is the big issue). Doing (2) would require handling continuous pattern-pattern matching properly. Doing (3) might work, but is different enough to give me a headache, so I would have to think about whether this approach makes sense.
Second, there's the issue of how to update continuous nodes in intersections. I don't think we want this to mutate the input pattern; that seems to go against the contract for intersection. We probably want to handle these similarly to a "regular" match, except it confirms the matches on a copy of the original and returns the updated pattern. That complicates #1102. However, given we've already bitten that bullet in other ways this is probably okay.
Plan
- Add to `NodePredicate` an abstract method `confirm_match(matched_with: Union[NodePredicate, PerceptionGraphNode]) -> None`. For all existing predicates this does nothing.
- Add a `ContinuousValueMatcher`. This provides the following methods:
  - `match_score(value: float) -> float` -- should be between 0 and 1 in current thinking, though I think this is mainly for conceptual convenience.
  - `similarity(matcher: ContinuousValueMatcher) -> float` -- this ranges from 0 to 1. I suspect this will be problematic to compute in general, so probably we'll just limit ourselves to same-type similarity and raise an error if you ask for something more complicated. This should be symmetric in the sense that `a.similarity(b) == b.similarity(a)` no matter what `a` and `b` you use -- maybe with some allowance for floating point errors.
  - `merged_with(matcher: ContinuousValueMatcher) -> ContinuousValueMatcher` -- again, I think we can limit ourselves to same-type merging.
  - `update_on_observation(value: float) -> None`
- Add a Gaussian `ContinuousValueMatcher` that we can use, which implements assume-it's-Gaussian matching. Use Welford's algorithm to update on single observations. Similarity is left unimplemented and raises an error, as we may not even need it given that we are ignoring the distributions when matching distribution nodes. Merging is implemented only when one of the matchers has just a single observation, and otherwise raises `NotImplementedError()`. There should be a way to merge them, but I'd have to work out the math for handling the variance merge, and since we don't need it yet, leaving it undefined for now seems OK. Some links that might be relevant for this later: math exchange answer 1, answer 2, sketchy page with similar formula.
- Add a `DistributionalContinuousNodePredicate`, which has a `label`, a `matcher`, and a `min_match_score`. This node is mutable.
  - `__call__(graph_node)` returns `self.matcher.match_score(graph_node.value) >= self.min_match_score` for continuous nodes and `False` otherwise.
  - `is_equivalent()` returns true if the other node is the same type and has the same `label`. We do not need to match distributions.
  - `matches_predicate(other)` returns true if the type matches and the labels match. Again, we do not need to match distributions.
  - `confirm_match(matched_with)` delegates to the matcher: for continuous feature perception nodes, we do `self.matcher.update_on_observation(matched_with.value)`, while for continuous feature pattern nodes we do `self.matcher = self.matcher.merged_with(matched_with.matcher)`. Note that for the current plan this means we can't actually merge with other pattern nodes.
- Add `confirm_pattern_match()` to `PerceptionGraphPatternMatch`, which mutates `matched_pattern`.
- Add `confirm_graph_match()` to `PerceptionGraphPatternMatch`, which mutates `graph_matched_against` when that happens to be a `PerceptionGraphPattern`.
- Extend `PerceptionGraphPattern.from_graph()` so that it takes a continuous matching threshold for continuous values.
- Add `_matching_threshold: float` to `AbstractTemplateLearner` and pass it to `PerceptionGraphPattern.from_graph()`.
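On the deferred variance-merge math: the parallel form of Welford's algorithm (Chan et al.'s pairwise update) gives an exact merge of two (count, mean, M2) summaries, which would let merging handle two multi-observation Gaussian matchers. A sketch (the function name is illustrative):

```python
def merge_gaussian_stats(
    count_a: int, mean_a: float, m2_a: float,
    count_b: int, mean_b: float, m2_b: float,
) -> tuple[int, float, float]:
    """Exactly merge two (count, mean, sum-of-squared-deviations) summaries.

    This is the pairwise update from the parallel variance algorithm of
    Chan, Golub, and LeVeque; M2 is the Welford sum of squared deviations.
    """
    count = count_a + count_b
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta * delta * count_a * count_b / count
    return count, mean, m2
```

Merging the summaries of [1, 2, 3] and [4, 5, 6, 7] this way reproduces the count, mean, and M2 of the combined sample [1, ..., 7] exactly, so the single-observation restriction on merging could be lifted later without changing the interface.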