rust-lang / gll

GLL parsing framework.
Apache License 2.0
138 stars 25 forks source link

Disambiguation based on PartialOrd between choices of a certain node type. #42

Open eddyb opened 6 years ago

eddyb commented 6 years ago

@eternaleye suggested that instead of having only "validators", that remove "invalid" choices from the SPPF, a more general approach involves PartialOrd, or more specifically, a function taking two Handle<X> and returning an Option<Ordering>, where disambiguation means "find the largest X".

When the ordering between two choices is equal, that means it's still ambiguous (i.e. the default for node types with no user-provided disambiguation logic), while None means neither choice is valid.

Validity can be embedded by mapping all possible pairs of validities to partial orderings:

impl<T: Validate> Disambiguate for T {
    fn disambiguate(a: Handle<Self>, b: Handle<Self>) -> Option<Ordering> {
        match (a.validate(), b.validate()) {
            (false, false) => None,
            (false, true) => Some(Ordering::Less),
            (true, false) => Some(Ordering::Greater),
            (true, true) => Some(Ordering::Equal),
        }
    }
}
rljacobson commented 4 years ago

I think you might have to project into a partial order from a relation that lacks transitivity, because I feel like a rock-paper-scissors relationship is likely. The ideal properties would be a (upper-)semilattice, which has the additional property that for any two elements x and y, there exists a z such that z > x and z > y. (So z is a l.u.b. of {x,y}.)

If the goal is just to select a likely parse rather than the most likely parse, there probably isn't any harm in extending the partial order arbitrarily to a total order if it's convenient to do so. But that's somehow unsatisfying, isn't it...

A copy+paste of my note to Eddy: I am interested to know how a probabilistic model would perform. So you have a set of parse trees and want to find the "correct"/intended parse tree. Ahead of time you gather statistics about how language structures relate to each other locally by analyzing, say, all of crates.io. Check each parse tree against the statistical model to find the most likely. What I'm describing is syntactic, but it could in principle include other attributes like type, scope, some complexity metric, proximity... And your notion of validators is complementary, not mutually exclusive. In ML we tend to just dump all the factors into one big bucket and study the principle components of the resulting space. We can then remove any factors that appear useless or project to a lower-dimensional space. But, I mean, that's something you'd do, if you do it at all, only after you have a framework to implement any of this.