microsoft / prose

Microsoft Program Synthesis using Examples SDK is a framework of technologies for the automatic generation of programs from input-output examples. This repo includes samples and sample data for the Microsoft Program Synthesis using Example SDK.
https://microsoft.github.io/prose/
Other
619 stars 100 forks source link

Using Filter with Kth #61

Open zachwood0s opened 3 years ago

zachwood0s commented 3 years ago

Hi PROSE team! I've been working on creating a DSL for web scraping and it currently supports various tree/list operations to find data in a DOM tree. For selecting/filtering elements in my "node lists" I'm currently using the built-in functions Kth and Filter. My current DSL is shown below:

@input ProseHtmlNode tree;

@start IReadOnlyList<ProseHtmlNode> program := rule;

IReadOnlyList<ProseHtmlNode> rule :=
    := Concat(rule, rule)
     | MatchNodes(match, nodes) = Filter(\x: ProseHtmlNode => match, nodes)
     | nodes

IReadOnlyList<ProseHtmlNode> nodes 
    := Children(subTree)
         | Descendants(subTree)
     | Single(subTree)

ProseHtmlNode subTree 
    := tree
     | SelectChild(rule, k) = Kth(rule, k)

bool match 
    := MatchTag(x, tag)
     | MatchAttribute(x, attr)
     | True()

This is then able to generate programs like Single(SelectChild(Descendents(tree), 1)) and MatchNodes(MatchTag(x, "div"), Descendants(tree)) but it fails when trying to select the Kth element out of a filtered list. For something like "Give the first node with the tag of 'div'" I'm hoping to generate something like: Single(SelectChild(MatchNodes(MatchTag(x, "div"), Descendants(tree)), 0))

I currently don't have witness functions for either SelectChild or MatchNodes as I wasn't sure if I needed to write them for built-in functions (there was some mention that the framework could more aggressively optimize the built-ins so I didn't want to interfere with that). I have noticed that it gets close when calling the witness function for MatchTag but the incoming args are never really satisfiable by the predicate. For example, if I was searching for the first "div" node, Filter will clearly be attempting to filter the correct list that contains that "div" node, but MatchNode will get examples like:

("div", true),
("exclude", false),
("div", false),
("div", false)

Which, with the way I have MatchTag implemented, won't contain a single tag that satisfies all the examples. It seems like for this to work, MatchTag would need to receive the examples:

("div", true),
("exclude", false),
("div", true),
("div", true)

and have SelectChild select only the first of the filtered list, but I'm not sure what I need to change to make that happen.

The tree synthesis portion of the repo can be found here if you'd like to take a look: https://github.com/zachwood0s/dom_program_synthesis/tree/main/ProseTutorial/tree_synthesis

Any help or suggestions on how to get this interaction to work would be greatly appreciated!

danpere commented 3 years ago

Thank you for your detailed question. I have not gotten a chance to look at your code in detail yet; I'll respond in more detail once I do.

You are correct that you should not need to define witness functions for uses of built-ins like Kth and Filter. The framework provides witness functions for them for some Spec types. Specifically, I see for Kth, ExampleSpec and DisjunctiveExamplesSpec witness functions and for Filter, SubsequenceSpec, PrefixSpec, and ExampleSpec (but no DisjunctiveExamplesSpec... I assume there's some reason for that, but I'm not sure what it is).

Also, for debugging such issues, if you haven't already been doing so, it helps a lot to look at the synthesis log. When constructing a SynthesisEngine.Config you can use the LogListener field to pass in a LogListener object and call .SaveLogToXML() after the learn to get an XML log of all of the witness functions with the arguments they were called with and their return values.

zachwood0s commented 3 years ago

Thanks for the quick response! I did not know about the LogListener but that will definitely come in handy when debugging! Also, I just realized that my latest commit in my witness functions includes an empty witness function for MatchTag on parameter 0. I was using that mostly for testing to see what values it would try so you can ignore that.

On another note, is there a way to view the .XML log that is generated so that it fills in the ## references with the actual value at the top?

zachwood0s commented 3 years ago

I added in the LogListener and thought the logs that were generated could be helpful. I noticed that with the current DSL, Kth does attempt to use a DisjunctiveSubsequenceSpec but it looks like the only MatchNodes that it attempts to learn is MatchNodes(True(), Descendants(tree)). I'm assuming this is because MatchTag fails to select a tag parameter that works (similar to what I was saying in my initial post).

The log that I'm pulling this from can be found here: https://github.com/zachwood0s/dom_program_synthesis/blob/main/synthesis_log_match_true.xml Line 2157 seems to be the beginning of the Kth operations. Line 2247 seems to be where it attempts to learn the MatchNodes rule again but only shows True() as the selected predicate.

Let me know if you need any more information!

danpere commented 3 years ago

No, the ## references are separate in part to avoid problems with circular references and the log getting too big due to writing out the same value many times. You could try writing your own script to copy them, but it's probably easier to just have a couple separate windows open viewing the same XML file. Sorry, I know those logs are awkward to read, but it's somewhat unavoidable because they are very information-dense.

Also, I'm sorry, but I'm very busy at the moment, and while I can answer questions, I will not have time to actually take an in-depth look at your code this week.

zachwood0s commented 3 years ago

No, the ## references are separate in part to avoid problems with circular references and the log getting too big due to writing out the same value many times.

Ah ok that makes sense, I'll just pull up multiple windows.

Also, I'm sorry, but I'm very busy at the moment, and while I can answer questions, I will not have time to actually take an in-depth look at your code this week.

That's perfectly fine! I hope you don't mind if I continue to bounce questions off of you while I attempt to solve this though.

I've been digging into the log more and it looks like this may be caused by Filter or the lambda expression not accepting a DisjunctiveExamplesSpec but I'm not entirely sure. It appears to start down the correct path (using Kth) but doesn't attempt to learn any of the predicates other than True(). This can be seen in the log below. This is where it starts the KthDisjunctiveLearner:

<UsePlan elapsed="00:00:00.0161113" verify="false" hasDependencies="true" plan="[&quot;KthDisjunctiveLearner.WitnessSequence&quot;, &quot;KthDisjunctiveLearner.WitnessK&quot;]">
  <Witness index="0" elapsed="00:00:00.0029815" param="rule" function="KthDisjunctiveLearner.WitnessSequence">
    <witness>
      <DisjunctiveSubsequenceSpec examples="2">
        ...
      </DisjunctiveSubsequenceSpec>

And here is where it attempts to learn MatchNodes inside of the KthDisjunctiveLearner but it looks like the _LFun0 learning doesn't learn the predicate like I'd expect it to:

<LearnRule elapsed="00:00:00.0027615" rule="MatchNodes" programCount="1">
  <result>
    <Direct size="1" symbol="rule">
      <Program i="0"><![CDATA[MatchNodes(True(), Descendants(tree))]]></Program>
    </Direct>
  </result>
  <UsePlan elapsed="00:00:00.0005979" verify="true" hasDependencies="false" plan="[&quot;GrammarRule.WitnessAll&quot;, &quot;GrammarRule.WitnessAll&quot;]">
    <Witness index="0" elapsed="00:00:00.0000027" param="_LFun0" function="GrammarRule.WitnessAll">
      <witness>
        <TopSpec examples="0" />
      </witness>
    </Witness>
    <Learn symbol="_LFun0" k="all" elapsed="00:00:00.0000184" randomK="null" retrievedFromCache="false" depth="6">
      <result>
        <Join rule="_LFun0 := \x =&gt; match" size="1">
          <Param name="match">
            <Join rule="True" size="1" />
          </Param>
        </Join>
      </result>
      <spec>
        <TopSpec examples="0" />
      </spec>
    </Learn>
    <Witness index="1" elapsed="00:00:00.0000023" param="nodes" function="GrammarRule.WitnessAll">
      <witness>
        <TopSpec examples="0" />
      </witness>
    </Witness>
    <Learn symbol="nodes" k="all" elapsed="00:00:00.0005651" randomK="null" retrievedFromCache="false" depth="6">
      <result>
        <Direct size="3" symbol="nodes">
          <Program i="0"><![CDATA[Children(tree)]]></Program>
          <Program i="1"><![CDATA[Descendants(tree)]]></Program>
          <Program i="2"><![CDATA[Single(tree)]]></Program>
        </Direct>
      </result>
      <spec>
        <TopSpec examples="0" />
      </spec>
    </Learn>
  </UsePlan>
</LearnRule>

I looked at the log for when MatchTag does succeed and it looks like the FilterWitnessingPlanExamples.WitnessSet gets called in the successful cases but not in the case above.

<LearnRule programCount="1" elapsed="00:00:00.1012893" rule="MatchNodes">
   ...
  <UsePlan verify="false" elapsed="00:00:00.1003024" plan="[&quot;FilterWitnessingPlanExamples.WitnessPredicate&quot;, &quot;FilterWitnessingPlanExamples.WitnessSet&quot;]" hasDependencies="true">
     <Witness index="1" elapsed="00:00:00.0034527" param="nodes" function="FilterWitnessingPlanExamples.WitnessSet">
             ... And so on

Do you think this is caused by Filter or lambda functions not supporting the DisjunctiveExampleSpec?