jeisner / treebank-scripts

Suite of scripts for preprocessing the Penn Treebank, primarily to extract lexical subcategorization frames and dependencies.
MIT License

write "liftheads" and "listdeps" scripts #1

Open jeisner opened 8 years ago

jeisner commented 8 years ago

[item from the old TO-DO file dated 2002-04-07]

Write a liftheads script that turns

    (S (NP @(NNP @John)) @(VP @(VP @(VBZ @likes) (NP @(NNP @Mary))) (RB @tremendously)))

into

     (S|likes (NP|John @(NNP|John @)) @(VP|likes @(VP|likes @(VBZ|likes @) (NP|Mary @(NNP|Mary @))) (RB|tremendously @)))
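The transformation above could be sketched roughly as follows. This is only an illustration in Python, not one of the existing scripts: the names `parse`, `head_word`, and `liftheads` are made up here, and the sketch assumes that every constituent has exactly one `@`-marked head child and that every preterminal dominates a single word.

```python
import re

# A token is "(", ")", or a run of characters containing no whitespace or parentheses.
TOKEN = re.compile(r"[()]|[^\s()]+")


def parse(text):
    """Parse a head-marked bracketing into nested (is_head, label, kids) tuples.

    Each kid is either another such tuple or a plain word string.
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def subtree(is_head):
        nonlocal pos
        pos += 1                          # skip "("
        label = tokens[pos]
        pos += 1
        kids = []
        while tokens[pos] != ")":
            tok = tokens[pos]
            if tok == "(":
                kids.append(subtree(False))
            elif tok == "@" and tokens[pos + 1] == "(":
                pos += 1                  # skip the "@" head marker
                kids.append(subtree(True))
            else:
                kids.append(tok.lstrip("@"))  # a word; drop its head marker
                pos += 1
        pos += 1                          # skip ")"
        return (is_head, label, kids)

    return subtree(False)


def head_word(node):
    """Follow head-marked kids down to the word that lexically heads this node."""
    for kid in node[2]:
        if isinstance(kid, str):
            return kid                    # preterminal: its word is its head
        if kid[0]:
            return head_word(kid)
    raise ValueError("constituent without a head-marked child: " + node[1])


def liftheads(node):
    """Render a node with its head word attached to every label."""
    is_head, label, kids = node
    parts = []
    for kid in kids:
        if isinstance(kid, str):
            parts.append("@")             # the word moved up; only the marker stays
        else:
            parts.append(liftheads(kid))
    mark = "@" if is_head else ""
    return f"{mark}({label}|{head_word(node)} {' '.join(parts)})"
```

Calling `liftheads(parse(...))` on the first bracketing above should reproduce the second one. Recomputing `head_word` at every node is wasteful on deep trees; a real script would probably pass the head up during a single traversal.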

Notice that we could then eliminate the -w option on flatten and use liftheads as a postprocessing step instead. Anyway, then write a listdeps script that takes the output of liftheads (flattened or not!) and lists the dependencies:

      S|likes   NP|John    <--- instead of mentioning parent S|likes, maybe mention sister VP|likes?  Or both?  But in the flattened case, the sister doesn't have a nonterminal type.
      VP|likes  NP|Mary
      VP|likes  RB|tremendously

the idea being that we list each type with the types of its non-head kids. Come to think of it, this would work even without having run liftheads first; it's just that these nonlexicalized dependencies wouldn't be as interesting! Options could mark each dependent with information about its position. A better format might be

         likes  S   John    NP

but we could get that by postprocessing. Anyway, the point of this output is to let us cluster words by the fillers of their dependent roles, or by the dependent roles they fill for particular other words.
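A rough Python sketch of listdeps, plus the postprocessing step into the head-first format, might look like this. The function names `listdeps` and `head_first` are hypothetical, and the sketch only handles the unflattened bracketing shown above, since the flattened format is not spelled out here. It also assumes that labels and words contain no `|` of their own, and it uses tabs as the column separator, which is an arbitrary choice.

```python
import re

# A token is "(", ")", or a run of characters containing no whitespace or parentheses.
TOKEN = re.compile(r"[()]|[^\s()]+")


def listdeps(text):
    """Return (parent_label, dependent_label) pairs, one per non-head kid."""
    tokens = TOKEN.findall(text)
    pos = 0
    deps = []

    def subtree():
        nonlocal pos
        pos += 1                          # skip "("
        label = tokens[pos]
        pos += 1
        while tokens[pos] != ")":
            tok = tokens[pos]
            if tok == "(":
                kid_label = subtree()     # non-head kid: recurse, then record it
                deps.append((label, kid_label))
            elif tok == "@" and tokens[pos + 1] == "(":
                pos += 1                  # skip the "@" head marker
                subtree()                 # head kid: recurse, but list no dependency
            else:
                pos += 1                  # bare "@" placeholder or a word: skip it
        pos += 1                          # skip ")"
        return label

    subtree()
    return deps


def head_first(parent, dependent):
    """Rewrite a pair like ("S|likes", "NP|John") as 'likes S John NP' (tab-separated)."""
    p_type, p_word = parent.split("|", 1)
    d_type, d_word = dependent.split("|", 1)
    return "\t".join([p_word, p_type, d_word, d_type])
```

On the liftheads output above, `listdeps` should yield the three pairs listed earlier, in the same order. Because words and bare `@` markers are simply skipped, the same function also runs on input that has not been through liftheads, giving the nonlexicalized pairs `S NP`, `VP NP`, and `VP RB`; `head_first` only makes sense on the lexicalized pairs.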