jeisner / treebank-scripts

Suite of scripts for preprocessing the Penn Treebank, primarily to extract lexical subcategorization frames and dependencies.
MIT License

check Treebank for suspicious configurations #16

Open jeisner opened 8 years ago

jeisner commented 8 years ago

[item from the old TO-DO file dated 2002-04-07]

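Check for other things [in the Treebank] that look like bugs. For example, I saved the following snippet, which detects indexed traces that aren't bound: for each indexed trace it looks for a constituent on the same input line that carries the same index, and prints not_found when there is none.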

[1/4/98.  Find all the traces in Treebank II, their nonterminal
categories, and the nonterminal categories they end up as.  See
~/tmp/move-categories and ~/tmp/move-categories-summary.]

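~/hw/learn/02-subcat-study/extract/oneline -n ~/info/wsj/*/* | perl5 -e '
  $ind = "-[0-9]+\\b";                       # coindexation suffix such as -1
  $tokennoind = "(?:(?!$ind)[^ \t\n()])+";   # a token, stopping before any index
  while (<>) {
    s/^(\S+:[0-9]+:\t)?//, $location = $&;   # strip and remember the location prefix
    # for each indexed trace: $1 = parent category, $2 = trace type, $3 = index
    while (/\(($tokennoind)(?:$ind)? \(-NONE- ($tokennoind)($ind)/og) {
      print "$location$2 $1 ";
      # look for a constituent on the same line that carries this index
      if (/\(($tokennoind)$3 /) { print "$1\n" } else { print "not_found\n" }
    }
  }' | sort -k 2 | uniq -f 1 -c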
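
For anyone who finds the Perl hard to read, here is a rough Python sketch of the same check. It is a port of the regular expressions in the pipeline above, not part of the repository; the names `trace_bindings` and `summarize` are made up for illustration, and it assumes each input string is one complete bracketed tree, as `oneline` appears to produce. The location prefix is ignored here, and a `Counter` stands in for `sort | uniq -c`.

```python
import re
from collections import Counter

# Patterns ported from the Perl snippet (an illustrative port, not the original code).
IND = r"-[0-9]+\b"                                 # coindexation suffix such as -1
TOKEN_NO_IND = rf"(?:(?!{IND})[^ \t\n()])+"        # a token, stopping before any index
TRACE = re.compile(
    rf"\(({TOKEN_NO_IND})(?:{IND})? \(-NONE- ({TOKEN_NO_IND})({IND})"
)


def trace_bindings(tree):
    """Yield (trace_type, parent_category, binder_category) for each indexed trace.

    binder_category is "not_found" when no constituent in the tree carries
    the index of the trace.
    """
    for match in TRACE.finditer(tree):
        parent, trace_type, index = match.groups()
        binder = re.search(rf"\(({TOKEN_NO_IND}){re.escape(index)} ", tree)
        binder_category = binder.group(1) if binder else "not_found"
        yield trace_type, parent, binder_category


def summarize(trees):
    """Count each distinct (trace_type, parent, binder) triple across all trees."""
    return Counter(triple for tree in trees for triple in trace_bindings(tree))


if __name__ == "__main__":
    examples = [
        "(S (NP-SBJ-1 (NNP John)) (VP (VBD was) (VP (VBN seen) (NP (-NONE- *-1)))))",
        "(S (NP-SBJ (-NONE- *-2)) (VP (VBZ runs)))",
    ]
    for triple, count in sorted(summarize(examples).items()):
        print(count, *triple)
```

In the second example tree nothing carries index -2, so the trace is reported with the binder category not_found; those are the rows to inspect as possible annotation bugs.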