jeisner / treebank-scripts

Suite of scripts for preprocessing the Penn Treebank, primarily to extract lexical subcategorization frames and dependencies.
MIT License

inefficiency in predict.inc #4

Open jeisner opened 8 years ago

jeisner commented 8 years ago

[item from the old TO-DO file dated 2002-04-07]

In predict.inc, many substitutions replace a pattern with itself solely to count the number of matches. That is inefficient, because each match is rewritten in place just to obtain a count. Instead, try one of these, e.g.

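   # Alternative 1: match in list context, then count the elements returned
   $count = () = (m/$string/g);
   # Alternative 2: bump a counter from an embedded code block (loop so that //g visits every match)
   $count=0; 1 while m/$string(?{ ++$count })/g;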

(see comp.lang.perl.moderated thread, "Counting matches in a regex")
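
As a rough illustration of the idea above, here is a minimal sketch comparing the substitute-with-itself approach against two match-only ways of counting. The sub names, the sample text, and the pattern are hypothetical and are not taken from predict.inc; this only shows that the counts agree on a simple pattern without capture groups.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; none of these names come from predict.inc.

# Baseline: replace each match with itself and use the return value of s///g,
# which is the number of substitutions made (or an empty string if none).
sub count_by_substitution {
    my ($text, $re) = @_;
    my $n = ($text =~ s/$re/$&/g);
    return $n || 0;
}

# Match in list context and assign to the empty list; the outer scalar
# assignment receives the number of elements the match returned.
sub count_by_list {
    my ($text, $re) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

# Scalar-context //g in a loop: one iteration per match, no list is built.
sub count_by_loop {
    my ($text, $re) = @_;
    my $n = 0;
    $n++ while $text =~ /$re/g;
    return $n;
}

my $sample = "the cat sat on the mat";
my $pat    = qr/at/;
print join(" ", count_by_substitution($sample, $pat),
                count_by_list($sample, $pat),
                count_by_loop($sample, $pat)), "\n";   # prints "3 3 3"
```

One caveat: if the pattern contains capture groups, a list-context match returns the captured substrings rather than whole matches, so the list-based count can exceed the number of matches. The loop form does not have that problem.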