neubig / egret

A fork of the Egret parser that fixes a few bugs

Weird characters in grammar files? #2

Open zezke opened 8 years ago

zezke commented 8 years ago

When I compile egret and run the following command:

$ ./egret -lapcfg -i=testeng.txt -data=eng_grammar

I get this output:

( (NP^g (NP^g (NN story)) (PP^g (IN of) (NP^g (NN man)))))
( (NP^g (NP^g (DT the) (NN story)) (PP^g (IN of) (NP^g (NN man)))))
( (S^g (NP^g (NP^g (DT the) (NN story)) (PP^g (IN of) (NP^g (NN man)))) (VP^g (VBZ bites) (NP^g (NN dog)))))
( (S^g (NP^g (NP^g (DT the) (NN man)) (PP^g (IN of) (NP^g (NP^g (DT the) (NN story)) (PP^g (IN of) (NP^g (NN man)))))) (VP^g (VBZ bites) (NP^g (NN dog)))))
( (S^g (NP^g (DT the) (NN man)) (VP^g (@VP^g (VBZ bites) (NP^g (NP^g (DT the) (NN story)) (PP^g (IN of) (NP^g (NN man))))) (PP^g (IN like) (NP^g (NN dog))))))
( (S^g (NP^g (DT the) (NN dog)) (VP^g (VBZ bites) (NP^g (NP^g (DT the) (NN bone)) (PP^g (IN of) (NP^g (DT a) (NN man)))))))
( (NP^g (NP^g (DT the) (NN dog)) (PP^g (IN like) (NP^g (NP^g (DT the) (NN bone)) (PP^g (IN of) (NP^g (DT a) (NN man)))))))
( (NP^g (NP^g (DT the) (NN man)) (PP^g (IN like) (NP^g (DT a) (NN dog)))))
( (S^g (NP^g (DT the) (NN man)) (VP^g (VBZ bites) (NP^g (DT a) (NN dog)))))
( (S^g (NP^g (DT a) (NN man)) (VP^g (@VP^g (VBZ gives) (NP^g (DT the) (NN dog))) (NP^g (DT a) (NN bone)))))
all time:18.6939s
rule loading time:17.2935s

init binary rule time:0.155634s
init unary rule time:0.079887s

middle binary rule time:0.256627s
middle unary rule time0.384605s

final binary rule time:0.036625s
final unary rule time0.160108s

set unary node time:0s
query time:0s
inside binart rule ti

As you can see, there are extra @ and ^g characters in the phrase structure output. I believe they originate from the English grammar files. Here is an excerpt to show what I mean:

VP^g_0 -> VBP_0 ADVP^g_0 8.071531778304136E-4
VP^g_0 -> VBP_0 FRAG^g_0 6.974824753857925E-6
VP^g_0 -> VBP_0 NP^g_0 0.0129917721169252
VP^g_0 -> VBP_0 PP^g_0 0.0038893146592574646
VP^g_0 -> VBP_0 PRN^g_0 1.3731370047197033E-5
VP^g_0 -> VBP_0 PRT^g_0 1.3078595240373827E-4
VP^g_0 -> VBP_0 RB_0 1.4833211062545537E-4
VP^g_0 -> VBP_0 SBAR^g_0 0.007899304990428138
VP^g_0 -> VBP_0 SINV^g_0 6.673484317759013E-6
VP^g_0 -> VBP_0 S^g_0 0.005709293376162579
VP^g_0 -> VBP_0 UCP^g_0 1.0074575609880492E-4
VP^g_0 -> VBP_0 VP^g_0 0.02150708714627632
VP^g_0 -> VBZ_0 0.015932094778774906

The grammar file itself already contains the extra characters. Could this be an error in the uploaded files?
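One way to check how widespread the annotations are in a grammar file is to count them. Below is a minimal Python sketch, assuming every rule line has the form seen in the excerpt (left-hand side, `->`, one or two right-hand symbols, then a probability) and that the trailing `_0` is a substate index. The function name `annotated_symbols` and the sample list are illustrative only and are not part of Egret.

```python
import re
from collections import Counter

# Assumed rule format, taken from the excerpt above:
#   "LHS -> RHS1 [RHS2] probability"
RULE = re.compile(r"^(\S+) -> (.+) (\S+)$")

def annotated_symbols(lines):
    """Count symbol occurrences with an '@' prefix or a '^g' suffix."""
    counts = Counter()
    for line in lines:
        match = RULE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a rule
        symbols = [match.group(1)] + match.group(2).split()
        for sym in symbols:
            # Assumption: the part after the last '_' is a substate index.
            base = sym.rsplit("_", 1)[0]
            if base.startswith("@"):
                counts["@"] += 1
            if base.endswith("^g"):
                counts["^g"] += 1
    return counts

sample = [
    "VP^g_0 -> VBP_0 NP^g_0 0.0129917721169252",
    "VP^g_0 -> VBZ_0 0.015932094778774906",
]
print(annotated_symbols(sample))
```

To scan a real file, pass an open file object instead of `sample`, since the function only iterates over lines.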

neubig commented 8 years ago

Thanks for the report! I believe this is the same as the original version of Egret (https://sites.google.com/site/zhangh1982/egret), so please check with the original author.

(In reply to the report above, sent by Bram Vandewalle on Fri, Oct 2, 2015 at 4:25 AM.)

zezke commented 8 years ago

I've contacted the original author and will update this issue if any progress is made.
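In the meantime, the parser output can be cleaned up in a post-processing step. The following is a minimal Python sketch, not a confirmed fix. It rests on two assumptions that neither the Egret documentation nor the author has confirmed here: that `^g` is a label suffix that can simply be dropped, and that `@`-prefixed nodes (such as `@VP^g`) are intermediate nodes whose children can be attached directly to the parent. All function names are hypothetical.

```python
import re

def read_node(tokens, pos):
    """Parse the bracketed node starting at tokens[pos] == '('.

    Returns ((label, children), next_pos). The outermost wrapper in the
    parser output has an empty label.
    """
    pos += 1
    label = ""
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = read_node(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # a word
            pos += 1
    return (label, children), pos + 1

def clean(node):
    """Return a list of cleaned nodes that replace `node`."""
    if isinstance(node, str):
        return [node]
    label, children = node
    kept = []
    for child in children:
        kept.extend(clean(child))
    if label.startswith("@"):
        return kept  # assumption: splice intermediate node into its parent
    if label.endswith("^g"):
        label = label[:-2]  # assumption: '^g' carries no needed information
    return [(label, kept)]

def show(node):
    """Render a node back into bracketed form."""
    if isinstance(node, str):
        return node
    label, children = node
    return "(" + label + " " + " ".join(show(c) for c in children) + ")"

def strip_annotations(line):
    tokens = re.findall(r"\(|\)|[^\s()]+", line)
    tree, _ = read_node(tokens, 0)
    return show(clean(tree)[0])

line = ("( (S^g (NP^g (DT a) (NN man)) (VP^g (@VP^g (VBZ gives) "
        "(NP^g (DT the) (NN dog))) (NP^g (DT a) (NN bone)))))")
print(strip_annotations(line))
```

Only lines containing a bracketed tree should be passed to `strip_annotations`; the timing lines at the end of the output would need to be filtered out first.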