rug-compling / Alpino

Alpino parser and related tools for Dutch
GNU Lesser General Public License v2.1

</sentence> sentid="" output when assume_input_is_tokenized=on #7

Open oktaal opened 3 years ago


When I modify the Makefile.start_server script

https://github.com/rug-compling/Alpino/blob/7a2ea6e2d8f7ae320a021b9e4ed1131a69d5a5a5/Makefile.start_server#L10

and change assume_input_is_tokenized=off to assume_input_is_tokenized=on, the output becomes malformed.

For example:

$ make -f Makefile.start_server 
PROLOGMAXSIZE=1500M /opt/Alpino-git233/bin/Alpino -notk -veryfast user_max=20000\
            server_kind=parse\
            server_port=42424\
            assume_input_is_tokenized=on\
            debug=1\
            -init_dict_p\
            batch_command=alpino_server\
        2> /alpino_server.log &

$ telnet localhost 42424
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hallo wereld .
top/top|top/hd|hallo/[0,1]|127.0.0.1
hallo/[0,1]|tag/nucl|wereld/[1,2]|127.0.0.1
/[2,3]|127.0.0.1app|.
<?xml version="1.0" encoding="UTF-8"?>
<alpino_ds version="1.6">
  <parser build="Alpino-x86_64-linux-glibc2.5-git233-sicstus" date="2021-02-04T16:52" cats="1" skips="0" />
  <node begin="0" cat="top" end="3" id="0" rel="top">
    <node begin="0" cat="du" end="3" id="1" rel="--">
      <node begin="0" end="1" frame="tag" his="normal" his_1="normal" id="2" lcat="advp" lemma="hallo" pos="tag" postag="TSW()" pt="tsw" rel="tag" root="hallo" sense="hallo" word="hallo"/>
      <node begin="1" cat="np" end="3" id="3" rel="nucl">
        <node begin="1" end="2" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" his="normal" his_1="normal" id="4" lcat="np" lemma="wereld" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" rnum="sg" root="wereld" sense="wereld" word="wereld"/>
"/>pecial="hoofd" word=".m" positie="vrij" postag="TW(hoofd,vrij)" pt="tw" rel="app" root=".ssion" id="5" infl="both" lcat="detp" lemma=".
      </node>
    </node>
  </node>
</sentence> sentid="127.0.0.1">hallo wereld .
</alpino_ds>
Connection closed by foreign host.

Keeping assume_input_is_tokenized set to off does give a correctly formatted sentence element: <sentence sentid="127.0.0.1">hallo wereld .</sentence>.

I have to implement a workaround here anyway to support older Alpino versions, so this isn't an issue for me. But I was wondering whether there is some setting I'm missing that would prevent this from happening. I couldn't figure out where in the Alpino code this goes wrong.
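
For anyone who wants to reproduce this without telnet, here is a minimal Python sketch of a client for the server started above. It assumes, based only on the transcript, that the server reads one line per connection and closes the connection after replying. It sends a bare LF, whereas telnet sends CR LF. Whether a trailing carriage return on the last token is what makes the output look malformed is an untested guess, not something confirmed in this thread. `build_request` and `parse_sentence` are hypothetical helper names, not part of Alpino.

```python
import socket


def build_request(sentence: str) -> bytes:
    """Encode one pre-tokenized sentence as a single LF-terminated line.

    str.split() with no arguments splits on any whitespace, so a stray
    '\r' (for example from CR LF input) cannot end up inside a token.
    """
    tokens = sentence.split()
    return (" ".join(tokens) + "\n").encode("utf-8")


def parse_sentence(sentence: str, host: str = "localhost", port: int = 42424) -> str:
    """Send one sentence to a running Alpino server and return the raw reply.

    Assumption: the server closes the connection when it is done, as in
    the telnet transcript, so we simply read until end of stream.
    """
    with socket.create_connection((host, port)) as conn:
        conn.sendall(build_request(sentence))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```

With the server from the Makefile running, `print(parse_sentence("hallo wereld ."))` should print the server's reply. With debug=1 the reply may contain non-XML lines before the `<?xml ...?>` declaration, as it does in the telnet session.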