kuhumcst / cstlemma

Lemmatiser for Danish, Dutch, English, German, Polish, Romanian, Russian and tens of other languages, that uses affix rules (affix: prefix, infix, suffix, circumfix). Rules are obtained by supervised learning from a full form - lemma list.
GNU General Public License v2.0

Further problems with input format #6

Open jpiitula opened 6 years ago

jpiitula commented 6 years ago

I encountered new problems with the input format option since the previous issue (many thanks for the prompt fixing of that). Briefly:

  1. tags don't seem to match dictionary entries when using an input format with -I (dictionary matching works when using the slash-separated format and the -t option)
  2. reading from stdin with an input format seems to produce a final ghost entry with an empty "word" and an empty "tag" (this is my guess; what is observed is three trailing lines in the output: an empty line, a line containing only a tab, and another empty line)
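
One way to make the three trailing lines from item 2 visible is to print the tail of the output in an unambiguous form. The following is a minimal sketch, not part of the attached test script; the helper name `show_tail` and the sample file contents are assumptions made up for illustration.

```shell
#!/bin/sh
# Print the last three lines of a file in sed's unambiguous "l" form:
# a tab is shown as \t and every line end is marked with $.
show_tail() {
    tail -n 3 "$1" | sed -n l
}

# Hypothetical output file: one result line followed by the suspected
# ghost entry (empty line, line holding only a tab, empty line).
sample=$(mktemp)
printf 'kommer\tkomma\tVB.PRS.AKT\n\n\t\n\n' > "$sample"
show_tail "$sample"
rm -f "$sample"
```

For the sample above this prints `$`, `\t$` and `$` on three lines, which is the pattern described in item 2. Output without the ghost entry would instead end in an ordinary result line.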

The same problems occur both with -I '$w/$t\n' on slash-separated tokens and with -I '$w\t$t\n' on tab-separated tokens. (I will need to use tab. The actual model has slashes in some tags.) (I saw indications that the tags or the third "word" are considered "unknown" when I played with output formats in earlier experiments but this is not in the attached logs.)
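
To illustrate why a tab separator is the safer choice when tags may contain slashes, here is a small sketch in plain shell. The token `ord/A/B` and the tag `A/B` are made up, and the sketch says nothing about how cstlemma itself splits such a token; it only shows that a slash-separated token is ambiguous while a tab-separated one is not.

```shell
#!/bin/sh
# Hypothetical token: the word is "ord" and the tag is "A/B"
# (a made-up tag that contains a slash).
slashed='ord/A/B'

# Splitting at the first slash and at the last slash give different answers.
echo "first slash: word=${slashed%%/*} tag=${slashed#*/}"
echo "last slash:  word=${slashed%/*} tag=${slashed##*/}"

# With a tab as the separator, the slash inside the tag is harmless.
split_tab() {
    # $1 = token, $2 = field number (1 = word, 2 = tag)
    printf '%s\n' "$1" | cut -f "$2"
}
tabbed=$(printf 'ord\tA/B')
echo "tab:         word=$(split_tab "$tabbed" 1) tag=$(split_tab "$tabbed" 2)"
```

Only the first-slash split and the tab split recover the intended tag `A/B`, and the first-slash split would in turn break on a word that contains a slash.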

While investigating this I also saw that the STREAM==0 version of cstlemma sends its diagnostics to stdout (when writing to a file) or nowhere (when writing to stdout). Surely it would be better to use stderr. But this is just by the way.

I attach a summary of my experiments, a shell log, the compilation script (fresh clones yesterday, compiled with and without STREAM), the test script, and an archive containing the two input files (slashed, tabbed) and the different output files (correct lemmas accompanying tags when using -t, incorrect lemmas, incorrect lemmas with the unexpected trail of empty-looking lines when using appropriate -I from file or from stdin). Hope some of these are useful. (The dictionary is from Språkbanken's sparv distribution, I'm not attaching that yet, but see test log for examples of the format.)

sum.txt log.txt test.sh.txt compile.sh.txt input-output.zip

BartJongejan commented 6 years ago

Version 7.35 solves (I think) two problems: warnings and errors are sent to stderr, not stdout, and spurious empty lines near the end of the output are suppressed.

jpiitula commented 6 years ago

Running the same compilation and test scripts with 7.35, I now get "Segmentation fault (core dumped)" from the streaming version when using the -I option. This happens with both input formats (word SLASH tag NL, word TAB tag NL) and their appropriate inputs, regardless of whether the input comes from a file or from stdin.

The default version, compiled with STREAM set to 0, does not dump core with either -I format.

So I have eight core files (5.6M or 9.6M each) but I don't know what to do with them, or what other information would be relevant. Gcc version is 4.8.2; readelf -d lists libstdc++.so.6, libm.so.6, libgcc_s.so.1, libc.so.6 as the needed shared libraries. Is there something I can check on my end?

jpiitula commented 6 years ago

I made the attached table, which compares the output file sizes from the previous version (7.34) and the new version (7.35). It seems to reveal that the newly core-dumping cases are exactly those that had the spurious trailing whitespace before. Unless I'm getting confused with the combinatorics.

I will test if the segfault happens early or late.

regress.txt

jpiitula commented 6 years ago

Ok, it (STREAM set to 1, with an -I format) appears to segfault early. The test works by sending it so much input that it would have produced some output before crashing if the crash only happened at the end:

$ seq 4000000 | xargs -I{} cat slashed | cstlemma1/cstlemma -I '$w/$t\n' -d models/dict0 -f empty 2> /dev/null
xargs: cat: terminated by signal 13
Segmentation fault (core dumped)

With input format specified as -t it produces the expected output:

$ seq 2 | xargs -I{} cat slashed | cstlemma1/cstlemma -t -d models/dict0 -f empty 2> /dev/null
kommer  komma   VB.PRS.AKT
kommer  kommer  UO
kommer  komma   VB.PRS.AKT
kommer  kommer  UO

But I cannot use -t because some of the tags contain slashes. I wish to use tabs.

BartJongejan commented 6 years ago

There is a new version on GitHub.

jpiitula commented 6 years ago

Thanks. All my test combinations now run to completion. All combinations that use -t produce the correct result. Those that use an -I format still produce the incorrect result (they fail to recognize that the word form with that tag is in the dictionary). The trailing whitespace is gone, as you promised.

BartJongejan commented 6 years ago

I am glad to know cstlemma works better now. If the lemmatizer must take PoS-tags into account, then use -t, even if there is also an -I input format that specifies that there are PoS-tags in the input. Without -t, the lemmatizer defaults to -t- and ignores PoS-information in the input. (If the PoS-tagger isn't very good, ignoring the PoS-tags may in fact give better lemmatization results.)
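
As a sketch of that combination for tab-separated input, using only the options that appear earlier in this thread (-t, -I, -d, -f): the wrapper name `lemmatise_tagged` and the `CSTLEMMA` override variable are hypothetical, introduced here only so the command line can be shown in one place.

```shell
#!/bin/sh
# Lemmatise "word<TAB>tag" lines read from stdin, taking the tags into
# account. -I only describes the input layout; -t is what makes the
# lemmatiser use the tags.
# $1 and $2 are the arguments passed to -d and -f respectively.
# CSTLEMMA is a hypothetical override for the path to the binary.
lemmatise_tagged() {
    "${CSTLEMMA:-cstlemma}" -t -I '$w\t$t\n' -d "$1" -f "$2"
}
```

With the files used in the tests above, a call could look like `lemmatise_tagged models/dict0 empty < tabbed`.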

jpiitula commented 6 years ago

Aha, I had misunderstood -t. Sorry about that. With the recent fixes, and with -t added to those test cases that use -I, all my current tests work now, including the cases that I most need to work.

So I think this issue is solved. Thank you.