Open jpiitula opened 6 years ago
Version 7.35 solves (I think) two problems: warnings and errors are sent to stderr, not stdout, and spurious empty lines near the end of the output are suppressed.
Running the same compilation and test scripts with 7.35, I now get a segmentation fault (core dumped) from the streaming version when using the -I option. This happens with both input formats (word SLASH tag NL, word TAB tag NL) and their appropriate inputs, regardless of whether the input comes from a file or from stdin.
The default version, compiled with STREAM set to 0, does not dump core with either -I format.
So I have eight core files (5.6M or 9.6M each) but I don't know what to do with them, or what other information would be relevant. Gcc version is 4.8.2; readelf -d lists libstdc++.so.6, libm.so.6, libgcc_s.so.1, libc.so.6 as the needed shared libraries. Is there something I can check on my end?
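One thing that can be checked locally is a stack backtrace from one of the core files. The following is a minimal sketch, assuming gdb is installed. `core_backtrace` is a hypothetical helper name, and the core file name is an assumption, since it depends on the system's core pattern. A binary compiled with -g should give more readable frames.

```shell
# Hypothetical helper: print a backtrace from a core file with gdb.
# Usage: core_backtrace path/to/binary path/to/core
core_backtrace() {
    if [ ! -r "$2" ]; then
        echo "no readable core file: $2" >&2
        return 1
    fi
    # -batch: run the given commands and exit
    # -ex bt: print the call stack at the point of the crash
    gdb -batch -ex bt "$1" "$2"
}

# Example (the core file name is an assumption; it varies by system):
# core_backtrace cstlemma1/cstlemma core
```

The top few frames of the backtrace are usually the most useful part to paste into an issue.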
I made the attached table, which compares the output file sizes from the previous version (7.34) and the new version (7.35). It seems to show that the newly core-dumping cases are exactly those that previously had the spurious trail of space, unless I'm getting confused by the combinatorics.
I will test if the segfault happens early or late.
OK, it (STREAM set to 1, with an -I format) appears to segfault early. I tested this by sending it so much input that it would have produced some output before crashing if the crash happened only at the end:
$ seq 4000000 | xargs -I{} cat slashed | cstlemma1/cstlemma -I '$w/$t\n' -d models/dict0 -f empty 2> /dev/null
xargs: cat: terminated by signal 13
Segmentation fault (core dumped)
With input format specified as -t it produces the expected output:
$ seq 2 | xargs -I{} cat slashed | cstlemma1/cstlemma -t -d models/dict0 -f empty 2> /dev/null
kommer komma VB.PRS.AKT
kommer kommer UO
kommer komma VB.PRS.AKT
kommer kommer UO
But I cannot use -t because some of the tags contain slashes. I wish to use tabs.
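The early-versus-late check above can be wrapped in a small helper that reports the exit status together with the amount of output written before the crash. This is only a sketch: `crash_timing` is a hypothetical name, and it assumes a shell that reports death by SIGSEGV as exit status 139 (128 + 11), as bash and dash do on Linux.

```shell
# Hypothetical helper: run a command (stdin is passed through), then report
# whether it was killed by SIGSEGV and how much it had written to stdout.
crash_timing() {
    out=$(mktemp) || return 1
    status=0
    "$@" > "$out" 2> /dev/null || status=$?
    bytes=$(( $(wc -c < "$out") ))
    rm -f "$out"
    if [ "$status" -eq 139 ]; then          # 128 + 11 (SIGSEGV)
        if [ "$bytes" -eq 0 ]; then
            echo "segfault early: no output"
        else
            echo "segfault late: $bytes bytes of output"
        fi
    else
        echo "exit status $status, $bytes bytes of output"
    fi
}

# Example, reusing the pipeline from this comment:
# seq 4000000 | xargs -I{} cat slashed |
#     crash_timing cstlemma1/cstlemma -I '$w/$t\n' -d models/dict0 -f empty
```

Because stdout is normally block-buffered when it is not a terminal, the input has to be large enough that some output would have been flushed before a late crash, which is why the test repeats the file so many times.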
There is a new version on GitHub
Thanks. All my test combinations produce a proper result now. All combinations that use -t produce the correct result. Those that use an -I format produce an incorrect result (they fail to recognize that the word form with that tag is in the dictionary). The trailing space is gone, as you promised.
I am glad to know cstlemma works better now. If the lemmatizer must take PoS-tags into account, then use -t, even if there also is a -I input format that specifies that there are PoS-tags in the input. Without -t, the lemmatizer defaults to -t-, and ignores PoS-information in the input. (If the PoS-tagger isn't very good, ignoring the PoS-tags may in fact give better lemmatization results.)
Aha, I had misunderstood -t. Sorry about that. With the recent fixes, and with -t added to those test cases that use -I, all my current tests work now, including the cases that I most need to work.
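For reference, the combination described here (an -I format plus -t) might look like the sketch below for the tab-separated input. The options are the ones used earlier in this thread; `lemmatize_tabbed` and the `CSTLEMMA` variable are hypothetical names introduced only for illustration.

```shell
# CSTLEMMA is a hypothetical variable holding the path to the binary;
# it falls back to the path used in this thread.
lemmatize_tabbed() {
    # -t : take the PoS tags into account (without it the default is -t-)
    # -I : only describes the input layout, here word TAB tag NL
    "${CSTLEMMA:-cstlemma1/cstlemma}" -t -I '$w\t$t\n' -d models/dict0 -f empty
}

# Example: lemmatize_tabbed < tabbed
```

The single quotes keep the shell from expanding `$w` and `$t`, so the format string reaches the program unchanged.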
So I think this issue is solved. Thank you.
I encountered new problems with the input format option since the previous issue (many thanks for the prompt fixing of that). Briefly:
The same problems occur both with -I '$w/$t\n' on slash-separated tokens and with -I '$w\t$t\n' on tab-separated tokens. (I will need to use tab. The actual model has slashes in some tags.) (I saw indications that the tags or the third "word" are considered "unknown" when I played with output formats in earlier experiments but this is not in the attached logs.)
While investigating this I also saw that the STREAM==0 version of cstlemma sends its diagnostics to stdout (when writing to a file) or nowhere (when writing to stdout). Surely it would be better to use stderr. But this is just by the way.
I attach a summary of my experiments, a shell log, the compilation script (fresh clones yesterday, compiled with and without STREAM), the test script, and an archive containing the two input files (slashed, tabbed) and the different output files (correct lemmas accompanying tags when using -t, incorrect lemmas, incorrect lemmas with the unexpected trail of empty-looking lines when using appropriate -I from file or from stdin). Hope some of these are useful. (The dictionary is from Språkbanken's sparv distribution, I'm not attaching that yet, but see test log for examples of the format.)
sum.txt log.txt test.sh.txt compile.sh.txt input-output.zip