ufal / morphodita

MorphoDiTa: Morphologic Dictionary and Tagger
Mozilla Public License 2.0

Dictionary fails to load with perl bindings #8

Closed dlukes closed 8 years ago

dlukes commented 8 years ago

When trying to load a custom dictionary using the perl bindings, ...

my $morpho = Morpho::load($path);

... the attempt fails with the following error:

perl: morpho/morpho_dictionary.h:109: void ufal::morphodita::morpho_dictionary<LemmaAddinfo>::load(ufal::morphodita::binary_decoder&) [with LemmaAddinfo = ufal::morphodita::czech_lemma_addinfo]: Assertion `lemma_offset < (1<<24) && lemma_len < (1<<8)' failed.
[1]    179285 abort (core dumped)

The dictionary was encoded using MorphoDiTa v1.3.0, the bindings are installed from CPAN and I've also tried compiling them manually but the result is the same. run_morpho_analyze loads the dictionary without problems.

I'd be grateful for any possible pointers as to what might be the problem! :)

foxik commented 8 years ago

That is definitely weird :-)

If it is possible for you to share the dictionary, that would make debugging much easier (you can use my email if you like).

Both the Perl bindings and run_morpho_analyze use the same source to load the dictionary, so for them to give different results, they probably have to be compiled with a different compiler or different compiler options. If you compile both the Perl bindings and run_morpho_analyze from the same source tree, do they still give different results?

dlukes commented 8 years ago

Thanks, I'll send you a download link :) run_morpho_analyze is the pre-compiled binary from this repository, but I'll try compiling it myself and see what happens.

foxik commented 8 years ago

Thanks for the dictionary, I understand the issue now.

The problem is that an internal limit is indeed exceeded. However, the binary releases of MorphoDiTa up to 1.3 (but not 1.9 or the current HEAD) strip the asserts from the binary. The asserts should not be removed, and they are not removed in the language bindings, when compiling MorphoDiTa manually, or in the 1.9 binary releases. Therefore, even though run_morpho_analyze from the MorphoDiTa 1.3 binary release does load the model, some operations will return bogus lemmas (or crash).

I will think about how to relax the internal limit without hurting performance. Until then, the only workaround I can think of is to use a smaller dictionary (perhaps two smaller dictionaries instead of one?).

BTW, the internal limit applies only to the in-memory structure used to search the dictionary; the dictionary file itself is fine.
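
For reference, the two bounds in the failed assertion can be evaluated directly. This is only a sketch that mirrors the asserted condition in shell arithmetic, for the version that produced the error. The helper name within_limits is hypothetical, and the thread does not say what unit lemma_offset is counted in.

```shell
# Bounds copied from the assertion:
#   lemma_offset < (1<<24) && lemma_len < (1<<8)
LEMMA_OFFSET_LIMIT=$((1 << 24))   # 16777216
LEMMA_LEN_LIMIT=$((1 << 8))       # 256

# Hypothetical helper: succeeds only if both values are below the bounds.
# $1 = lemma offset, $2 = lemma length
within_limits() {
  [ "$1" -lt "$LEMMA_OFFSET_LIMIT" ] && [ "$2" -lt "$LEMMA_LEN_LIMIT" ]
}

within_limits 1000 12 && echo "fits"
within_limits 20000000 12 || echo "offset too large"
```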

dlukes commented 8 years ago

OK, thanks a lot for clearing this up! :)

foxik commented 8 years ago

I increased the internal limit and now your dictionary can be loaded. Every loaded dictionary now takes ~5-8% more memory, but that is acceptable.

BTW, your dictionary requires approximately 3 times more memory than the regular morfflex dictionary, which is why you were able to exceed the limit :-)

foxik commented 8 years ago

FYI, I will publish a new release of MorphoDiTa containing the fix soon.

foxik commented 8 years ago

MorphoDiTa 1.9.1 released.

dlukes commented 8 years ago

> BTW, your dictionary requires approximately 3 times more memory than the regular morfflex dictionary, which is why you were able to exceed the limit :-)

Huh? Wonder why that is :) It's not that much larger than the original one (~ 100s of MB) in source format... It's probably less repetitive → less finite-state compressible?

> MorphoDiTa 1.9.1 released.

Awesome, thanks so much for the great work once again!

foxik commented 8 years ago

> Huh? Wonder why that is :) It's not that much larger than the original one (~ 100s of MB) in source format... It's probably less repetitive → less finite-state compressible?

I looked into it and something really went wrong during dictionary creation. Your dictionary contains 5.8M lemmas (compared to ~1M lemmas in czech-morfflex-*), but most of them are just duplicated (which should not happen, of course).

My guess is that you did not generate the input to encode_dictionary correctly -- according to the documentation, all forms of one lemma must be in one continuous region (and no line may be duplicated, so the docs suggest using sort -u to achieve both). So maybe you used some script to preprocess the input, the script modified the lemmas (maybe by duplicating some lines and changing the lemmas on them), but you did not sort the output before running encode_dictionary?

We do not check this condition nor sort the input file because it is huge (~7GB for morfflex). But maybe we could store the unique lemmas and check that a lemma does not reappear once its region has ended in the input. If my guess is indeed what happened, I will add the check so that you will be warned next time this happens. Also, if my guess is correct, it means that enlarging the internal limit was not really necessary :-)
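
A rough sketch of such a contiguity check, written as a standalone shell function rather than as MorphoDiTa's own implementation. It assumes the input is tab-separated with the lemma in the first column (check the encode_dictionary documentation for the exact format); the name check_contiguous is hypothetical. It remembers every lemma seen so far and complains when a lemma starts a second region.

```shell
# Hypothetical helper: reads dictionary lines on stdin and fails if any
# lemma (assumed to be the first tab-separated field) appears in more
# than one contiguous region.
check_contiguous() {
  awk -F '\t' '
    $1 != prev {
      # A new region starts here; the lemma must not have been seen before.
      if ($1 in seen) {
        printf "lemma \"%s\" reappears at line %d\n", $1, NR
        bad = 1
      }
      seen[$1] = 1
      prev = $1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage: the lemma "a" is split into two regions, so the check reports it.
printf 'a\tT1\tx\nb\tT1\tz\na\tT2\ty\n' | check_contiguous || echo "input is not contiguous"
```

Like the idea described above, this keeps all unique lemmas in memory, so its memory use grows with the number of lemmas rather than with the number of lines.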

dlukes commented 8 years ago

You're exactly correct that I'm using a script to add / remove some lemmas prior to encoding the dictionary, but the sort -u call is right there in the pipeline, as per the documentation... (I even have a vague recollection of forgetting to put it there in one of my earlier experiments and MorphoDiTa exiting with an error, but that must be a false memory if you say MorphoDiTa currently doesn't check this.)

Weird. Maybe the sorting fails silently in some way? I'll need to investigate further. Thanks a lot for diagnosing this! :)

foxik commented 8 years ago

@dlukes Please note that I did not test 1.9.1 thoroughly and there is a regression in it (morpho::generate may return fewer results than are present in the dictionary). So please update to 1.9.2, sorry.

As for the sort -u present in your pipeline -- weird. (The check for repeating lemmas was not in MorphoDiTa until 1.9.2).

Could you please try using MorphoDiTa 1.9.2 in your original pipeline, so that we can be sure the "repeated lemmas on the input of encode_dictionary" issue is not the cause of the problem? Your dictionary is not constructed correctly: if you run run_morpho_cli from MorphoDiTa 1.9.2 on your dictionary and input "oplatka[TAB][ENTER]", you will see that the lemma 'oplatka' is repeated, which it shouldn't be (you can compare with the output for czech-morfflex-160310.dict). We should find out what went wrong during the dictionary creation. Thanks :-)

dlukes commented 8 years ago

Sorry for replying so late, I had a lot of other stuff on my plate in the meantime :) I'll do as you suggest and report back!

dlukes commented 8 years ago

OK, finally found some time to get back to this :) Wish I'd done it earlier, the issue turned out to be trivial: locales. With my default English UTF-8 locale (added explicitly here for replicability):

$ echo -e "aafia a\naafia b\naafía a\naafía b" | LC_ALL=en_US.utf-8 sort -u
aafia a
aafía a
aafia b
aafía b

The result is the same with the Czech UTF-8 locale. The solution (as so many times before): use LC_ALL=C.

$ echo -e "aafia a\naafia b\naafía a\naafía b" | LC_ALL=C sort -u
aafia a
aafia b
aafía a
aafía b

At any rate, it's a great help that MorphoDiTa now alerts the user that there's something wrong with the dictionary ordering!
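
The interleaving above is what one would expect from locale-aware collation, which typically compares base letters before accents, so "aafia" and "aafía" tie at the first level and the "a"/"b" suffix decides the order. A small sketch of pinning the sort to byte order regardless of the caller's locale follows; the wrapper name sort_bytewise is hypothetical.

```shell
# Hypothetical wrapper: sort and deduplicate in byte order, ignoring the
# locale of the calling shell. Extra arguments are passed on to sort.
sort_bytewise() {
  LC_ALL=C sort -u "$@"
}

# Same four lines as in the example above; \303\255 is the UTF-8
# encoding of "í", written as octal escapes to keep the script ASCII.
printf 'aafia a\naafia b\naaf\303\255a a\naaf\303\255a b\n' | sort_bytewise
```

Setting LC_ALL only for the sort command keeps the rest of the pipeline in whatever locale it already uses.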

foxik commented 8 years ago

Thank you very much for the investigation, it is good that we know why the problem happened and that it will not repeat itself in the future :-)