
updating to Morfologik 2.1.0 #368

Closed jaumeortola closed 8 years ago

jaumeortola commented 8 years ago

@dweiss What is the new way to build a dictionary in Morfologik 2.1.0? We need to rewrite these lines in LanguageTool.

dweiss commented 8 years ago

Well, look at this tool: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-tools/src/main/java/morfologik/tools/DictCompile.java

A number of things have been made more strict when "compiling" dictionaries. Previously you could feed it arbitrary input and it would compile; now the input is sanity-checked first to make sure dictionaries contain only valid entries.

If you need to compile from entries in memory (byte[]), I'd advise you to do a similar sanity check on the content of those entries before feeding them to the encoder. I don't know enough about LT to help much here, but the tests in Morfologik should cover API usage pretty well?
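
A minimal sketch of that in-memory path, assuming the morfologik 2.x builder API (morfologik.fsa.builders) and an illustrative entry layout with '+' as the separator -- this is not LT's actual code:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import morfologik.fsa.FSA;
import morfologik.fsa.builders.CFSA2Serializer;
import morfologik.fsa.builders.FSABuilder;

public class InMemoryCompile {

  private static final byte SEPARATOR = '+'; // must match fsa.dict.separator in the *.info file

  public static void main(String[] args) throws IOException {
    byte[][] entries = {
        "cat+cats+NNS".getBytes(StandardCharsets.UTF_8),
        "cat+cat+NN".getBytes(StandardCharsets.UTF_8)
    };

    // Sanity-check each entry before feeding it to the builder,
    // e.g. enforce the exact number of separator bytes.
    for (byte[] entry : entries) {
      int separators = 0;
      for (byte b : entry) {
        if (b == SEPARATOR) {
          separators++;
        }
      }
      if (separators != 2) {
        throw new IllegalArgumentException(
            "Invalid entry: " + new String(entry, StandardCharsets.UTF_8));
      }
    }

    // FSABuilder requires the input in lexicographic byte order.
    Arrays.sort(entries, FSABuilder.LEXICAL_ORDERING);
    FSA fsa = FSABuilder.build(entries);

    try (OutputStream os = new FileOutputStream("output.dict")) {
      new CFSA2Serializer().serialize(fsa, os);
    }
  }
}
```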

jaumeortola commented 8 years ago

Now I am able to compile LT with Morfologik 2.1.0. I have created a new branch with the changes (updatemorfologik) so they can be tested by other people. One thing remains to be done: rewriting the methods in languagetool-tools. All of them (except "build with frequency data") are now provided by Morfologik (see below). What do you think is the best way to do it? @arysin, @danielnaber.

(From Morfologik-tools:)

Usage: [options] [command] [command options]
Commands:

fsa_compile      Builds finite state automaton from \n-delimited input.
  Usage: fsa_compile [options]
    Options:
      --accept-bom
         Accept leading BOM bytes (UTF-8).
         Default: false
      --accept-cr
         Accept CR bytes in input sequences (\r).
         Default: false
      -f, --format
         Automaton serialization format.
         Default: FSA5
         Possible Values: [FSA5, CFSA2]
      --ignore-empty
         Ignore empty lines in the input.
         Default: false
    * -i, --input
         The input sequences (one sequence per \n-delimited line).
    * -o, --output
         The output automaton file.

fsa_decompile      Dumps all sequences encoded in an automaton.
  Usage: fsa_decompile [options]
    Options:
    * -i, --input
         The input automaton.
    * -o, --output
         The output file for byte sequences.

fsa_info      Print extra information about a compiled automaton file.
  Usage: fsa_info [options]
    Options:
    * -i, --input
         The input automaton.

dict_compile      Compiles a morphological dictionary automaton.
  Usage: dict_compile [options]
    Options:
      --accept-bom
         Accept leading BOM bytes (UTF-8).
         Default: false
      --accept-cr
         Accept CR bytes in input sequences (\r).
         Default: false
      -f, --format
         Automaton serialization format.
         Default: FSA5
         Possible Values: [FSA5, CFSA2]
      --ignore-empty
         Ignore empty lines in the input.
         Default: false
    * -i, --input
         The input file (base,inflected,tag). An associated metadata
         (*.info) file must exist.
      --overwrite
         Overwrite the output file if it exists.
         Default: false
      --validate
         Validate input to make sure it makes sense.
         Default: true

dict_decompile      Decompiles morphological dictionary automaton back to source state.
  Usage: dict_decompile [options]
    Options:
    * -i, --input
         The input dictionary (*.dict and a sibling *.info metadata).
      -o, --output
         The output file for dictionary data.
      --overwrite
         Overwrite the output file if it exists.
         Default: false
      --validate
         Validate decoded output to make sure it can be re-encoded.
         Default: true

dict_apply      Applies a dictionary to an input. Each line is considered an input term.
  Usage: dict_apply [options]
    Options:
    * -d, --dictionary
         The dictionary (*.dict and a sibling *.info metadata) to apply.
      -i, --input
         The input file, each entry in a single line. If not provided, stdin
         is used.
      --input-charset
         Character encoding of the input (platform's default).
      --skip-tags
         Skip tags in the output, only print base forms if found.
         Default: false
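
For example, a dict_compile run could look like this (file names, separator, and encoder choice are invented for illustration; the fsa.dict.* keys are Morfologik's standard metadata attributes):

```
# english.info -- metadata that must sit next to the input file
fsa.dict.separator=+
fsa.dict.encoding=UTF-8
fsa.dict.encoder=SUFFIX

# compile english.input (lines of the form base+inflected+tag) into english.dict
java -jar morfologik-tools.jar dict_compile -i english.input -f CFSA2
```
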
dweiss commented 8 years ago

Looked at the branch, it seems fine (again: no knowledge of LT whatsoever).

> All of them (except "build with frequency data")

Where is this snippet of code? What is "frequency data"? I assume it's something encoded in the structure of the dictionary?

jaumeortola commented 8 years ago

We take word frequencies from an XML file like this, encode each one in a byte, and put it (with a separator) after the word in spelling dictionaries (or after the POS tag in tagger dictionaries). This data is used to improve the suggestions of the Morfologik speller. The code is around here. Would it make sense to add this step to Morfologik-tools?
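
A rough sketch of what attaching that byte amounts to (a hypothetical helper; the '+' separator and exact layout are illustrative -- see the linked code for the real thing):

```java
import java.nio.charset.StandardCharsets;

class FrequencyEntries {

  /**
   * Append a one-character frequency class to a speller entry,
   * e.g. spellerEntry("walk", '+', 'E') -> "walk+E" as bytes.
   */
  static byte[] spellerEntry(String word, char separator, char freqClass) {
    return (word + separator + freqClass).getBytes(StandardCharsets.UTF_8);
  }
}
```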

dweiss commented 8 years ago

I think it's something that is downstream from the notion of a "dictionary", so I'd rather not add it to the base tools.

Currently the "tag" is considered to be "metadata" that is associated with a term-inflected pair. It would make sense to encode this frequency into the "tag" part somehow and then compile it as a regular dictionary.

There are caveats. One is that the number of "separators" is now fixed and enforced -- this was done to prevent accidental errors in which odd bytes managed to slip into the input, causing broken dictionaries. The other caveat is that the input is now "real" characters, read using a charset, so not all bytes are valid.

All this said, the dictionary is still a byte-based automaton, so you could compile it as such (using fsa_compile) and then just attach a metadata file. But I would discourage this solution.

Rather than that, I would encode the "frequency" in a normalized form, say as characters between 'a' and 'z', where each character corresponds to a proportional bucket in the frequency distribution. I doubt you need exact numbers there -- it's only meant for picking "the most likely" form, right? I once implemented a suggester based on this idea and it worked really well. Just an idea.
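
A sketch of that bucketing idea, assuming raw corpus counts and 26 roughly equal-population buckets (all names hypothetical):

```java
import java.util.Arrays;

public class FreqBuckets {

  /**
   * Map a raw frequency to a letter 'a'..'z' such that each letter
   * covers roughly the same share of the observed distribution.
   */
  static char bucketOf(long freq, long[] sortedFreqs) {
    int idx = Arrays.binarySearch(sortedFreqs, freq);
    if (idx < 0) {
      idx = -idx - 1; // insertion point for unseen frequencies
    }
    int bucket = (int) ((long) idx * 26 / sortedFreqs.length);
    return (char) ('a' + Math.min(bucket, 25));
  }

  public static void main(String[] args) {
    long[] freqs = {1, 2, 2, 5, 7, 10, 40, 100, 1000, 50000};
    long[] sorted = freqs.clone();
    Arrays.sort(sorted);
    for (long f : freqs) {
      System.out.println(f + " -> " + bucketOf(f, sorted));
    }
  }
}
```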

jaumeortola commented 8 years ago

@dweiss

I'm a bit confused. There is a change in the input format of dictionaries: we were used to "inflected \t base \t tag", but now it is "base+inflected+tag" (with '+' as the separator). I'm going to list the different dictionaries we use; let's see what the input format of each one is and which tool is right for building them.
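
For what it's worth, converting the old format to the new one is a mechanical per-line transform; a sketch (assuming tab-separated input and '+' as the configured separator):

```java
class FormatConverter {

  // old: inflected \t base \t tag  ->  new: base+inflected+tag
  static String convertLine(String oldLine, char separator) {
    String[] parts = oldLine.split("\t", -1); // [inflected, base, tag]
    String tag = parts.length > 2 ? parts[2] : "";
    return parts[1] + separator + parts[0] + separator + tag;
  }

  public static void main(String[] args) {
    System.out.println(convertLine("cats\tcat\tNNS", '+')); // cat+cats+NNS
  }
}
```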

(The frequency data is already encoded as a character between A and Z.)

Do you see any problem here? Is everything correct?

What is the difference between FSA5 and CFSA2? What should we use?

Once this is clear, I will be able to rewrite the programs in languagetool-tools.

dweiss commented 8 years ago

The thing you should understand first is that a "dictionary" is actually a layer of interpretation of the data encoded in an automaton. So an entry:

base+inflected+tag

is actually encoded in the automaton as a sequence of bytes (with the inflected form preprocessed, but that's another story); there is no special meaning to the '+' symbol or anything. The DictionaryLookup class overlays the meaning of a "separator" when it traverses the automaton. It does not need to be a '+' sign -- it can be any character (configured in the dictionary properties).
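
To make the layering concrete, a minimal lookup through that interpretation layer might look like this (morfologik 2.x stemming API; the file names are assumed):

```java
import java.io.IOException;
import java.nio.file.Paths;

import morfologik.stemming.Dictionary;
import morfologik.stemming.DictionaryLookup;
import morfologik.stemming.WordData;

public class LookupDemo {
  public static void main(String[] args) throws IOException {
    // english.dict plus a sibling english.info describing separator, encoding, encoder
    Dictionary dictionary = Dictionary.read(Paths.get("english.dict"));
    DictionaryLookup lookup = new DictionaryLookup(dictionary);
    for (WordData wd : lookup.lookup("cats")) {
      System.out.println(wd.getStem() + "\t" + wd.getTag());
    }
  }
}
```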

I changed the default format of dict_compile to "base+inflected+tag" so that when you sort the input file all the base forms form a contiguous block. This helps in debugging and eyeballing the input file. Previously the same base form would be scattered around, which was misleading.

The difference between CFSA2 and FSA5 is in the way the automaton is serialized/compressed. If you want to know the details, I wrote a paper about it, but it is safe to assume that CFSA2 will produce smaller dictionaries at a slightly higher cost of traversing them.
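
In code the choice boils down to picking a serializer; a small sketch comparing the two on a toy input (class names as in the 2.x morfologik.fsa.builders package; treat them as an assumption):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import morfologik.fsa.FSA;
import morfologik.fsa.builders.CFSA2Serializer;
import morfologik.fsa.builders.FSA5Serializer;
import morfologik.fsa.builders.FSABuilder;

public class FormatComparison {
  public static void main(String[] args) throws IOException {
    byte[][] entries = {
        "cat".getBytes(StandardCharsets.UTF_8),
        "cats".getBytes(StandardCharsets.UTF_8),
        "dog".getBytes(StandardCharsets.UTF_8)
    };
    Arrays.sort(entries, FSABuilder.LEXICAL_ORDERING);
    FSA fsa = FSABuilder.build(entries);

    // Serialize the same automaton in both on-disk formats and compare sizes.
    ByteArrayOutputStream fsa5 = new ByteArrayOutputStream();
    new FSA5Serializer().serialize(fsa, fsa5);
    ByteArrayOutputStream cfsa2 = new ByteArrayOutputStream();
    new CFSA2Serializer().serialize(fsa, cfsa2);
    System.out.println("FSA5: " + fsa5.size() + " bytes, CFSA2: " + cfsa2.size() + " bytes");
  }
}
```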

I hope this addresses your bullet list of questions to some degree... I mean, there is no single good answer. You can use fsa_compile for all of these, or use dict_compile (and leave the inflected form/tag empty). Whatever you use, the decision also affects how you then use these automata -- fsa_compile creates an automaton which has to be processed with a very low-level API, while dict_compile creates a "higher-level" dictionary for which the API is much more abstract... @milekpl may be more helpful here, since I don't know how you're using morfologik in LT (and I'm currently on holidays).

danielnaber commented 8 years ago

@jaumeortola Can this issue be closed?