
german_compound_splitter

german_compound_splitter is a Python module to split up German compound words (for instance the famous Donaudampfschifffahrtskapitänsmützenabzeichen), based on a given external German dictionary and a highly-efficient implementation of the Aho–Corasick algorithm, pyahocorasick, for multi-pattern string search and retrieval.
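To illustrate the general idea, here is a minimal sketch of dictionary-based compound splitting in plain Python. This is NOT the module's actual algorithm (the module uses an Aho–Corasick automaton for efficient multi-pattern matching); it is only a naive greedy longest-match scan, shown to make the approach concrete:

```python
# Minimal sketch of dictionary-based compound splitting -- NOT the
# module's actual algorithm, just an illustration of the idea:
# greedily match the longest dictionary word at each position,
# allowing a linking "s" (Fugen-s) between parts.
def naive_split(compound, dictionary):
    words, pos = [], 0
    text = compound.lower()
    while pos < len(text):
        match = None
        # try the longest possible word first (greedy)
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                match = text[pos:end]
                break
        if match is None:
            if text[pos] == 's':  # skip a linking "s" between parts
                pos += 1
                continue
            return None  # unknown part -> give up
        words.append(match.capitalize())
        pos += len(match)
    return words

# toy dictionary; the real module loads an external one with
# millions of entries
dictionary = {'donau', 'dampf', 'schiff', 'fahrt', 'kapitän',
              'mütze', 'abzeichen'}
print(naive_split('Donaudampfschifffahrt', dictionary))
# → ['Donau', 'Dampf', 'Schiff', 'Fahrt']
```

A scan like this is quadratic per position in the worst case; the Aho–Corasick automaton used by the module finds all dictionary matches in a single pass over the compound, which is why it scales to dictionaries with millions of entries.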

This method works well compared to the other alternatives for German compound splitting that I know of. Depending on your use case, you may want normalized forms of the resulting components (split words), most likely the singular form. Try make_singular=True; it works in most cases for me.

Sometimes the result contains artifacts -- the dictionary may lack certain entries at this point, or it may be a bug :-) Nonetheless, you can experiment with the parameter only_nouns=False, which allows any type of word (verb, adjective, prefix, suffix) in the dictionary to be used for splitting. Sometimes the results are as good as or better than when limiting the splitting to nouns.

Also depending on the use case is the option to mask everything not found in the dictionary as an "unknown" split word in the results. Otherwise, and by default, the compound splitter tries to work around such gaps or simply copies the unknown part directly into the results. You can spot this when using only_nouns=True and a split word is nevertheless lower-case: it is most likely a part not found in the dictionary, or one interfering with other words from the dictionary.

The same applies to words you would expect to have been split up appropriately (this may occur even for "simple" words). At the time of writing this README, some very basic words like "Zeit" or "Fahrt" are not in the dictionary (either by mistake of the dictionary's author or for other reasons).

License and Attribution

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

To provide attribution or cite this work please use the following text snippet:

german_compound_splitter, Copyright 2020 by repodiac, see https://github.com/repodiac for updates and further information

Version History

Installation/Setup

Installation is easy using pip and its built-in git package support, based on setup.py:
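A typical install via pip's git support might look like the following (the exact URL is an assumption based on the repository name; check the repository page for the authoritative command):

```shell
# install directly from the GitHub repository
# (URL assumed from the repo name repodiac/german_compound_splitter)
pip install git+https://github.com/repodiac/german_compound_splitter
```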

Setup:

Some Notes on an External Dictionary

Due to unclear licensing, and since the terms depend on the kind of use (private, research, commercial, ...), I cannot include a dictionary here. I strongly recommend the Free German Dictionary by Jan Schreiber, though. It is updated frequently and currently contains more than 2.1 million entries! Apart from that, you can use any dictionary containing one entry per line.

Note:

Documentation

Example Usage

In Python code or as library:

from german_compound_splitter import comp_split

compound = 'Donaudampfschifffahrtskapitänsmützenabzeichen'
# please load an appropriate (external) dictionary, see the notes in section Installation/Setup on the dictionary
input_file = 'german.dic'
ahocs = comp_split.read_dictionary_from_file(input_file)

dissection = comp_split.dissect(compound, ahocs, make_singular=True)
print('SPLIT WORDS (plain):', dissection)
print('SPLIT WORDS (post-merge):', comp_split.merge_fractions(dissection))

Input Parameters

The main method to use is dissect('Kompositumszerlegungsmaschinerie', ahocs)

It has the following input parameters:

Performance

If you have multiple compound words (or multiple splittings) to process, it is advisable to load the dictionary only once and reuse the resulting ahocorasick automaton every time after that -- loading the dictionary can take a few seconds, depending on your machine. The actual splitting, however, is blazingly fast thanks to pyahocorasick's implementation, so you can easily process a lot of compound words in very little time.
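The reuse pattern, condensed (same comp_split calls as in the Example Usage section; 'german.dic' stands for your external dictionary file, so this will not run without one):

```python
from german_compound_splitter import comp_split

# Load the dictionary and build the automaton ONCE -- this is the
# slow step ('german.dic' stands for your external dictionary file).
ahocs = comp_split.read_dictionary_from_file('german.dic')

# Reuse the automaton for every compound -- each dissect() call is fast.
for compound in ['Schifffahrt', 'Donaudampfschiff']:
    print(comp_split.dissect(compound, ahocs, make_singular=True))
```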

Issues and Comments

Please open issues on GitHub for bugs or feature requests. You can also reach out to me via email.