jprante / elasticsearch-analysis-decompound

Decompounding Plugin for Elasticsearch
GNU General Public License v2.0

Decompound adds letters #6

Open marbleman opened 10 years ago

marbleman commented 10 years ago

Hi,

I just got stuck on a "FetchPhaseExecutionException" when using highlighting together with the decompound filter:

InvalidTokenOffsetsException[Token verzinnte exceeds length of provided text sized 83]
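For context, the exception surfaces during an ordinary highlighted search, something along these lines (index and field names here are just placeholders, not my actual setup):

```
curl -XPOST 'localhost:9200/products/_search' -d '{
  "query": {
    "match": { "description": "verzinnt" }
  },
  "highlight": {
    "fields": { "description": {} }
  }
}'
```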

Drilling down into that was a little tricky, since the words causing the exceptions did not occur in the indexed text! After a while I found the following:

Using decompound adds some tokens to the index that are longer than the original:

e.g. for "Kupferleiter, verzinnt" it adds "verzinnt" AND "verzinnte". I have no clue what "verzinnte" is good for, but it sounds to me like a plural form. However, since it is the last word in the text, highlighting fails because it exceeds the end of the text.

Here is an example analysis of "verzinnt":

{ "tokens": [ { "token": "verzinnt", "start_offset": 0, "end_offset": 8, "type": "", "position": 1 }, { "token": "verzinnte", "start_offset": 0, "end_offset": 9, "type": "", "position": 1 } ] }

My guess: the end_offset of 9 is the problem here, because the analyzed text is only 8 characters long. So when it comes to highlighting, the highlighter probably tries to highlight "verzinnte" as well, which leads to the exception...
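For reference, token output like the above can be reproduced with the _analyze API; the analyzer name is just a placeholder for whatever custom analyzer chains the decompound filter:

```
curl -XGET 'localhost:9200/products/_analyze?analyzer=german_decompound&text=verzinnt'
```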

jprante commented 10 years ago

Good catch. Decompounding uses a probabilistic model, which is not 100% reliable. It looks like "verzinnt" was not in the training set, so the algorithm fails.

Maybe it helps to reduce or increase the threshold parameter a little bit.
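A sketch of where the parameter goes in the index settings; the filter and analyzer names are placeholders, and the threshold value shown is only an illustrative starting point, not a recommendation:

```
"settings": {
  "analysis": {
    "filter": {
      "my_decompound": {
        "type": "decompound",
        "threshold": 0.51
      }
    },
    "analyzer": {
      "german_decompound": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [ "lowercase", "my_decompound" ]
      }
    }
  }
}
```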

marbleman commented 10 years ago

Unfortunately, changing the threshold, even by much more than a little bit, does not seem to have any effect at all...

Is there a way to train the decompounder? Especially with technical vocabulary, compounding sometimes becomes really insane: "Aluminiumtiefziehteile", for example, should split into "Aluminium", "tiefziehen" and "Teil", and not into "Aluminium", "tief" and "ziehteile". In this case "tiefziehen" is itself still a compound word but must not be split any further, otherwise the context/original meaning gets lost.

I mean, it is already amazing to see the decompound and baseform filters in action together, splitting words like "Straßenbahnschienenritzenreiniger". However, as in the example above, it would be cool to train the decompounder to accept "Straßenbahn" as a word that must not be decompounded any further into "Straße" and "Bahn".
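As a point of comparison (not a feature of this plugin): Elasticsearch's built-in dictionary_decompounder splits only on subwords from a user-supplied word list, so unwanted splits can be ruled out by leaving words off the list. The list below is made up, and this filter only matches surface substrings, so it cannot map "tiefzieh..." back to the baseform "tiefziehen":

```
"filter": {
  "controlled_decompound": {
    "type": "dictionary_decompounder",
    "word_list": [ "aluminium", "tiefzieh", "teil", "straßenbahn" ]
  }
}
```

With "straße" and "bahn" left off the list, "straßenbahn" would stay intact.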

jprante commented 10 years ago

I started to rewrite the original trainer tool so it can run from the command line, but I ran short on time.

The original tool is "ASV Toolbox Baseform" with a GUI-based trainer, available at http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/Baseforms%20Tool.htm

Because I just copied the trees in binary form, I don't know the original training set. By dumping the tree files, the training set could be reconstructed. If you can spend the time and find a way to enhance the existing solution, I would appreciate it.

marbleman commented 10 years ago

I'd love to drill deeper here and enhance the solution, but I am in doubt about choosing the right strategy. It is hard to tell whether the tool could use some enhancement or whether I just lack experience with Elasticsearch, e.g. am not applying the right filters.

For example, I cannot judge whether training the decompounder is the way to go, or whether it would be better to have a dictionary of compound words that must not be decompounded. I also noticed that there is a huge difference in the search results when I decompound the words myself before searching. It seems to me that "default_operator": "AND" does not apply to automatically decompounded words. Instead I get results containing part A OR part B of a decompounded word. Maybe this is a real issue, or maybe I just missed some analysis tweaks...
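If I read the analyze output above correctly, this would be consistent with how multiple tokens at the same position are handled: the decompounded parts share the position of the original token, and same-position tokens are treated like synonyms, i.e. OR'ed, no matter what the default operator says. A sketch of the kind of query where I see this (field name is a placeholder):

```
{
  "query": {
    "query_string": {
      "default_field": "description",
      "default_operator": "AND",
      "query": "Kupferleiter verzinnt"
    }
  }
}
```

Here AND applies between "Kupferleiter" and "verzinnt", but if the search analyzer decompounds "Kupferleiter" into "Kupfer" and "Leiter" at one position, those alternatives are OR'ed, so documents containing only "Kupfer" still match. Decompounding at index time only, with a non-decompounding search analyzer, might be one way to verify this.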

Right now I am preparing a list of issues and examples showing what could improve the results from my point of view. Maybe you can comment on it when I am done, so we can pin down the issues to be addressed with further investigation/coding/training.

Pictor13 commented 9 years ago

I am having the same issue with the plugin.

"InvalidTokenOffsetsException[Token l-ops exceeds length of provided text sized 93]"

Since the nature of the bug is rather unpredictable, I cannot foresee or prevent the exception (I also don't have control over the data to be indexed).

Can you do something about this?

It is such a pity to stop using this plugin just because of a few exceptions. It usually works really well! Please let us know. Thanks.