meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents

MIT License

261 stars 89 forks

feat: Adds German compound words decomposition with new segmenter #303

Closed luflow closed 2 months ago

luflow commented 3 months ago

Pull Request

What does this PR do?

Adds a first version of dictionary-based decomposition for German compound words (the dictionary is based on https://github.com/uschindler/german-decompounder/)
Adds a benchmark with German sentences

PR checklist

Please check if your PR fulfills the following requirements:

[X] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
[X] Have you read the contributing guidelines?
[X] Have you made sure that the title is accurate and descriptive of the changes?

luflow commented 3 months ago

I assume this could be a very expensive algorithm, because every possible word length is checked against the dictionary?

Not sure if there is a better solution, but at least a first version for compound words :)
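To make the cost concern concrete, here is a minimal sketch of dictionary-based decompounding that tries every prefix length, longest first, and backtracks when the remainder cannot be split. It is not the code from this PR: the function name `decompose` and the use of a plain `HashSet` instead of an FST are assumptions made for illustration. Each call performs up to one lookup per prefix length, and backtracking can revisit the same suffix several times, which is where the expense comes from.

```rust
use std::collections::HashSet;

/// Hypothetical helper: splits `word` into dictionary lemmas by trying every
/// prefix length, longest first, and backtracking when the remainder cannot
/// be split. Returns `None` when no full decomposition exists.
fn decompose<'a>(word: &'a str, dict: &HashSet<&str>) -> Option<Vec<&'a str>> {
    if word.is_empty() {
        return Some(Vec::new());
    }
    // Candidate split points are char boundaries, so multi-byte letters
    // such as 'ü' are never cut in half.
    let mut ends: Vec<usize> = word.char_indices().map(|(i, _)| i).skip(1).collect();
    ends.push(word.len());
    for &end in ends.iter().rev() {
        let (head, tail) = word.split_at(end);
        if dict.contains(head) {
            // Only accept this prefix if the rest can also be decomposed.
            if let Some(mut rest) = decompose(tail, dict) {
                rest.insert(0, head);
                return Some(rest);
            }
        }
    }
    None
}
```

With a dictionary containing `haus`, `tür`, and `schlüssel`, calling `decompose("haustürschlüssel", &dict)` returns those three parts in order. Memoizing results per suffix, or walking an FST or trie once per position, would avoid most of the repeated lookups.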

luflow commented 3 months ago

Also another open question: can we even use the dictionary?

The original author has it under the GNU GPL: https://github.com/uschindler/german-decompounder/blob/master/NOTICE.txt

luflow commented 3 months ago

@curquiza @ManyTheFish I fixed the fmt and clippy issues. Please rerun.

luflow commented 3 months ago

Hi @ManyTheFish!

Do you have any instructions to build the fst file? I could not find any material online - especially because FST is also used in other contexts like R but does something totally different 🤣

Otherwise, the leftmostmatch functionality also works with a word dictionary, if I understand it correctly?

ManyTheFish commented 3 months ago

> Do you have any instructions to build the fst file? I could not find any material online - especially because FST is also used in other contexts like R but does something totally different 🤣

You can use the CLI fst-bin to build your dictionary from a source file. 😄

> Otherwise, the leftmostmatch functionality also works with a word dictionary, if I understand it correctly?

Yes, you can build it from an iterator over str, so it's convenient.

luflow commented 3 months ago

@ManyTheFish I extended the FstSegmenter with two options: a minimum lemma length, and a flag that prevents the segmenter from emitting single letters. That keeps my dictionary even smaller and may also be useful for other languages later?

The dictionary is now also transformed into an FST file.

Let me know what you think :)
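The two options described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `FstSegmenter`: the names `SegmenterOptions`, `min_lemma_chars`, `allow_char_split`, and `segment` are hypothetical, and a plain `HashSet` stands in for the FST. The sketch does greedy longest-prefix matching and shows where each option takes effect.

```rust
use std::collections::HashSet;

/// Hypothetical options mirroring the two knobs described in the comment.
struct SegmenterOptions {
    /// Dictionary matches shorter than this many chars are ignored.
    min_lemma_chars: usize,
    /// When false, text with no dictionary match is emitted as one piece
    /// instead of being split char by char.
    allow_char_split: bool,
}

/// Greedy longest-prefix segmentation over a plain word set (an assumption;
/// the real segmenter walks an FST instead).
fn segment<'a>(text: &'a str, dict: &HashSet<&str>, opts: &SegmenterOptions) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Walk every prefix that ends on a char boundary and keep the
        // longest one that is long enough and present in the dictionary.
        let longest = rest
            .char_indices()
            .map(|(i, c)| i + c.len_utf8())
            .enumerate()
            .filter(|&(n, end)| n + 1 >= opts.min_lemma_chars && dict.contains(&rest[..end]))
            .map(|(_, end)| end)
            .last();
        let end = match longest {
            Some(end) => end,
            // No match: fall back to a single char if that is allowed...
            None if opts.allow_char_split => {
                rest.chars().next().map_or(rest.len(), char::len_utf8)
            }
            // ...otherwise keep the whole unmatched remainder together.
            None => rest.len(),
        };
        let (head, tail) = rest.split_at(end);
        out.push(head);
        rest = tail;
    }
    out
}
```

In this sketch, a minimum length of 2 means a one-letter dictionary entry such as `a` never matches, so short entries do not need to be curated out of the word list by hand. With `allow_char_split` set to `false`, an unmatched remainder stays in one piece rather than being split into single letters.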

luflow commented 2 months ago

@ManyTheFish did you find time yet to look over the changes? Do you need anything else from my side? :)

meili-bors[bot] commented 2 months ago

Build failed:

tests

luflow commented 2 months ago

@ManyTheFish ok applied suggestion :)

ManyTheFish commented 2 months ago

Hello @luflow,

the test and clippy are not happy,

could you ensure that:

cargo clippy
cargo test

work on your machine please?

I'll merge as soon as the tests pass 😃

luflow commented 2 months ago

@ManyTheFish done 👍🏻

meili-bors[bot] commented 2 months ago

Build succeeded: