luflow closed this 2 months ago
I assume this could be a very expensive algorithm, because every possible word length is checked against the dictionary?
I'm not sure whether there is a better solution, but this is at least a first version for compound words :)
Another open question: can we even use the dictionary?
The original author has released it under the GNU GPL: https://github.com/uschindler/german-decompounder/blob/master/NOTICE.txt
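To make the cost concern concrete, here is a minimal sketch of the naive approach, not the code in this PR. It assumes the dictionary is a plain `HashSet` and that the split is greedy (longest prefix first); `naive_split` is a hypothetical name. It also counts the dictionary lookups, so the worst case of roughly one lookup per (start, length) pair is visible.

```rust
use std::collections::HashSet;

/// Greedy split: at each position, try every prefix length, longest first.
/// Returns the pieces plus the number of dictionary lookups performed.
/// Hypothetical helper for illustration only.
fn naive_split(word: &str, dict: &HashSet<&str>) -> (Vec<String>, usize) {
    // Work on chars so multi-byte letters such as 'ü' are never cut in half.
    let chars: Vec<char> = word.chars().collect();
    let mut pieces: Vec<String> = Vec::new();
    let mut lookups = 0;
    let mut start = 0;

    while start < chars.len() {
        let mut len = chars.len() - start;
        // Shrink the candidate until it is in the dictionary
        // or only a single character is left.
        while len > 1 {
            lookups += 1;
            let candidate: String = chars[start..start + len].iter().collect();
            if dict.contains(candidate.as_str()) {
                break;
            }
            len -= 1;
        }
        let piece: String = chars[start..start + len].iter().collect();
        pieces.push(piece);
        start += len;
    }
    (pieces, lookups)
}
```

For `haustür` with a dictionary containing `haus` and `tür`, this sketch yields the two lemmas after five lookups. A word with no dictionary hits degrades to one lookup per candidate length at every position, which is the quadratic behaviour the question above is worried about.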
@curquiza @ManyTheFish I fixed the fmt and clippy issues. Please rerun the checks.
Hi @ManyTheFish!
Do you have any instructions to build the fst file? I could not find any material online - especially because FST is also used in other contexts like R but does something totally different 🤣
Otherwise, the leftmostmatch functionality also works with a word dictionary, if I understand it correctly?
> Do you have any instructions to build the fst file? I could not find any material online - especially because FST is also used in other contexts like R but does something totally different 🤣
You can use the CLI fst-bin to build your dictionary from a source file. 😄
> Otherwise, the leftmostmatch functionality also works with a word dictionary, if I understand it correctly?
Yes, you can build it from an iterator over str, so it's convenient.
@ManyTheFish I extended the FstSegmenter with two options: a minimum lemma length, and a switch that prevents the segmenter from emitting single letters. That keeps my dictionary even smaller and may also be useful for other languages later.
The dictionary is now also transformed into an FST file.
Let me know what you think :)
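For readers following along, the two options can be illustrated with a small standalone sketch. This is not the FstSegmenter code from this PR: it assumes a `HashSet` in place of the FST, and the names `Options`, `min_lemma_len`, `allow_char_split`, and `segment` are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical options mirroring the two settings described above.
struct Options {
    /// Shortest dictionary match that is accepted, counted in chars.
    min_lemma_len: usize,
    /// If true, unmatched text is emitted one char at a time;
    /// if false, the unmatched remainder is kept as a single piece.
    allow_char_split: bool,
}

/// Greedy longest-match segmentation under the options above (a sketch).
fn segment(word: &str, dict: &HashSet<&str>, opts: &Options) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let min_len = opts.min_lemma_len.max(1);
    let mut out: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let remaining = chars.len() - start;
        // Try candidate lengths from the longest down to the minimum allowed.
        let hit = (min_len..=remaining).rev().find(|&len| {
            let candidate: String = chars[start..start + len].iter().collect();
            dict.contains(candidate.as_str())
        });
        let len = match hit {
            Some(len) => len,
            // No lemma found: emit a single char only if that is allowed.
            None if opts.allow_char_split => 1,
            // Otherwise keep the rest of the word together.
            None => remaining,
        };
        let piece: String = chars[start..start + len].iter().collect();
        out.push(piece);
        start += len;
    }
    out
}
```

With a dictionary of `haus`, `tür`, and `schlüssel`, a minimum length of 3, and character splitting disabled, `haustürschlüssel` splits into its three lemmas, while `hausxyz` becomes `haus` and `xyz` instead of `haus`, `x`, `y`, `z`. Raising the minimum length also means short entries never match, which is one way a dictionary can stay small.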
@ManyTheFish did you find time yet to look over the changes? Do you need anything else from my side? :)
Build failed:
@ManyTheFish OK, I applied the suggestion :)
Hello @luflow,
the tests and clippy are not happy,
could you ensure that:
cargo clippy
cargo test
work on your machine please?
I'll merge as soon as the tests pass 😃
@ManyTheFish done 👍🏻
Build succeeded:
Pull Request
What does this PR do?
PR checklist
Please check if your PR fulfills the following requirements: