Closed abratashov closed 1 year ago
@ojwb Please review this PR. I've written this script for exploring the Russian stemmer.
I've also checked it against all existing *.sbl files, and it works pretty well!
I hope it will be useful for future newcomers and contributors to Snowball :)
I can understand the motivation, but Snowball doesn't really support literal Unicode strings in source code, and it seems confusing to provide a script that changes a valid Snowball program into an invalid one.
That problem could be avoided, e.g. the script could add a comment before each line with stringdefs showing the same line with the stringdefs decoded, rather than rewriting the code itself.
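The comment-based alternative suggested above could be sketched roughly as follows. This is only an illustrative sketch, not the script from this PR: the function names are hypothetical, and it assumes stringdefs are written in the `stringdef a '{U+0430}'` form with `stringescapes {}` in effect. Other stringdef forms (for example multi-character definitions) are simply left undecoded.

```python
import re

# Assumption: each stringdef has the form  stringdef NAME '{U+XXXX}'
STRINGDEF_RE = re.compile(r"^\s*stringdef\s+(\S+)\s+'\{U\+([0-9A-Fa-f]+)\}'")
# An escape such as {a} or {sh}; assumes `stringescapes {}` is in effect.
ESCAPE_RE = re.compile(r"\{([^{}]+)\}")


def parse_stringdefs(lines):
    """Build a mapping from stringdef name to the character it stands for."""
    table = {}
    for line in lines:
        m = STRINGDEF_RE.match(line)
        if m:
            table[m.group(1)] = chr(int(m.group(2), 16))
    return table


def decode(line, table):
    """Replace every known {name} escape in the line; leave unknown ones alone."""
    return ESCAPE_RE.sub(lambda m: table.get(m.group(1), m.group(0)), line)


def annotate(source):
    """Return the source with a decoded // comment inserted before each line
    that uses stringdef escapes. The original lines are kept unchanged, so
    the program itself is not rewritten."""
    lines = source.splitlines()
    table = parse_stringdefs(lines)
    out = []
    for line in lines:
        if not STRINGDEF_RE.match(line):
            decoded = decode(line, table)
            if decoded != line:
                indent = line[: len(line) - len(line.lstrip())]
                out.append(indent + "// " + decoded.strip())
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        "stringescapes {}\n"
        "stringdef a '{U+0430}'\n"
        "stringdef m '{U+043C}'\n"
        "    '{a}{m}' (delete)\n"
    )
    print(annotate(sample))
```

Because only comment lines are added, the decoded text is visible to a reader while the compiler still sees the original stringdef-based code.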
If the current stringdef approach is less readable than literal Unicode would be, then maybe we should look at supporting literal Unicode strings. That may not result in a more readable program for some code, but then stringdef would still be an option, as we'd need to support it for compatibility anyway. There may be a difference between what's readable to someone who can read the language being stemmed and someone who can't, though.
BTW, a trick to show the program's form with stringdefs decoded is:
./snowball -utf8 algorithms/russian.sbl -o tmp -syntax
I've created a separate repository with some useful tools for Snowball development, https://github.com/abratashov/snowball_tools, so I'm closing this PR. Personally, I think it would be nice if we could keep both the Unicode and the non-Unicode (curly brackets) versions, or have some scripts that could convert between them.
Adds a script that replaces Latin chars with Unicode letters, which makes the Snowball file easier to read. The script produces a readable sbl file that can be printed out for reading and for exploring the algorithm, for example:
So, instead of:
=>