snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Add a script that replaces Latin chars with Unicode letters #174

Closed: abratashov closed this 1 year ago

abratashov commented 1 year ago

Adds a script that replaces the Latin chars in curly brackets with the Unicode letters they stand for, which makes a Snowball file easier to read. The script produces a readable sbl file that can be printed out for human reading and for exploring the algorithm, for example:

$ bin/readable_sbl ./algorithms/greek.sbl
// A stemmer for Modern Greek language, based on:
//...

So, instead of:

  //...
  define step6 as (
    do (
      [substring] among (
        '{m}{a}{t}{a}' '{m}{a}{t}{oo}{n}' '{m}{a}{t}{o}{s}' (<- '{m}{a}')
      )
    )
    test1
    [substring] among (
      '{a}' '{a}{g}{a}{t}{e}' '{a}{g}{a}{n}' '{a}{e}{y}' '{a}{m}{a}{y}' '{a}{n}' '{a}{s}' '{a}{s}{a}{y}' '{a}{t}{a}{y}' '{a}{oo}' '{e}' '{e}{y}'
      '{e}{y}{s}' '{e}{y}{t}{e}' '{e}{s}{a}{y}' '{e}{s}' '{e}{t}{a}{y}' '{y}' '{y}{e}{m}{a}{y}' '{y}{e}{m}{a}{s}{t}{e}' '{y}{e}{t}{a}{y}' '{y}{e}{s}{a}{y}'
      '{y}{e}{s}{a}{s}{t}{e}' '{y}{o}{m}{a}{s}{t}{a}{n}' '{y}{o}{m}{o}{u}{n}' '{y}{o}{m}{o}{u}{n}{a}' '{y}{o}{n}{t}{a}{n}' '{y}{o}{n}{t}{o}{u}{s}{a}{n}' '{y}{o}{s}{a}{s}{t}{a}{n}'
      '{y}{o}{s}{a}{s}{t}{e}' '{y}{o}{s}{o}{u}{n}' '{y}{o}{s}{o}{u}{n}{a}' '{y}{o}{t}{a}{n}' '{y}{o}{u}{m}{a}' '{y}{o}{u}{m}{a}{s}{t}{e}' '{y}{o}{u}{n}{t}{a}{y}'
      '{y}{o}{u}{n}{t}{a}{n}' '{i}' '{i}{d}{e}{s}' '{i}{d}{oo}{n}' '{i}{th}{e}{y}' '{i}{th}{e}{y}{s}' '{i}{th}{e}{y}{t}{e}' '{i}{th}{i}{k}{a}{t}{e}' '{i}{th}{i}{k}{a}{n}'
      '{i}{th}{o}{u}{n}' '{i}{th}{oo}' '{i}{k}{a}{t}{e}' '{i}{k}{a}{n}' '{i}{s}' '{i}{s}{a}{n}' '{i}{s}{a}{t}{e}' '{i}{s}{e}{y}' '{i}{s}{e}{s}' '{i}{s}{o}{u}{n}'
      '{i}{s}{oo}' '{o}' '{o}{y}' '{o}{m}{a}{y}' '{o}{m}{a}{s}{t}{a}{n}' '{o}{m}{o}{u}{n}' '{o}{m}{o}{u}{n}{a}' '{o}{n}{t}{a}{y}' '{o}{n}{t}{a}{n}'
      '{o}{n}{t}{o}{u}{s}{a}{n}' '{o}{s}' '{o}{s}{a}{s}{t}{a}{n}' '{o}{s}{a}{s}{t}{e}' '{o}{s}{o}{u}{n}' '{o}{s}{o}{u}{n}{a}' '{o}{t}{a}{n}' '{o}{u}' '{o}{u}{m}{a}{y}'
      '{o}{u}{m}{a}{s}{t}{e}' '{o}{u}{n}' '{o}{u}{n}{t}{a}{y}' '{o}{u}{n}{t}{a}{n}' '{o}{u}{s}' '{o}{u}{s}{a}{n}' '{o}{u}{s}{a}{t}{e}' '{u}' '{u}{s}' '{oo}'
      '{oo}{n}' (delete)
    )
  )

  define step7 as (
    [substring] among (
      '{e}{s}{t}{e}{r}' '{e}{s}{t}{a}{t}' '{o}{t}{e}{r}' '{o}{t}{a}{t}' '{u}{t}{e}{r}' '{u}{t}{a}{t}' '{oo}{t}{e}{r}' '{oo}{t}{a}{t}' (delete)
    )
  )
  //...

the script produces:

  define step6 as (
    do (
      [substring] among (
        'ματα' 'ματων' 'ματοσ' (<- 'μα')
      )
    )
    test1
    [substring] among (
      'α' 'αγατε' 'αγαν' 'αει' 'αμαι' 'αν' 'ασ' 'ασαι' 'αται' 'αω' 'ε' 'ει'
      'εισ' 'ειτε' 'εσαι' 'εσ' 'εται' 'ι' 'ιεμαι' 'ιεμαστε' 'ιεται' 'ιεσαι'
      'ιεσαστε' 'ιομασταν' 'ιομουν' 'ιομουνα' 'ιονταν' 'ιοντουσαν' 'ιοσασταν'
      'ιοσαστε' 'ιοσουν' 'ιοσουνα' 'ιοταν' 'ιουμα' 'ιουμαστε' 'ιουνται'
      'ιουνταν' 'η' 'ηδεσ' 'ηδων' 'ηθει' 'ηθεισ' 'ηθειτε' 'ηθηκατε' 'ηθηκαν'
      'ηθουν' 'ηθω' 'ηκατε' 'ηκαν' 'ησ' 'ησαν' 'ησατε' 'ησει' 'ησεσ' 'ησουν'
      'ησω' 'ο' 'οι' 'ομαι' 'ομασταν' 'ομουν' 'ομουνα' 'ονται' 'ονταν'
      'οντουσαν' 'οσ' 'οσασταν' 'οσαστε' 'οσουν' 'οσουνα' 'οταν' 'ου' 'ουμαι'
      'ουμαστε' 'ουν' 'ουνται' 'ουνταν' 'ουσ' 'ουσαν' 'ουσατε' 'υ' 'υσ' 'ω'
      'ων' (delete)
    )
  )

  define step7 as (
    [substring] among (
      'εστερ' 'εστατ' 'οτερ' 'οτατ' 'υτερ' 'υτατ' 'ωτερ' 'ωτατ' (delete)
    )
  )
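
For readers who want to try the idea without the script itself, here is a minimal Python sketch of the substitution shown above. It is not the script from this PR. It assumes that each letter is defined on its own line in the form stringdef a '{U+03B1}' and that the file uses curly brackets as string escapes; the names read_stringdefs, decode_line and decode are made up for this illustration.

```python
import re

# Assumption: every definition looks like  stringdef a '{U+03B1}'
STRINGDEF = re.compile(r"^\s*stringdef\s+(\S+)\s+'\{U\+([0-9A-Fa-f]+)\}'")
# A use of a definition, e.g. {a} or {oo}
ESCAPE = re.compile(r"\{([^{}']+)\}")


def read_stringdefs(source):
    """Map each stringdef name to the Unicode character it stands for."""
    table = {}
    for line in source.splitlines():
        match = STRINGDEF.match(line)
        if match:
            table[match.group(1)] = chr(int(match.group(2), 16))
    return table


def decode_line(line, table):
    """Replace {name} with its character; unknown names are left as-is."""
    if STRINGDEF.match(line):
        return line  # keep the definitions themselves untouched
    return ESCAPE.sub(lambda m: table.get(m.group(1), m.group(0)), line)


def decode(source):
    """Return the source text with every known {name} decoded."""
    table = read_stringdefs(source)
    return "\n".join(decode_line(line, table) for line in source.splitlines())


if __name__ == "__main__":
    sample = "\n".join([
        "stringdef m '{U+03BC}'",
        "stringdef a '{U+03B1}'",
        "'{m}{a}' (delete)",
    ])
    print(decode(sample))
```

The sketch substitutes everywhere on a line, including inside comments, and it does not handle other forms of stringdef, so treat its output as reading material only.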

abratashov commented 1 year ago

@ojwb Please review this PR. I wrote this script while exploring the Russian stemmer. I've also checked it against all the existing *.sbl files, and it works pretty well! I hope it will be useful for future newcomers and contributors to Snowball :)

ojwb commented 1 year ago

I can understand the motivation, but Snowball doesn't really support literal Unicode strings in source code, and it seems confusing to provide a script that changes a valid Snowball program into an invalid one.

That problem could be avoided: for example, the script could add a comment before each line that uses stringdefs, showing the same line with the stringdefs decoded, rather than rewriting the code itself.
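
A rough Python sketch of that comment-based variant follows. It is only an illustration, not code from the PR or from Snowball. It assumes definitions of the form stringdef a '{U+03B1}', and the function name annotate is hypothetical. The original lines are left exactly as they were, and a // comment with the decoded text is added above each line that uses a stringdef.

```python
import re

# Assumption: every definition looks like  stringdef a '{U+03B1}'
STRINGDEF = re.compile(r"^\s*stringdef\s+(\S+)\s+'\{U\+([0-9A-Fa-f]+)\}'")
# A use of a definition, e.g. {a} or {oo}
ESCAPE = re.compile(r"\{([^{}']+)\}")


def annotate(source):
    """Insert a decoded // comment above each line that uses a stringdef."""
    lines = source.splitlines()

    # First pass: collect the name -> character table.
    table = {}
    for line in lines:
        match = STRINGDEF.match(line)
        if match:
            table[match.group(1)] = chr(int(match.group(2), 16))

    # Second pass: copy every line, adding a comment where decoding changes it.
    out = []
    for line in lines:
        decoded = line
        if not STRINGDEF.match(line):
            decoded = ESCAPE.sub(
                lambda m: table.get(m.group(1), m.group(0)), line
            )
        if decoded != line:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}// {decoded.strip()}")
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "\n".join([
        "stringdef m '{U+03BC}'",
        "stringdef a '{U+03B1}'",
        "    '{m}{a}' (delete)",
    ])
    print(annotate(sample))
```

Because only comment lines are added, the program text that the compiler sees stays the same. Running the sketch twice on the same file would stack duplicate comments, so a real tool would need to detect and refresh its own annotations.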

If the current stringdef approach is less readable than literal Unicode would be, then maybe we should look at supporting literal Unicode strings. That may not result in a more readable program for some code, but stringdef would still be an option, as we'd need to keep supporting it for compatibility anyway. There may be a difference between what's readable to someone who can read the language being stemmed and what's readable to someone who can't, though.

BTW, a trick to show the program's form with stringdefs decoded is:

./snowball -utf8 algorithms/russian.sbl -o tmp -syntax

abratashov commented 1 year ago

I've created a separate repository with some useful tools for Snowball development, https://github.com/abratashov/snowball_tools, so I'm closing this PR. For my part, it would be nice if we could keep both the Unicode and the non-Unicode (curly brackets) versions, or have scripts that convert one into the other.