giellalt / template-lang-und

A template repo for new languages, as well as to update existing language repos with.
https://giellalt.uit.no/
GNU Lesser General Public License v3.0

Typos speller evaluation + README badge #7

Open snomos opened 3 years ago

snomos commented 3 years ago

For many languages (and all production languages) we have a collection of typos paired with their expected corrections. We have used this list in the past for one of several speller evaluations. It is in principle straightforward to run: send the typos in, compare the suggestions to the expected corrections, and calculate various statistics.
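The loop described above (typos in, suggestions compared to the expected correction, statistics out) could be sketched roughly as follows. This is only an illustration: the `suggest` callable stands in for whatever speller is used and is not the divvunspell API, and the `SpellerStats` fields and the toy data are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence, Tuple


@dataclass
class SpellerStats:
    """Counters for one evaluation run (field names are hypothetical)."""
    total: int = 0           # number of typos tested
    first: int = 0           # expected correction ranked first
    top5: int = 0            # expected correction among the first five
    anywhere: int = 0        # expected correction found at any rank
    no_suggestions: int = 0  # speller returned nothing at all


def evaluate(pairs: Iterable[Tuple[str, str]],
             suggest: Callable[[str], Sequence[str]]) -> SpellerStats:
    """Send each typo to `suggest` and compare against the expected form."""
    stats = SpellerStats()
    for typo, expected in pairs:
        suggestions = list(suggest(typo))
        stats.total += 1
        if not suggestions:
            stats.no_suggestions += 1
            continue
        if expected in suggestions:
            rank = suggestions.index(expected)
            stats.anywhere += 1
            if rank == 0:
                stats.first += 1
            if rank < 5:
                stats.top5 += 1
    return stats


# Toy stand-in for a real speller: a fixed lookup table (assumption).
toy_speller = {
    "teh": ["the", "ten"],
    "wrod": ["word"],
    "adn": ["ad", "and"],
    "xyzzy": [],
}
typo_list = [("teh", "the"), ("wrod", "word"), ("adn", "and"), ("xyzzy", "fizzy")]
stats = evaluate(typo_list, lambda word: toy_speller.get(word, []))
print(stats)
```

In a real setup the lambda would be replaced by a call into the actual speller, and the counters could be written out as a markdown table.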

The process requires relatively little processing, is reasonably fast using divvunspell, and is thus OK to run on each push as part of a build pipeline (another pipeline use case, @bbqsrc).

What I suggest is something along the following lines:

To avoid circularity (one push triggers the test above, which creates new or updated documents, which are then committed, thus triggering a new run, and so on), there should be a check ensuring that changes to the doc/ dir do not trigger a speller test run (assuming the new or updated documents are stored in the doc/ dir).
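One way to picture the circularity guard is a small check on the list of changed files in a push: if every changed file lives under doc/, skip the speller test. The sketch below assumes the pipeline can supply that list; the function name and the `IGNORED_DIRS` constant are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Top-level directories whose changes should NOT trigger a speller test.
# "doc" follows the assumption in the text that generated reports live there.
IGNORED_DIRS = ("doc",)


def should_run_speller_test(changed_files: Iterable[str]) -> bool:
    """Return True if at least one changed file lies outside the ignored dirs."""
    for name in changed_files:
        parts = PurePosixPath(name).parts
        if not parts or parts[0] not in IGNORED_DIRS:
            return True
    # Either nothing changed or everything changed is generated documentation.
    return False


print(should_run_speller_test(["doc/speller-stats.md"]))
print(should_run_speller_test(["doc/speller-stats.md", "src/fst/stems.lexc"]))
```

If the pipeline ends up on GitHub Actions, the same effect can probably be had declaratively with a `paths-ignore` filter on the push trigger instead of a script.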

When this is in place, then do similar things for other tools in addition to spellers.

snomos commented 3 years ago

Until the pipelines are working, creating the statistics could be a manual process triggered by a separate make target.

flammie commented 3 years ago

The markdown statistics could be stored in the GitHub Pages docs/ dir; I've done this for omorfi, see https://flammie.github.io/omorfi/statistics.html. I hope this can eventually be automated neatly, but I have not found a clean example of such GitHub Pages automation. There are a ton of actions for it, but they all look overly complicated for generating a few pages.

snomos commented 3 years ago

Nice statistics page! Ours does not have to be as detailed and rich as that one, but if it could be automatised one way or another then why not.

snomos commented 3 years ago

See also https://github.com/divvun/divvunspell and the section on speller testing. It presently renders output like the following:

[Image: screenshot of the speller test output, 27.10.2020 at 15:12]

As seen in the screenshot, the speller testing is fast, testing 1315 words in just a fraction more than a second.