snomos opened this issue 4 years ago
LEXICON size: The linguistic answer is: in the stems/ directory, count the number of lines containing a semicolon. These are the lexical entries.
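A minimal sketch of such a count, assuming the stem lexica are lexc files under `src/fst/stems/` (the path and file pattern are assumptions; adjust to the actual repo layout):

```sh
# Count lexical entries: drop lexc comment lines (starting with '!'),
# then count the remaining lines that contain a semicolon, i.e. entry lines.
# The src/fst/stems/ path is an assumption about the repo layout.
grep -hv '^[[:space:]]*!' src/fst/stems/*.lexc | grep -c ';'
```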
Coverage: This can be done automatically. If the goal is to compare language models, the corpus should be a preselected, general-purpose corpus small enough to be available for many languages. We could possibly have production, beta, alpha and experimental coverage, with arbitrary corpora of, say, 100,000, 10,000, 1,000 and 100 words. The respective missing lists should not be corrected as such, but the corpora need not be taken away from ordinary development (the corpora should just give an impressionistic picture). Wikipedia is a special corpus not well suited for coverage, for many reasons: most of our languages do not have it, or have bad versions of it; Wikipedia contains a high percentage of foreign names (French and Turkish municipalities, etc.); the Wikipedia versions are of very different sizes; ...
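A rough token-coverage sketch, assuming an hfst analyser and a one-word-per-line corpus sample (both file names are illustrative); `hfst-lookup` marks unanalysed words with `+?`:

```sh
# Token coverage = share of tokens that receive at least one analysis.
total=$(wc -l < corpus-sample.txt)
# hfst-lookup prints one '+?' line per unknown token
unknown=$(hfst-lookup -q analyser.hfst < corpus-sample.txt | grep -c '+?')
echo "coverage: $(( (total - unknown) * 100 / total ))%"
```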
Download badge: We want to be able to offer downloadable analysers of many different types.
Disambiguators should be measured as well (cf. the literature for ideas). It is possible to just measure the level of disambiguation by counting analyses per word.
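A possible sketch of that measure, assuming VISL CG-3 stream output, where cohorts start with `"<...>"` and readings are tab-indented (the file name is illustrative):

```sh
# Average number of readings per cohort in CG stream output.
awk '/^"</ { cohorts++ } /^\t/ { readings++ }
     END { printf "%.2f analyses per word\n", readings / cohorts }' disambiguated.txt
```

Running the same script on the output before and after disambiguation gives the reduction in ambiguity.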
More on the Coverage badge:
Now that we have all corpus data in GitHub, and preconverted versions in separate repos, it should be easy to automatise this. Something along these lines:
run the analysis in CI and publish the results to the gh-pages branch. Even though we only use the open corpus data, it should be a good approximation, close enough to be informative.
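For the badge itself, one option is the shields.io endpoint-badge JSON format; a simplified sketch (the coverage value and the git plumbing are placeholders):

```sh
# Publish a coverage badge to the gh-pages branch as shields.io
# endpoint JSON; COVERAGE is assumed to come from the analysis step.
COVERAGE=97
git checkout gh-pages
cat > badge-coverage.json <<EOF
{"schemaVersion": 1, "label": "coverage", "message": "${COVERAGE}%", "color": "green"}
EOF
git add badge-coverage.json
git commit -m "Update coverage badge"
git push origin gh-pages
```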
The coverage would be really interesting; for many corpora it can take a lot of processing power and time, though (e.g. on my machine, if I need to analyse or gramcheck the freecorp of sme, `ccat corpus-sme/converted | tools/analyser/modes/...`, it's easily an overnight job). I am thinking that if it's implemented, it would make sense to use the corpus run script to upload several objects, like frequency-sorted missing lists, that can be extracted from the result quite easily.
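Extracting such a list is indeed cheap once the analysis exists; a hedged sketch, assuming `hfst-lookup`-style output where unknown tokens carry `+?` (file names illustrative):

```sh
# Frequency-sorted missing list: keep unknown tokens, take the word form
# (field 1), count duplicates, sort by frequency.
grep '+?' analysed.txt | cut -f1 | sort | uniq -c | sort -rn > missing-freq.txt
```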
Good point. I also had in mind (but forgot to write) that this should be done as a cron-like job in CI (GH Actions can be scheduled like that), so that it is only run once a week or so. And we may want to limit the processing to steps that do not take an unrealistic amount of time.
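A sketch of what such a job's driver could look like, assuming a GH Actions workflow triggered weekly via `on: schedule` (e.g. `cron: '0 3 * * 0'`); the two helper scripts are hypothetical stand-ins for the steps sketched above:

```sh
#!/bin/sh
# Weekly badge-update driver; `timeout` caps each step so no single
# step can take an unrealistic amount of time.
set -e
timeout 2h ./compute-coverage.sh   # hypothetical helper: analyse corpus, compute coverage
timeout 30m ./publish-badges.sh    # hypothetical helper: push badge JSON to gh-pages
```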
`analyse_corpus` is part of CorpusTools; it parallelizes the analysis, and on a laptop with many cores it's done quite fast. Perhaps that's an option?
Yeah, I assume the CI job will be run on taskcluster; what sort of machine is that, do we know?
Presently (in the template) we have two badges. This is the same as in the registry. But we want more in the READMEs!
Here are some ideas:
The overall goal is to give an immediate impression of the quality of the language model, so that potential users know what to expect, how to use it, etc.
Ideas for implementation and other badgeable measures most welcome!