snomos opened this issue 4 years ago
LEXICON size: The linguistic answer is: in the stems/ directory, count the number of lines containing a semicolon. These are the lexical entries.
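A minimal sketch of such a count, assuming the stem lexica are lexc files under `src/fst/stems/` (the path and file pattern are assumptions; adjust to the actual repo layout):

```sh
# Count lexical entries: drop lexc comment lines (starting with '!'),
# then count the remaining lines that contain a semicolon, i.e. entry lines.
# The src/fst/stems/ path is an assumption about the repo layout.
grep -hv '^[[:space:]]*!' src/fst/stems/*.lexc | grep -c ';'
```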
Coverage: This can be done automatically. If the goal is to compare language models, the corpus should be a preselected, general-purpose corpus small enough to be available for many languages. We could possibly have production, beta, alpha and experimental coverage, with arbitrary corpora of, say, 100,000, 10,000, 1,000 and 100 words. The respective missing lists should not be corrected as such, but the corpora need not be taken away from ordinary development (the corpora should just give an impressionistic picture). Wikipedia is a special corpus not well suited for coverage, for many reasons: most of our languages do not have it, or have bad versions of it; Wikipedia contains a high percentage of foreign names (French and Turkish municipalities, etc.); the Wikipedia versions are of very different sizes; ...
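A rough token-coverage sketch, assuming an hfst analyser and a one-word-per-line corpus sample (both file names are illustrative); `hfst-lookup` marks unanalysed words with `+?`:

```sh
# Token coverage = share of tokens that receive at least one analysis.
total=$(wc -l < corpus-sample.txt)
# hfst-lookup prints one '+?' line per unknown token
unknown=$(hfst-lookup -q analyser.hfst < corpus-sample.txt | grep -c '+?')
echo "coverage: $(( (total - unknown) * 100 / total ))%"
```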
Download badge: We want to be able to offer downloadable analysers of many different types.
Disambiguators should be measured as well (cf. the literature for ideas). It is possible to just measure the level of disambiguation by counting analyses per word.
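A possible sketch of that measure, assuming VISL CG-3 stream output, where cohorts start with `"<...>"` and readings are tab-indented (the file name is illustrative):

```sh
# Average number of readings per cohort in CG stream output.
awk '/^"</ { cohorts++ } /^\t/ { readings++ }
     END { printf "%.2f analyses per word\n", readings / cohorts }' disambiguated.txt
```

Running the same script on the output before and after disambiguation gives the reduction in ambiguity.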
More on the Coverage badge:
Now that we have all corpus data in GitHub, and preconverted versions in separate repos, it should be easy to automatise this. Something along these lines:
run the analysis in CI and publish the results to the gh-pages branch. Even though we only use the open corpus data, it should be a good approximation, close enough to be informative.
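For the badge itself, one option is the shields.io endpoint-badge JSON format; a simplified sketch (the coverage value and the git plumbing are placeholders):

```sh
# Publish a coverage badge to the gh-pages branch as shields.io
# endpoint JSON; COVERAGE is assumed to come from the analysis step.
COVERAGE=97
git checkout gh-pages
cat > badge-coverage.json <<EOF
{"schemaVersion": 1, "label": "coverage", "message": "${COVERAGE}%", "color": "green"}
EOF
git add badge-coverage.json
git commit -m "Update coverage badge"
git push origin gh-pages
```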
The coverage would be really interesting; for many corpora it can take a lot of processing power and time, though (e.g. on my machine, if I need to analyse or gramcheck the freecorp of sme, `ccat corpus-sme/converted | tools/analyser/modes/...`, it's easily an overnight job). I am thinking that if it's implemented, it would make sense to use the corpus run script to upload several objects, like frequency-sorted missing lists, that can be extracted from the result quite easily.
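Extracting such a list is indeed cheap once the analysis exists; a hedged sketch, assuming `hfst-lookup`-style output where unknown tokens carry `+?` (file names illustrative):

```sh
# Frequency-sorted missing list: keep unknown tokens, take the word form
# (field 1), count duplicates, sort by frequency.
grep '+?' analysed.txt | cut -f1 | sort | uniq -c | sort -rn > missing-freq.txt
```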
Good point. I also had in mind (but forgot to write) that this should be done as a cron-like job in CI (GH Actions can be scheduled like that), so that it is only run once a week or so. And we may want to limit the processing to steps that do not take an unrealistic amount of time.
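A sketch of what such a job's driver could look like, assuming a GH Actions workflow triggered weekly via `on: schedule` (e.g. `cron: '0 3 * * 0'`); the two helper scripts are hypothetical stand-ins for the steps sketched above:

```sh
#!/bin/sh
# Weekly badge-update driver; `timeout` caps each step so no single
# step can take an unrealistic amount of time.
set -e
timeout 2h ./compute-coverage.sh   # hypothetical helper: analyse corpus, compute coverage
timeout 30m ./publish-badges.sh    # hypothetical helper: push badge JSON to gh-pages
```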
`analyse_corpus` is part of CorpusTools; it parallelizes the analysis, and on a laptop with many cores it's done quite fast. Perhaps that's an option?
Yeah, I assume the CI job will be run on taskcluster; what sort of machine is that, do we know?
Presently (in the template) we have two badges. This is the same as in the registry. But we want more in the READMEs!
Here are some ideas:
The overall goal is to give an immediate impression of the quality of the language model, so that potential users know what to expect, how to use it, etc.
Ideas for implementation and other badgeable measures most welcome!