Create 'analysis' job/script

sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Create 'analysis' job/script #301

Open Enkidu93 opened 6 months ago

Enkidu93 commented 6 months ago

Subtask of https://github.com/sillsdev/serval/issues/258

Add a new job, to be run as part of the on-boarding process, that outputs the following information:

Capture per-book verse counts.
Analyze empty/partial/completed book status for each.
Compare empty/partial/completed book status between vernacular and each potential source (primary, secondary, and back translation).
(opt) General statistics on each text - characters/words, characters/verse, words/verse, characters/token, tokens/verse, unknown token counts.
(opt) Wildebeest character checking report on vernacular and back translation(s).
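The first three items could be sketched roughly as follows. This is only an illustration, not the existing silnlp verse count script: it assumes verses arrive as `(reference, text)` pairs with references like `OBA 1:1`, and that the expected verse total per book is supplied by the caller. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def count_verses_per_book(verses: Iterable[Tuple[str, str]]) -> Counter:
    """Count non-empty verses per book.

    Assumes each reference looks like 'OBA 1:1' (book code, then chapter:verse).
    """
    counts: Counter = Counter()
    for ref, text in verses:
        if text.strip():  # whitespace-only verses count as empty
            book = ref.split(" ", 1)[0]
            counts[book] += 1
    return counts


def book_status(count: int, expected: int) -> str:
    """Classify a book as empty, partial, or completed."""
    if count == 0:
        return "empty"
    if count < expected:
        return "partial"
    return "completed"


def compare_status(
    vernacular: Mapping[str, int],
    source: Mapping[str, int],
    expected: Mapping[str, int],
) -> Dict[str, Tuple[str, str]]:
    """Return (vernacular status, source status) for every expected book."""
    return {
        book: (
            book_status(vernacular.get(book, 0), total),
            book_status(source.get(book, 0), total),
        )
        for book, total in expected.items()
    }
```

Running `compare_status` once per candidate source (primary, secondary, back translation) would give the side-by-side comparison described in the third item.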

Wildebeest will probably be deferred to a later time.

Enkidu93 commented 5 months ago

For now, the general stats will be recorded during preprocessing, since the tokenization info is available there. The main remaining item is the verse counts, so I will extend the existing verse count script rather than write a whole new analysis script. However, once we're in a position to incorporate other information (e.g. Ethnologue data, or script information, whether from Wildebeest or something more naive), it may be best to return to the original idea: an silnlp analysis script with command-line flags that map to a suite of functions, one for each kind of stats. That would be both more convenient and more consistent with regard to the on-boarding process.
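For illustration, the "flags mapping to a suite of functions" idea might look something like the sketch below. Everything here is hypothetical: the flag names (`--verse-counts`, `--stats`, `--scripts`), the placeholder functions, and the `project` argument are assumptions, not options of any existing silnlp script.

```python
import argparse
from typing import Callable, Dict, List


# Placeholder analyses; real ones would read the project's texts.
def verse_counts(project: str) -> str:
    return f"verse counts for {project}"


def general_stats(project: str) -> str:
    return f"general stats for {project}"


def script_info(project: str) -> str:
    return f"script info for {project}"


# Hypothetical flag name -> function that gathers that kind of stats.
ANALYSES: Dict[str, Callable[[str], str]] = {
    "verse-counts": verse_counts,
    "stats": general_stats,
    "scripts": script_info,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run on-boarding analyses")
    parser.add_argument("project", help="Name of the project to analyze")
    for name in ANALYSES:
        parser.add_argument(f"--{name}", action="store_true", help=f"Run the {name} analysis")
    return parser


def run(argv: List[str]) -> List[str]:
    args = build_parser().parse_args(argv)
    # argparse stores '--verse-counts' under the attribute 'verse_counts'.
    selected = [n for n in ANALYSES if getattr(args, n.replace("-", "_"))]
    if not selected:
        selected = list(ANALYSES)  # no flags given: run everything
    return [ANALYSES[n](args.project) for n in selected]
```

Keeping the flag-to-function table in one place means a new kind of stats only needs a new function and one new entry, which fits the consistency goal for on-boarding.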

davidbaines commented 3 months ago

The current --stats option for the preprocessing script has a few issues:

The 'count' and 'project counts' both seem to report the number of verses in common, so we really need source_verses, target_verses, verses_in_common.

The code always reports that none of the scripts are in the model.

The code fails to report the script when there are no verses in common.
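A rough sketch of the reporting these points ask for is below. It is not the preprocessing script's actual code: it assumes each side is a mapping from verse reference to text, and the script detection is a deliberately naive stand-in based on Unicode character names. It does show the three counts kept separate, and the script computed per text so that it is still reported when nothing overlaps.

```python
import unicodedata
from typing import Dict, Mapping, Set


def verse_overlap(source: Mapping[str, str], target: Mapping[str, str]) -> Dict[str, int]:
    """Report source, target, and shared verse counts separately."""
    src_refs = {ref for ref, text in source.items() if text.strip()}
    trg_refs = {ref for ref, text in target.items() if text.strip()}
    return {
        "source_verses": len(src_refs),
        "target_verses": len(trg_refs),
        "verses_in_common": len(src_refs & trg_refs),
    }


def scripts_of(text: str) -> Set[str]:
    """Naive script guess: first word of each letter's Unicode name."""
    found: Set[str] = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split(" ", 1)[0])
    return found


def scripts_per_side(source: Mapping[str, str], target: Mapping[str, str]) -> Dict[str, Set[str]]:
    """Detect scripts from each text on its own, independent of any overlap."""
    return {
        "source": scripts_of(" ".join(source.values())),
        "target": scripts_of(" ".join(target.values())),
    }
```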