Open Enkidu93 opened 6 months ago
For now, the general stats will be recorded during preprocessing, since the tokenization info is present there. Since the main other item is the verse counts, I will extend the existing verse count script rather than write a whole new analysis script. However, once we're in a place to incorporate other information (e.g. Ethnologue data, script information, whether from wildebeest or something more naive, etc.), it may be best to return to the original idea of an silnlp analysis script with command-line flags that map to a suite of functions, one for each kind of stat. That would help both with convenience and with consistency in the on-boarding process.
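To make the "flags mapping to a suite of functions" idea concrete, here is a rough sketch using Python's standard `argparse`. Everything named here is hypothetical: the flag names, the `STAT_COLLECTORS` registry, and the placeholder collector functions are illustrations only and do not correspond to any existing silnlp module or option.

```python
import argparse
from typing import Callable, Dict, List


# Hypothetical collectors: each one would gather a single kind of stat.
def verse_counts(experiment: str) -> str:
    return f"verse counts for {experiment} (placeholder)"


def script_info(experiment: str) -> str:
    return f"script info for {experiment} (placeholder)"


# Registry mapping a flag name to the function that gathers that stat.
STAT_COLLECTORS: Dict[str, Callable[[str], str]] = {
    "verse_counts": verse_counts,
    "script_info": script_info,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Gather on-boarding stats (sketch)")
    parser.add_argument("experiment", help="Experiment name (assumed positional argument)")
    for name in STAT_COLLECTORS:
        # "verse_counts" becomes the flag "--verse-counts"; argparse stores it
        # back under the attribute "verse_counts".
        parser.add_argument(f"--{name.replace('_', '-')}", action="store_true")
    return parser


def run(argv: List[str]) -> List[str]:
    args = build_parser().parse_args(argv)
    # Call only the collectors whose flag was passed on the command line.
    return [
        collect(args.experiment)
        for name, collect in STAT_COLLECTORS.items()
        if getattr(args, name)
    ]


results = run(["demo_experiment", "--verse-counts"])
print(results)
```

Adding a new kind of stat (Ethnologue data, for example) would then mean writing one function and registering it, with no change to the argument parsing.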
The current --stats option for the preprocessing script has a few issues:
The 'count' and 'project counts' both seem to report the number of verses in common, so we really need source_verses, target_verses, verses_in_common.
The code always reports that none of the scripts are in the model.
The code fails to report the script when there are no verses in common.
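As a rough illustration of the three separate counts, here is a minimal sketch. It assumes each side can be reduced to a collection of verse reference strings; the function name `count_verses` and the sample references are hypothetical and are not taken from the existing preprocessing code.

```python
from typing import Dict, Iterable


def count_verses(source_refs: Iterable[str], target_refs: Iterable[str]) -> Dict[str, int]:
    """Report source, target, and shared verse counts as separate numbers."""
    source = set(source_refs)
    target = set(target_refs)
    return {
        "source_verses": len(source),
        "target_verses": len(target),
        # Set intersection: references present on both sides.
        "verses_in_common": len(source & target),
    }


# Made-up references, purely for illustration.
counts = count_verses(
    ["GEN 1:1", "GEN 1:2", "GEN 1:3"],
    ["GEN 1:2", "GEN 1:3", "EXO 1:1", "EXO 1:2"],
)
print(counts)

# With no overlap, the per-side counts are still reported.
disjoint = count_verses(["GEN 1:1"], ["EXO 1:1", "EXO 1:2"])
print(disjoint)
```

Keeping the per-side counts independent of the intersection means a pair with no verses in common still produces useful output, which is the same case where the script is currently not reported.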
Subtask of https://github.com/sillsdev/serval/issues/258
A new job to be run as part of the on-boarding process, including the following information as output:
Wildebeest will probably be deferred to a later time.