unfoldingWord / uw-word-count

https://unfoldingword.github.io/uw-word-count/

Word Count Repository - Non English Languages #4

Open mandolyte opened 4 years ago

mandolyte commented 4 years ago

Handling non-English languages is a known problem. The word count logic depends heavily on regular expressions, which do not work for all languages. There are several cases to consider:

a) Markdown: this one might be easily solved by converting the Markdown to HTML, then using a DOM text accessor such as `textContent` to extract all the text (in any language). The extracted text would then need to be run through the UW tokenizer to split it into individual words.

b) USFM (aligned): the text is already split into individual words.

c) USFM (unaligned): the text is not split, but is readily extractable; it would need to be run through the tokenizer to get the individual words.

d) UTN: with some pre-processing, this can be handled as Markdown.
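To make case (a) concrete, here is a minimal Python sketch of the "extract text from HTML, then tokenize" idea. It is only an illustration under stated assumptions: the Markdown is assumed to have already been converted to HTML by some other tool, and `tokenize` is a hypothetical stand-in based on Unicode character categories, not the actual UW tokenizer. The names `TextExtractor`, `html_to_text`, `tokenize`, and `count_words_in_html` are all invented for this example.

```python
from html.parser import HTMLParser
import unicodedata


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string.

    Text nodes are joined with a space so that words in adjacent
    block elements do not run together. The trade-off is that an
    inline tag in the middle of a word would split that word.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_word_char(ch):
    """Treat Unicode letters (L*), marks (M*), and numbers (N*) as word characters.

    Including marks keeps combining vowel signs attached to their
    base letters, which a plain ASCII-oriented pattern would not do.
    """
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text):
    """Hypothetical stand-in for a tokenizer: split text into runs of word characters."""
    words = []
    current = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def count_words_in_html(html):
    """Count words in HTML that was produced from Markdown (conversion assumed done elsewhere)."""
    return len(tokenize(html_to_text(html)))
```

This sketch still has clear limits: it splits on apostrophes (so "don't" becomes two tokens), and it cannot segment languages written without spaces between words, such as Chinese or Thai. A real tokenizer would need to handle those cases.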

@jag3773 @klappy - Please comment as needed.

jag3773 commented 4 years ago

Noted and understood. Word counts for our English resources are the primary goal here, as they form the basis for what GLs will need to translate. For now, the limitation is acceptable. We can circle back when we need to.